Input Validation & Output Encoding

Code With Boundaries, Production With Confidence

Web risk is rarely mysterious. It usually lies in predictable mistakes that persist under time pressure.

With Input Validation & Output Encoding, the biggest gains come from secure defaults that are automatically enforced in every release.

That makes security less of a separate check after the fact and more of a standard quality of your product.

Immediate measures (15 minutes)

Why this matters

Input validation and output encoding are about reducing risk in practice. Technical background helps you choose the right measures, but what counts is that they are implemented and that you can verify they stay in place.

Core Principles

Validate input, encode output

┌──────────┐     ┌────────────┐     ┌──────────┐     ┌──────────────┐
│  Input   │────▶│ Validation │────▶│ Business │────▶│ Output       │────▶ Output
│ (untrust)│     │ (allowlist)│     │  Logic   │     │ encoding     │
└──────────┘     └────────────┘     └──────────┘     └──────────────┘
  • Input validation is generic: "Is this a valid email address? A number between 1 and 100?"
  • Output encoding is context-specific: "Am I placing this value in HTML, JavaScript, SQL, or a URL?"

Never trust

All input is untrusted. Not just form fields, but also:

  • HTTP headers (Host, Referer, User-Agent, X-Forwarded-For)
  • Cookies
  • URL parameters and path segments
  • Filenames in uploads
  • API responses from external services
  • Database content (may have been injected earlier)
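
As a minimal sketch of treating a header as untrusted input, the function below checks a Host header against a fixed allowlist before the value is used anywhere else. The name validate_host_header and the hosts in ALLOWED_HOSTS are assumptions for illustration, not part of any framework.

```python
# Hypothetical allowlist: replace with the hostnames your application serves
ALLOWED_HOSTS = frozenset({"example.com", "www.example.com"})

def validate_host_header(raw_host: str) -> str:
    """Return the normalized hostname if it is on the allowlist, else raise."""
    host = raw_host.strip().lower()
    # Split off an optional port and make sure it is plain ASCII digits
    hostname, _, port = host.partition(":")
    if port and not (port.isascii() and port.isdigit()):
        raise ValueError("Invalid port in Host header")
    if hostname not in ALLOWED_HOSTS:
        raise ValueError("Unexpected Host header")
    return hostname
```

The same pattern (normalize, compare against a known set, reject everything else) applies to the other sources in the list.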

Input Validation

Allowlist over blocklist

# WRONG — blocklist: try to block known bad patterns
def sanitize_input(value):
    blacklist = ['<script>', 'DROP TABLE', '../', ';']
    for bad in blacklist:
        value = value.replace(bad, '')
    return value  # Endlessly bypassable

# RIGHT — allowlist: define what IS allowed
import re

def validate_username(value):
    if not re.fullmatch(r'[a-zA-Z0-9_]{3,30}', value):
        raise ValueError("Invalid username")
    return value

A blocklist is a race you always lose. There are infinitely many ways to encode malicious input. An allowlist defines the finite set of valid values.
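
To make the bypass concrete, here is a small sketch that runs the blocklist function from the WRONG example against three inputs. The payloads are illustrative assumptions; real attackers have many more variants.

```python
def sanitize_input(value: str) -> str:
    # Same blocklist approach as the WRONG example
    blacklist = ['<script>', 'DROP TABLE', '../', ';']
    for bad in blacklist:
        value = value.replace(bad, '')
    return value

# Nested payload: removing the inner match reassembles the outer one
nested = sanitize_input('<scr<script>ipt>alert(1)</scr<script>ipt>')

# Case variation: the blocklist only knows the lowercase form
upper = sanitize_input('<SCRIPT>alert(1)</SCRIPT>')

# Nested traversal sequence: the leftover characters form '../' again
traversal = sanitize_input('....//etc/passwd')

print(nested)     # <script>alert(1)</script>
print(upper)      # <SCRIPT>alert(1)</SCRIPT>
print(traversal)  # ../etc/passwd
```

All three inputs come out of the "sanitizer" as working payloads, which is why the allowlist version rejects instead of repairing.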

Type, range, and format

# Type validation
def validate_age(value):
    age = int(value)          # Raises ValueError if the string is not a whole number
    if not 0 <= age <= 150:   # Range check
        raise ValueError("Age out of range")
    return age

# Format validation with regex
import re

def validate_dutch_postcode(value):
    if not re.fullmatch(r'\d{4}\s?[A-Z]{2}', value):
        raise ValueError("Invalid postal code")
    return value.replace(' ', '')  # Normalize to '1234AB'

# Email: use a library, don't write your own regex
from email_validator import validate_email

def validate_email_address(value):
    result = validate_email(value)
    return result.normalized
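
One detail worth knowing when validating types in Python 3: both int() and the regex class \d accept decimal digits from other scripts, not only 0-9. The sketch below is one way to restrict a numeric field to ASCII digits; the function name validate_age_strict is an assumption for illustration.

```python
import re

def validate_age_strict(value: str) -> int:
    """Accept only 1-3 ASCII digits, then apply the range check."""
    # re.ASCII restricts \d to 0-9. Without the flag, \d also matches other
    # Unicode decimal digits, which int() would convert without complaint.
    if not re.fullmatch(r'\d{1,3}', value, re.ASCII):
        raise ValueError("Age must be a whole number")
    age = int(value)
    if not 0 <= age <= 150:
        raise ValueError("Age out of range")
    return age
```

Whether non-ASCII digits should be accepted is a product decision; the point is to make that decision explicitly rather than inherit it from the default.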

Unicode normalization

Unicode offers multiple representations for the same character. Without normalization, identical-looking strings can be different:

import re
import unicodedata

# 'café' can be encoded in two ways: precomposed 'é', or 'e' plus a combining accent
nfc = unicodedata.normalize('NFC', user_input)   # Composed: é as a single code point
nfkc = unicodedata.normalize('NFKC', user_input) # Compatibility: ligature ﬁ → fi

# Normalize BEFORE validation
def validate_name(value):
    value = unicodedata.normalize('NFC', value)
    if not re.fullmatch(r'[\w\s\-]{1,100}', value, re.UNICODE):
        raise ValueError("Invalid name")
    return value

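Rule: Normalize Unicode before you validate, and validate before you store. NFKC folds width variants and ligatures into their plain forms; NFC does not. Neither form maps homoglyphs from different scripts onto each other (Cyrillic а stays distinct from Latin a), so where look-alikes matter, restrict the allowed scripts in your allowlist as well.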
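
A short sketch of what the normalization forms do and do not change, using only the standard library. The sample strings are assumptions chosen for illustration. Note that normalization alone does not merge look-alike characters from different scripts.

```python
import unicodedata

decomposed = 'cafe\u0301'      # 'e' followed by a combining acute accent
composed = 'caf\u00e9'         # precomposed 'é'
assert decomposed != composed  # they render identically but compare unequal

# NFC makes the two representations equal
nfc_equal = unicodedata.normalize('NFC', decomposed) == composed

# NFKC additionally folds compatibility characters
fullwidth_folded = unicodedata.normalize('NFKC', '\uff41\uff44\uff4d\uff49\uff4e')  # fullwidth "admin"
ligature_folded = unicodedata.normalize('NFKC', '\ufb01')                           # 'ﬁ' ligature

# Look-alikes from another script survive every normalization form
cyrillic_a = unicodedata.normalize('NFKC', '\u0430')  # Cyrillic 'а'

print(nfc_equal, fullwidth_folded, ligature_folded, cyrillic_a == 'a')
# True admin fi False
```

If usernames must be unique to the human eye, an ASCII-only allowlist like the one in validate_username is the simplest way to rule out the last case.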

Length limitation

Always limit the length of input. This prevents:

  • Buffer overflows
  • ReDoS (Regular Expression Denial of Service)
  • Database overflow
  • Resource exhaustion

MAX_COMMENT_LENGTH = 5000

def validate_comment(value):
    if len(value) > MAX_COMMENT_LENGTH:
        raise ValueError(f"Comment too long (max {MAX_COMMENT_LENGTH} characters)")
    return value.strip()
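
len() on a Python str counts code points, while storage limits are often expressed in bytes. As a sketch, assuming a hypothetical byte budget of 8000 for the column that stores the comment, both limits can be checked together:

```python
MAX_COMMENT_CHARS = 5000
MAX_COMMENT_BYTES = 8000  # assumed storage limit, for illustration only

def validate_comment_size(value: str) -> str:
    """Enforce a character limit and a UTF-8 byte limit."""
    value = value.strip()
    if len(value) > MAX_COMMENT_CHARS:
        raise ValueError(f"Comment too long (max {MAX_COMMENT_CHARS} characters)")
    # A single character can take up to 4 bytes in UTF-8
    if len(value.encode('utf-8')) > MAX_COMMENT_BYTES:
        raise ValueError(f"Comment too large (max {MAX_COMMENT_BYTES} bytes)")
    return value
```

A comment of 3000 emoji passes the character check but is 12000 bytes in UTF-8, so only the second check catches it.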

Output encoding per context

The correct encoding depends on where you place the data. This is the most critical lesson: there is no universal sanitize function.

Context matrix

| Output context | Encoding method | Example |
|---|---|---|
| HTML body | HTML entity encoding | < → &lt; |
| HTML attribute | HTML entity encoding + quotes | " → &quot; |
| JavaScript string | JavaScript string escaping | ' → \', newline → \n |
| URL parameter | Percent-encoding | space → %20, & → %26 |
| CSS value | CSS escaping | \ → \\, ( → \28 |
| SQL query | Parameterized queries | No encoding: use placeholders |
| JSON | JSON serialization | Use json.dumps(), never string concatenation |
| Command line | No encoding: use arrays | No shell, pass args as list |
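
To illustrate why one sanitize function cannot serve every row of the matrix, the sketch below encodes the same value for three contexts with the standard library. The sample value is an assumption for illustration.

```python
import html
import json
from urllib.parse import quote

value = 'Tom & "Jerry" <b>'

as_html = html.escape(value)     # HTML body or quoted attribute
as_url = quote(value, safe='')   # a single URL parameter value
as_json = json.dumps(value)      # a JSON string literal

print(as_html)  # Tom &amp; &quot;Jerry&quot; &lt;b&gt;
print(as_url)   # Tom%20%26%20%22Jerry%22%20%3Cb%3E
print(as_json)  # "Tom & \"Jerry\" <b>"
```

Three contexts give three different results, and none of them is safe in either of the other two.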

HTML entity encoding

# Python: standard library
import html

user_input = '<script>alert("XSS")</script>'
safe = html.escape(user_input)
# &lt;script&gt;alert(&quot;XSS&quot;)&lt;/script&gt;

// Java: OWASP Java Encoder
import org.owasp.encoder.Encode;

String safe = Encode.forHtml(userInput);
String safeAttr = Encode.forHtmlAttribute(userInput);
String safeJs = Encode.forJavaScript(userInput);

// JavaScript (server-side Node.js)
const he = require('he');

const safe = he.encode(userInput);

// PHP
$safe = htmlspecialchars($userInput, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// C#: System.Text.Encodings.Web
using System.Text.Encodings.Web;

string safe = HtmlEncoder.Default.Encode(userInput);
string safeJs = JavaScriptEncoder.Default.Encode(userInput);
string safeUrl = UrlEncoder.Default.Encode(userInput);

JavaScript string escaping

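# Never this:
f"var name = '{user_input}';"  # XSS via '; alert(1); //

# Do this instead: serialize as JSON, then neutralize the characters that
# could close an inline <script> block (json.dumps leaves "<" untouched)
import json
js_literal = (json.dumps(user_input)
              .replace("<", "\\u003c")
              .replace(">", "\\u003e")
              .replace("&", "\\u0026"))
f"var name = {js_literal};"  # Safe inside an inline <script> block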

URL encoding

from urllib.parse import quote, urlencode

# Single parameter (safe='' also encodes '/', which quote() leaves alone by default)
safe_param = quote(user_input, safe='')

# Multiple parameters
params = urlencode({'search': user_input, 'page': '1'})
url = f"https://example.com/search?{params}"
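
Percent-encoding protects a parameter value, but it does not help when the user supplies the whole URL, for example for a link in a profile. A minimal sketch of a scheme allowlist for that case follows; the function name is_safe_link and the choice of allowed schemes are assumptions for illustration.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}

def is_safe_link(url: str) -> bool:
    """Allow only absolute http(s) URLs that name a host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises on malformed input such as an unclosed IPv6 literal
        return False
    # urlsplit lowercases the scheme, so 'JaVaScRiPt:' is caught as well
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)
```

A value that passes this check still needs HTML attribute encoding when it is placed in an href.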

SQL — always parameterized queries

# Python
# WRONG: string concatenation
cursor.execute(f"SELECT * FROM users WHERE name = '{name}'")

# RIGHT: parameterized
cursor.execute("SELECT * FROM users WHERE name = %s", (name,))

// Java
// RIGHT: PreparedStatement
PreparedStatement stmt = conn.prepareStatement(
    "SELECT * FROM users WHERE name = ?");
stmt.setString(1, name);

// C#
// RIGHT: SqlParameter
using var cmd = new SqlCommand(
    "SELECT * FROM users WHERE name = @name", conn);
cmd.Parameters.AddWithValue("@name", name);
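
The difference can be observed with an in-memory SQLite database. This is a sketch under assumptions: the table, rows, and payload are made up, and sqlite3 uses ? as its placeholder where the drivers above use %s. The second half shows a case placeholders cannot cover: column names cannot be bound as parameters, so they go through an allowlist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "alice@example.com"), ("bob", "bob@example.com")],
)

payload = "x' OR '1'='1"

# Concatenation: the payload rewrites the WHERE clause and returns every row
leaked = conn.execute(
    f"SELECT name FROM users WHERE name = '{payload}'"
).fetchall()

# Placeholder: the payload is compared as a literal string and matches nothing
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (payload,)
).fetchall()

# Identifiers cannot be bound, so map user input onto known column names
SORTABLE = {"name": "name", "email": "email"}

def list_users(sort_by: str):
    column = SORTABLE.get(sort_by)
    if column is None:
        raise ValueError("Unsupported sort column")
    return conn.execute(f"SELECT name FROM users ORDER BY {column}").fetchall()

print(len(leaked), len(safe))  # 2 0
```

The f-string in list_users is acceptable only because column can never be anything other than a value from SORTABLE.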

JSON serialization

import json

# WRONG — manual construction
response = '{"name": "' + user_input + '"}'

# RIGHT — json.dumps escapes automatically
response = json.dumps({"name": user_input})
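
What goes wrong with manual construction is easiest to see by parsing the result. In this sketch the payload and the admin key are assumptions for illustration.

```python
import json

user_input = 'eve", "admin": true, "x": "'

# Manual construction: the quotes in the input close the string early
manual = '{"name": "' + user_input + '"}'
parsed_manual = json.loads(manual)

# json.dumps escapes the quotes, so the input stays a single value
serialized = json.dumps({"name": user_input})
parsed_safe = json.loads(serialized)

print(parsed_manual)      # {'name': 'eve', 'admin': True, 'x': ''}
print(list(parsed_safe))  # ['name']
```

The manually built document gains an attacker-controlled key; the serialized one round-trips to exactly the value that went in.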

Command line — never shell=True

import subprocess

# WRONG — command injection via shell
subprocess.run(f"convert {filename} output.png", shell=True)

# RIGHT — arguments as list, no shell
subprocess.run(["convert", filename, "output.png"])
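
The following sketch shows that a list argument reaches the child process as one literal string, shell metacharacters included. To stay runnable without ImageMagick it uses the Python interpreter itself as the child process; the payload is an assumption for illustration.

```python
import subprocess
import sys

payload = "photo.png; echo INJECTED"

# The child is a one-line Python script that prints its first argument.
# No shell is involved, so ';' has no special meaning.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", payload],
    capture_output=True,
    text=True,
    check=True,
)
echoed = result.stdout.strip()
print(echoed)  # photo.png; echo INJECTED
```

A list does not stop a value that starts with a hyphen from being parsed as an option by the called program, so the argument still needs validation before it is passed on.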

Libraries per language

| Language | Library | Functionality |
|---|---|---|
| JavaScript | DOMPurify | HTML sanitization (client-side) |
| JavaScript | he | HTML entity encode/decode |
| Python | bleach | HTML sanitization (server-side; the project has been deprecated since 2023) |
| Python | html.escape | Basic HTML escaping |
| Python | markupsafe | Escaping primitives behind Jinja2 auto-escaping |
| Java | OWASP Java Encoder | Context-specific encoding |
| Java | jsoup | HTML sanitization + parsing |
| Go | html/template | Auto-escaping templates |
| Go | bluemonday | HTML sanitization |
| C# | HtmlSanitizer | HTML sanitization |
| C# | System.Text.Encodings.Web | HTML/JS/URL encoding |
| PHP | htmlspecialchars | HTML escaping (built-in) |
| PHP | HTMLPurifier | HTML sanitization |

DOMPurify (JavaScript, client-side)

// HTML sanitization with DOMPurify
const clean = DOMPurify.sanitize(userInput);

// With configuration: only allow certain tags
const cleanRestricted = DOMPurify.sanitize(userInput, {
  ALLOWED_TAGS: ['b', 'i', 'em', 'strong', 'a', 'p', 'br'],
  ALLOWED_ATTR: ['href'],
});

bleach (Python, server-side)

import bleach

# Basic sanitization
clean = bleach.clean(user_input)

# With allowlist
clean = bleach.clean(
    user_input,
    tags=['b', 'i', 'em', 'strong', 'a', 'p', 'br', 'ul', 'ol', 'li'],
    attributes={'a': ['href', 'title']},
    protocols=['https'],
)

Pitfalls

Double encoding

import html

user_input = "&lt;script&gt;"   # Already encoded once, earlier in the pipeline
html.escape(user_input)         # Encoded a second time at output
# Result: "&amp;lt;script&amp;gt;" (double encoded; the browser displays &lt;script&gt;)

Solution: encode at one place, as late as possible (at the output).

Template engines and auto-escaping

Most modern template engines escape automatically:

| Template engine | Auto-escape by default? | Bypass syntax / manual escaping |
|---|---|---|
| Jinja2 (Flask) | Yes | {{ value\|safe }} or {% autoescape false %} |
| Django templates | Yes | {{ value\|safe }} or {% autoescape off %} |
| Go html/template | Yes | template.HTML(value) |
| Thymeleaf (Java) | Yes | th:utext (unescaped) |
| Razor (C#) | Yes | @Html.Raw(value) |
| ERB (Ruby) | No in plain ERB; yes in Rails views | Plain ERB: escape manually with h(). Rails: raw() or html_safe bypass escaping |
| PHP | No | Manual htmlspecialchars() |

Rule: Use |safe, Raw(), utext and similar bypass mechanisms only on values that you generated yourself or have already sanitized. Never on user input.

Mixed contexts

<!-- DANGEROUS — JavaScript in an HTML attribute -->
<a href="#" onclick="doSomething('{{ user_input }}')">Click</a>

Here you are in two contexts simultaneously: HTML attribute and JavaScript. You must first JavaScript-escape, then HTML-attribute-escape. This is error-prone and should be avoided. Use this instead:

<a href="#" id="action-link" data-value="{{ user_input }}">Click</a>
<script>
  document.getElementById('action-link').addEventListener('click', function() {
    doSomething(this.dataset.value);
  });
</script>
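
For code that builds this markup outside a template engine, the same idea can be sketched in Python: the value lands in exactly one context (a quoted HTML attribute) and gets exactly one encoding. The function name render_action_link is an assumption for illustration.

```python
import html

def render_action_link(user_input: str) -> str:
    """Build the link with the value in a double-quoted data attribute."""
    # quote=True (the default) also escapes " and ', so the value cannot
    # break out of the attribute
    safe_value = html.escape(user_input, quote=True)
    return f'<a href="#" id="action-link" data-value="{safe_value}">Click</a>'

print(render_action_link('"><script>alert(1)</script>'))
# <a href="#" id="action-link" data-value="&quot;&gt;&lt;script&gt;alert(1)&lt;/script&gt;">Click</a>
```

The browser decodes the attribute when the script reads this.dataset.value, so doSomething receives the original string as data, never as code.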

Validation at system boundaries

API endpoints

from pydantic import BaseModel, Field, field_validator

class CreateUserRequest(BaseModel):
    username: str = Field(min_length=3, max_length=30, pattern=r'^[a-zA-Z0-9_]+$')
    email: str = Field(max_length=254)
    age: int = Field(ge=18, le=150)

    # Pydantic v2 API; v1 used @validator and Field(regex=...)
    @field_validator('email')
    @classmethod
    def validate_email(cls, v):
        # Minimal structural check only; prefer a dedicated library for real validation
        local, sep, domain = v.rpartition('@')
        if not sep or not local or '.' not in domain:
            raise ValueError('Invalid email address')
        return v.lower()

# 'app' is assumed to be a FastAPI application
@app.post('/api/users')
def create_user(data: CreateUserRequest):
    # data is validated by Pydantic
    ...

Database layer

# Limit query lengths
MAX_SEARCH_LENGTH = 200

def search_products(query: str):
    query = query[:MAX_SEARCH_LENGTH].strip()
    return db.execute(
        "SELECT * FROM products WHERE name LIKE %s LIMIT 50",
        (f"%{query}%",)
    )
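
The placeholder prevents SQL injection, but % and _ inside the search term still act as LIKE wildcards. A sketch of one way to make them match literally, assuming the query is extended with an ESCAPE clause; the helper name escape_like and the choice of ! as escape character are assumptions for illustration.

```python
def escape_like(term: str, escape_char: str = "!") -> str:
    """Escape LIKE wildcards so they match literally."""
    return (
        term.replace(escape_char, escape_char * 2)  # escape the escape character first
            .replace("%", escape_char + "%")
            .replace("_", escape_char + "_")
    )

# Intended use with the query shown in search_products:
#   "SELECT * FROM products WHERE name LIKE %s ESCAPE '!' LIMIT 50"
#   (f"%{escape_like(query)}%",)

print(escape_like("100%_off!"))  # 100!%!_off!!
```

Without this step a search for a single % matches every row, which can turn a cheap lookup into a full table scan.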

File system

import os
import uuid

UPLOAD_DIR = '/var/www/uploads'

def safe_save(filename: str, content: bytes):
    # Remove path components
    filename = os.path.basename(filename)

    # Allowlist file extensions
    allowed_ext = {'.pdf', '.png', '.jpg', '.docx'}
    _, ext = os.path.splitext(filename)
    if ext.lower() not in allowed_ext:
        raise ValueError(f"File type {ext} not allowed")

    # Generate a safe filename
    safe_name = f"{uuid.uuid4().hex}{ext.lower()}"

    # Verify that the path stays within UPLOAD_DIR.
    # commonpath compares whole path components, so a sibling directory such as
    # /var/www/uploads_evil does not pass the way it would with startswith().
    upload_root = os.path.realpath(UPLOAD_DIR)
    full_path = os.path.realpath(os.path.join(upload_root, safe_name))
    if os.path.commonpath([upload_root, full_path]) != upload_root:
        raise ValueError("Path traversal detected")

    with open(full_path, 'wb') as f:
        f.write(content)
    return safe_name
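
An extension allowlist says nothing about what the file actually contains. As a rough additional check, the leading bytes can be compared with the signature expected for the extension. This is a sketch, not a replacement for a real file-type library or malware scanning; the names SIGNATURES and content_matches_extension are assumptions for illustration.

```python
# Leading bytes ("magic numbers") for the extensions allowed in safe_save
SIGNATURES = {
    '.pdf': (b'%PDF-',),
    '.png': (b'\x89PNG\r\n\x1a\n',),
    '.jpg': (b'\xff\xd8\xff',),
    '.docx': (b'PK\x03\x04',),  # DOCX is a ZIP container
}

def content_matches_extension(ext: str, content: bytes) -> bool:
    """Return True if the content starts with a signature known for ext."""
    prefixes = SIGNATURES.get(ext.lower())
    if prefixes is None:
        return False
    # bytes.startswith accepts a tuple of candidate prefixes
    return content.startswith(prefixes)
```

A matching signature only shows that the file starts like the claimed type; a crafted file can still carry a valid header.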

CLI parameters

import re
import subprocess

# WRONG: shell injection
def run_tool(target):
    subprocess.run(f"nmap {target}", shell=True)

# RIGHT: arguments as list
def run_tool(target):
    # Validate first; (?!-) rejects a leading hyphen so the value
    # cannot be interpreted as an nmap option
    if not re.fullmatch(r'(?!-)[\w.\-:]+', target):
        raise ValueError("Invalid target")
    subprocess.run(["nmap", target])
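
A stricter variant validates the target by type instead of by character class: either an IP address literal or a plain hostname. This is a sketch under assumptions; the function name validate_scan_target and the hostname pattern are illustrative and deliberately reject CIDR ranges and other nmap target syntax.

```python
import ipaddress
import re

HOSTNAME_RE = re.compile(
    r'(?=.{1,253}\Z)'                                        # overall length limit
    r'[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?'         # first label
    r'(?:\.[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*'  # further labels
)

def validate_scan_target(target: str) -> str:
    """Accept an IPv4/IPv6 literal or a plain hostname, nothing else."""
    try:
        # ip_address raises ValueError for anything that is not an IP literal
        return str(ipaddress.ip_address(target))
    except ValueError:
        pass
    if not HOSTNAME_RE.fullmatch(target):
        raise ValueError("Invalid target")
    return target.lower()
```

Because every label must start with a letter or digit, a value with a leading hyphen can never pass as a hostname.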

Checklist

| Measure | Description | Priority |
|---|---|---|
| Allowlist validation | Define what is allowed, block the rest | Critical |
| Type and range checks | Number is number, date is date | Critical |
| Length limitation | Maximum length on all input fields | High |
| Unicode normalization | NFC/NFKC before validation | High |
| Parameterized queries | Never string concatenation in SQL | Critical |
| Template auto-escaping | Make sure it's enabled and don't bypass unnecessarily | Critical |
| Context-specific encoding | Use the correct encoding per sink | Critical |
| Filename sanitization | os.path.basename() + allowlist extensions | High |
| Command arguments as list | subprocess.run(["cmd", arg]), never shell=True | Critical |
| API schema validation | Pydantic, JSON Schema, or equivalent | High |

It's actually quite simple. You need two rules. Two.

Rule one: trust nothing that comes from outside. Not the form field, not the URL, not the header, not the cookie, not the file, not the API response from the "trusted partner" whose system you pentested last year and that had three critical SQL injections at the time.

Rule two: when data leaves your system — to the browser, the database, the file system, the command line — encode it for that specific context. HTML in HTML, JavaScript in JavaScript, SQL via parameters.

Two rules. That's it. And yet SQL injection and XSS have existed for more than twenty-five years. We haven't solved them. We haven't even reduced them. They're still in the OWASP Top 10. They were in the first OWASP Top 10, in 2003. Twenty-three years ago.

The solution is known. The tools exist. The libraries are free. The documentation is excellent. But somewhere between "we know how to do it" and "we actually do it" there's a gap so wide that you could fit a data center in it. And in that gap there's a Post-it note with "TODO: input sanitization" that's been there since the first sprint.

Summary

Input validation and output encoding are the fundamental defenses against injection attacks. Validate with allowlists, restrict types and lengths, and normalize Unicode before processing. Encode every output for the specific context: HTML entities for HTML, parameterized queries for SQL, lists for command-line arguments.

In the next chapter, we cover the transport layer: how do you configure TLS so that all that carefully validated and encoded data also travels securely over the network?
