Input Validation & Output Encoding
Code With Boundaries, Production With Confidence
Web risk is rarely mysterious. It usually lies in predictable mistakes that persist under time pressure.
With Input Validation & Output Encoding, the biggest gains come from secure defaults that are automatically enforced in every release.
That makes security less of a separate check after the fact and more of a standard quality of your product.
Immediate measures (15 minutes)
Why this matters
The core of Input Validation & Output Encoding is risk reduction in practice. Technical context supports the measure selection, but implementation and assurance are central.
Core Principles
Validate input, encode output
┌──────────┐ ┌────────────┐ ┌──────────┐ ┌──────────────┐
│ Input │────▶│ Validation │────▶│ Business │────▶│ Output │────▶ Output
│ (untrust)│ │ (allowlist)│ │ Logic │ │ encoding │
└──────────┘ └────────────┘ └──────────┘ └──────────────┘
- Input validation is generic: "Is this a valid email address? A number between 1 and 100?"
- Output encoding is context-specific: "Am I placing this value in HTML, JavaScript, SQL, or a URL?"
Never trust
All input is untrusted. Not just form fields, but also:
- HTTP headers (Host, Referer, User-Agent, X-Forwarded-For)
- Cookies
- URL parameters and path segments
- Filenames in uploads
- API responses from external services
- Database content (may have been injected earlier)
Input Validation
Allowlist over blocklist
# WRONG — blocklist: try to block known bad patterns
def sanitize_input(value):
blacklist = ['<script>', 'DROP TABLE', '../', ';']
for bad in blacklist:
value = value.replace(bad, '')
return value # Endlessly bypassable
# RIGHT — allowlist: define what IS allowed
import re
def validate_username(value):
if not re.fullmatch(r'[a-zA-Z0-9_]{3,30}', value):
raise ValueError("Invalid username")
return valueA blocklist is a race you always lose. There are infinitely many ways to encode malicious input. An allowlist defines the finite set of valid values.
Type, range, and format
# Type validation
def validate_age(value):
age = int(value) # TypeError if not a number
if not 0 <= age <= 150: # Range check
raise ValueError("Age out of range")
return age
# Format validation with regex
import re
def validate_dutch_postcode(value):
if not re.fullmatch(r'\d{4}\s?[A-Z]{2}', value):
raise ValueError("Invalid postal code")
return value.replace(' ', '') # Normalize to '1234AB'
# Email: use a library, don't write your own regex
from email_validator import validate_email
def validate_email_address(value):
result = validate_email(value)
return result.normalizedUnicode normalization
Unicode offers multiple representations for the same character. Without normalization, identical-looking strings can be different:
import unicodedata
# 'café' can be encoded in two ways:
nfc = unicodedata.normalize('NFC', user_input) # Composed: é
nfkc = unicodedata.normalize('NFKC', user_input) # Compatible: fi → fi
# Normalize BEFORE validation
def validate_name(value):
value = unicodedata.normalize('NFC', value)
if not re.fullmatch(r'[\w\s\-]{1,100}', value, re.UNICODE):
raise ValueError("Invalid name")
return valueRule: Normalize Unicode before you validate, and validate before you store. This prevents bypasses via homoglyphs (Cyrillic a vs Latin a) and width variants.
Length limitation
Always limit the length of input. This prevents:
- Buffer overflows
- ReDoS (Regular Expression Denial of Service)
- Database overflow
- Resource exhaustion
MAX_COMMENT_LENGTH = 5000
def validate_comment(value):
if len(value) > MAX_COMMENT_LENGTH:
raise ValueError(f"Comment too long (max {MAX_COMMENT_LENGTH} characters)")
return value.strip()Output encoding per context
The correct encoding depends on where you place the data. This is the most critical lesson: there is no universal sanitize function.
Context matrix
| Output context | Encoding method | Example |
|---|---|---|
| HTML body | HTML entity encoding | < → < |
| HTML attribute | HTML entity encoding + quotes | " → " |
| JavaScript string | JavaScript string escaping | ' → \', \n →
\\n |
| URL parameter | Percent-encoding | → %20, & →
%26 |
| CSS value | CSS escaping | \ → \\, ( →
\28 |
| SQL query | Parameterized queries | No encoding — use placeholders |
| JSON | JSON serialization | Use json.dumps(), never string concatenation |
| Command line | No encoding — use arrays | No shell, pass args as list |
HTML entity encoding
# Python — standard library
import html
user_input = '<script>alert("XSS")</script>'
safe = html.escape(user_input)
# <script>alert("XSS")</script>// Java — OWASP Java Encoder
import org.owasp.encoder.Encode;
String safe = Encode.forHtml(userInput);
String safeAttr = Encode.forHtmlAttribute(userInput);
String safeJs = Encode.forJavaScript(userInput);// C# — System.Text.Encodings.Web
using System.Text.Encodings.Web;
string safe = HtmlEncoder.Default.Encode(userInput);
string safeJs = JavaScriptEncoder.Default.Encode(userInput);
string safeUrl = UrlEncoder.Default.Encode(userInput);JavaScript string escaping
# Never this:
f"var name = '{user_input}';" # XSS via '; alert(1); //
# Do this instead:
import json
f"var name = {json.dumps(user_input)};" # Safely escapedURL encoding
from urllib.parse import quote, urlencode
# Single parameter
safe_param = quote(user_input)
# Multiple parameters
params = urlencode({'search': user_input, 'page': '1'})
url = f"https://example.com/search?{params}"SQL — always parameterized queries
# WRONG — string concatenation
cursor.execute(f"SELECT * FROM users WHERE name = '{name}'")
# RIGHT — parameterized
cursor.execute("SELECT * FROM users WHERE name = %s", (name,))// RIGHT — PreparedStatement
PreparedStatement stmt = conn.prepareStatement(
"SELECT * FROM users WHERE name = ?");
stmt.setString(1, name);// RIGHT — SqlParameter
using var cmd = new SqlCommand(
"SELECT * FROM users WHERE name = @name", conn);
cmd.Parameters.AddWithValue("@name", name);JSON serialization
import json
# WRONG — manual construction
response = '{"name": "' + user_input + '"}'
# RIGHT — json.dumps escapes automatically
response = json.dumps({"name": user_input})Command line — never shell=True
import subprocess
# WRONG — command injection via shell
subprocess.run(f"convert {filename} output.png", shell=True)
# RIGHT — arguments as list, no shell
subprocess.run(["convert", filename, "output.png"])Libraries per language
| Language | Library | Functionality |
|---|---|---|
| JavaScript | DOMPurify | HTML sanitization (client-side) |
| JavaScript | he | HTML entity encode/decode |
| Python | bleach | HTML sanitization (server-side) |
| Python | html.escape | Basic HTML escaping |
| Python | markupsafe | Jinja2 auto-escaping |
| Java | OWASP Java Encoder | Context-specific encoding |
| Java | jsoup | HTML sanitization + parsing |
| Go | html/template | Auto-escaping templates |
| Go | bluemonday | HTML sanitization |
| C# | HtmlSanitizer | HTML sanitization |
| C# | System.Text.Encodings.Web | HTML/JS/URL encoding |
| PHP | htmlspecialchars | HTML escaping (built-in) |
| PHP | HTMLPurifier | HTML sanitization |
DOMPurify (JavaScript, client-side)
// HTML sanitization with DOMPurify
const clean = DOMPurify.sanitize(userInput);
// With configuration — only allow certain tags
const clean = DOMPurify.sanitize(userInput, {
ALLOWED_TAGS: ['b', 'i', 'em', 'strong', 'a', 'p', 'br'],
ALLOWED_ATTR: ['href'],
});bleach (Python, server-side)
import bleach
# Basic sanitization
clean = bleach.clean(user_input)
# With allowlist
clean = bleach.clean(
user_input,
tags=['b', 'i', 'em', 'strong', 'a', 'p', 'br', 'ul', 'ol', 'li'],
attributes={'a': ['href', 'title']},
protocols=['https'],
)Pitfalls
Double encoding
# user_input = "<script>"
# First time: already encoded
html.escape(user_input)
# Result: "&lt;script&gt;" — double encoded, visible as <script>Solution: encode at one place, as late as possible (at the output).
Template engines and auto-escaping
Most modern template engines escape automatically:
| Template engine | Auto-escape by default? | Bypass syntax |
|---|---|---|
| Jinja2 (Flask) | Yes | {{ value\|safe }} or
{% autoescape false %} |
| Django templates | Yes | {{ value\|safe }} or
{% autoescape off %} |
| Go html/template | Yes | template.HTML(value) |
| Thymeleaf (Java) | Yes | th:utext (unescaped) |
| Razor (C#) | Yes | @Html.Raw(value) |
| ERB (Ruby) | No (default) | <%= value %> escaped with h() |
| PHP | No | Manual htmlspecialchars() |
Rule: Use
|safe,Raw(),utextand similar bypass mechanisms only on values that you generated yourself or have already sanitized. Never on user input.
Mixed contexts
<!-- DANGEROUS — JavaScript in an HTML attribute -->
<a href="#" onclick="doSomething('{{ user_input }}')">Click</a>Here you are in two contexts simultaneously: HTML attribute and JavaScript. You must first JavaScript-escape, then HTML-attribute-escape. This is error-prone and should be avoided. Use this instead:
<a href="#" id="action-link" data-value="{{ user_input }}">Click</a>
<script>
document.getElementById('action-link').addEventListener('click', function() {
doSomething(this.dataset.value);
});
</script>Validation at system boundaries
API endpoints
from pydantic import BaseModel, Field, validator
class CreateUserRequest(BaseModel):
username: str = Field(min_length=3, max_length=30, pattern=r'^[a-zA-Z0-9_]+$')
email: str = Field(max_length=254)
age: int = Field(ge=18, le=150)
@validator('email')
def validate_email(cls, v):
# Use a library for email validation
if '@' not in v or '.' not in v.split('@')[1]:
raise ValueError('Invalid email address')
return v.lower()
@app.post('/api/users')
def create_user(data: CreateUserRequest):
# data is validated by Pydantic
...Database layer
# Limit query lengths
MAX_SEARCH_LENGTH = 200
def search_products(query: str):
query = query[:MAX_SEARCH_LENGTH].strip()
return db.execute(
"SELECT * FROM products WHERE name LIKE %s LIMIT 50",
(f"%{query}%",)
)File system
import os
UPLOAD_DIR = '/var/www/uploads'
def safe_save(filename: str, content: bytes):
# Remove path components
filename = os.path.basename(filename)
# Allowlist file extensions
allowed_ext = {'.pdf', '.png', '.jpg', '.docx'}
_, ext = os.path.splitext(filename)
if ext.lower() not in allowed_ext:
raise ValueError(f"File type {ext} not allowed")
# Generate a safe filename
import uuid
safe_name = f"{uuid.uuid4().hex}{ext.lower()}"
# Verify that the path stays within UPLOAD_DIR
full_path = os.path.join(UPLOAD_DIR, safe_name)
if not os.path.realpath(full_path).startswith(os.path.realpath(UPLOAD_DIR)):
raise ValueError("Path traversal detected")
with open(full_path, 'wb') as f:
f.write(content)
return safe_nameCLI parameters
import subprocess
import shlex
# WRONG — shell injection
def run_tool(target):
subprocess.run(f"nmap {target}", shell=True)
# RIGHT — arguments as list
def run_tool(target):
# Validate first
import re
if not re.fullmatch(r'[\w.\-:]+', target):
raise ValueError("Invalid target")
subprocess.run(["nmap", target])Checklist
| Measure | Description | Priority |
|---|---|---|
| Allowlist validation | Define what is allowed, block the rest | Critical |
| Type and range checks | Number is number, date is date | Critical |
| Length limitation | Maximum length on all input fields | High |
| Unicode normalization | NFC/NFKC before validation | High |
| Parameterized queries | Never string concatenation in SQL | Critical |
| Template auto-escaping | Make sure it's enabled and don't bypass unnecessarily | Critical |
| Context-specific encoding | Use the correct encoding per sink | Critical |
| Filename sanitization | os.path.basename() + allowlist extensions |
High |
| Command arguments as list | subprocess.run(["cmd", arg]), never
shell=True |
Critical |
| API schema validation | Pydantic, JSON Schema, or equivalent | High |
It's actually quite simple. You need two rules. Two.
Rule one: trust nothing that comes from outside. Not the form field, not the URL, not the header, not the cookie, not the file, not the API response from the "trusted partner" whose system you pentested last year and that had three critical SQL injections at the time.
Rule two: when data leaves your system — to the browser, the database, the file system, the command line — encode it for that specific context. HTML in HTML, JavaScript in JavaScript, SQL via parameters.
Two rules. That's it. And yet SQL injection and XSS have existed for more than twenty-five years. We haven't solved them. We haven't even reduced them. They're still in the OWASP Top 10. They were in the first OWASP Top 10, in 2003. Twenty-three years ago.
The solution is known. The tools exist. The libraries are free. The documentation is excellent. But somewhere between "we know how to do it" and "we actually do it" there's a gap so wide that you could fit a data center in it. And in that gap there's a Post-it note with "TODO: input sanitization" that's been there since the first sprint.
Summary
Input validation and output encoding are the fundamental defenses against injection attacks. Validate with allowlists, restrict types and lengths, and normalize Unicode before processing. Encode every output for the specific context: HTML entities for HTML, parameterized queries for SQL, lists for command-line arguments.
In the next chapter, we cover the transport layer: how do you configure TLS so that all that carefully validated and encoded data also travels securely over the network?
Further reading in the knowledge base
These articles in the portal provide more background and practical context:
- APIs — the invisible glue of the internet
- SSL/TLS — why that lock icon in your browser matters
- Encryption — the art of making things unreadable
- Password hashing — how websites store your password
- Penetration tests vs. vulnerability scans
You need an account to access the knowledge base. Log in or register.
Related security measures
These articles provide additional context and depth: