Episode 7 · Module 3 · App Security

Input Validation and Sanitisation

18 May 2026 · 10:12 · Security for Legal SaaS


Every piece of data entering your legal SaaS arrives through a gate. Alice and Dan cover allowlists versus denylists, server-side versus client-side validation, domain-specific patterns for legal citations, and why silent sanitisation masks the attacks you need to see.

Today’s Lesson

Security for Legal SaaS — Episode 7: Input Validation and Sanitisation

The Gatekeeper Problem

Every piece of data that enters your legal SaaS application arrives through a gate: a form field, an API parameter (API stands for Application Programming Interface, the way software components communicate with each other), a file upload, or a webhook payload. Input validation is the practice of ensuring that data conforms to expected formats, types, and ranges before your application processes it.

Get this wrong, and you enable the injection attacks that follow: SQL injection, XSS (cross-site scripting), command injection, path traversal. The OWASP Top 10 (2021 edition) lists injection as A03, and most injection vulnerabilities begin with insufficient input validation.

Key principle: Input validation is your first line of defence, but not your only one. OWASP advises treating validation as defence-in-depth — layer it with parameterised queries, output encoding, and architectural controls.

Allowlists vs Denylists

The most fundamental decision in validation design: do you define what's allowed, or what's forbidden?

Approach	Definition	Example
Allowlist (positive validation)	Only explicitly permitted patterns pass	Court ID must match `[A-Z]{2}\d{4}/\d{4}`
Denylist (negative validation)	Block known-bad patterns, allow everything else	Strip `<script>` tags from input

OWASP's Input Validation Cheat Sheet recommends allowlist validation over denylist validation. Denylists fail because:

Attackers find bypasses you didn't anticipate (encoding, Unicode normalisation, case variations)
The set of malicious inputs is infinite; the set of valid inputs is finite
Many WAF (web application firewall) bypass techniques work by exploiting denylist gaps

Legal SaaS example: A case reference field that strips angle brackets (< and >) to prevent XSS can still be bypassed wherever the value is rendered inside an HTML attribute. A payload such as " autofocus onfocus=alert(1) x=" closes the attribute and adds an event handler without using a single angle bracket. An allowlist that requires the whole value to match ^[A-Z]{2}\s?\d{1,5}/\d{4}$ rejects that payload along with every other input that is not a well-formed case reference.
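A minimal TypeScript sketch of that contrast. The function names and the sample reference `HC 1234/2024` are made up for illustration; only the regex comes from the text, with `^` and `$` anchors added so the whole value has to match.

```typescript
// Denylist: strip angle brackets and hope nothing dangerous is left.
function stripAngleBrackets(input: string): string {
  return input.replace(/[<>]/g, "");
}

// Allowlist: the whole value must match the case reference pattern.
const CASE_REFERENCE = /^[A-Z]{2}\s?\d{1,5}\/\d{4}$/;

function isValidCaseReference(input: string): boolean {
  return CASE_REFERENCE.test(input);
}

// A payload aimed at an HTML attribute context contains no angle brackets.
const attributePayload = '" autofocus onfocus=alert(1) x="';

// The denylist has nothing to strip, so the payload comes back unchanged.
const afterDenylist: string = stripAngleBrackets(attributePayload);

// The allowlist rejects the payload and accepts a well-formed reference.
const payloadAccepted: boolean = isValidCaseReference(attributePayload);
const referenceAccepted: boolean = isValidCaseReference("HC 1234/2024");
```

The denylist only removes what its author thought of in advance. The allowlist never needs to know what an attack looks like.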

Boundary Validation

Where should validation happen? The answer is at every trust boundary — but the mandatory boundary is the server.

The Validation Stack

Layer	Purpose	Enforces Security?
Client-side (browser)	UX feedback, reduce round-trips	No — attacker bypasses trivially
API gateway / middleware	Schema validation, rate limiting	Partially — coarse-grained
Application layer (server)	Business rule validation	Yes — primary enforcement point
Database layer	Type constraints, foreign keys	Yes — last-resort constraint

Client-side validation is never a security control. An attacker can disable JavaScript, modify the DOM (Document Object Model — the browser’s internal representation of a webpage that JavaScript can read and modify), or bypass the browser entirely with direct API calls. Every validation the client performs must be repeated on the server, independently.

Boundary validation means validating data every time it crosses a trust boundary — not just at initial input. A value that was safe when received might become dangerous after transformation. If you URL-decode user input and then use it in a database query, you must validate after decoding, not before.
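The decode-then-validate ordering can be sketched as follows. This is an illustration under assumptions: the `SEARCH_TERM` allowlist and the function name are hypothetical, and a real application would choose a pattern that fits its own search field.

```typescript
// Hypothetical allowlist for a case search term: letters, digits, spaces,
// and the punctuation that appears in citations. Not from the original text.
const SEARCH_TERM = /^[A-Za-z0-9 \[\]\/.\-]{1,100}$/;

type Checked = { ok: true; value: string } | { ok: false; reason: string };

// Canonicalise first (URL-decode), then validate the decoded form.
function decodeThenValidate(raw: string): Checked {
  let decoded: string;
  try {
    decoded = decodeURIComponent(raw); // throws URIError on malformed escapes
  } catch {
    return { ok: false, reason: "malformed percent-encoding" };
  }
  if (!SEARCH_TERM.test(decoded)) {
    return { ok: false, reason: "value does not match the allowlist" };
  }
  return { ok: true, value: decoded };
}

// A check for angle brackets that runs before decoding sees only percent signs.
const encodedPayload = "%3Cscript%3E";
const missedBeforeDecoding: boolean = !/[<>]/.test(encodedPayload);
// The same payload fails once it is validated in its decoded form.
const rejectedAfterDecoding: boolean = !decodeThenValidate(encodedPayload).ok;
```

The value returned on success is the decoded one, so the rest of the application works with exactly the form that was validated.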

Legal-Specific Input Patterns

Legal SaaS handles distinctive input types that demand domain-specific validation rules:

Input Type	Expected Format	Validation Approach
Singapore case citation	`[YYYY] SGCA 15`	Regex: `\[\d{4}\]\s+[A-Z]{2,6}\s+\d{1,5}`
UK neutral citation	`[YYYY] UKSC 42`	Regex: `\[\d{4}\]\s+[A-Z]{2,6}\s+\d{1,5}`
US case docket	`1:23-cv-04521`	Regex: `\d{1,2}:\d{2}-[a-z]{2,3}-\d{4,6}`
Court identifier	`SGCA`, `SGHC`, `UKSC`	Allowlist of known court codes
Statute reference	`s 34(1)(a)`	Pattern with known section formats
Date fields	ISO 8601 or jurisdiction-specific	Strict date parsing, range validation
Client matter number	Firm-specific format	Configurable regex per firm tenant

CWE-20 (Improper Input Validation) is the root weakness behind dozens of more specific vulnerabilities. For legal identifiers, always validate against the jurisdiction's actual format specification — not a loose pattern that happens to work for your test data.
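As a rough sketch, the citation and docket patterns from the table might be wired up like this. The regexes are the ones listed above with `^` and `$` anchors added; the function names, the `ParsedCitation` shape, and the short court list are assumptions for illustration.

```typescript
// Regexes from the table, anchored so the whole value must match.
const NEUTRAL_CITATION = /^\[(\d{4})\]\s+([A-Z]{2,6})\s+(\d{1,5})$/;
const US_DOCKET = /^\d{1,2}:\d{2}-[a-z]{2,3}-\d{4,6}$/;

// Court codes named in this episode; a real deployment would load the
// full list for each jurisdiction it supports.
const KNOWN_COURTS: ReadonlySet<string> = new Set(["SGCA", "SGHC", "SGDC", "UKSC", "EWCA"]);

interface ParsedCitation {
  year: number;
  court: string;
  caseNumber: number;
}

// Returns the parsed citation, or null when the format or court code is not recognised.
function parseNeutralCitation(input: string): ParsedCitation | null {
  const match = NEUTRAL_CITATION.exec(input);
  if (match === null) return null;
  const [, yearText = "", court = "", numberText = ""] = match;
  // The regex alone accepts any 2-6 capital letters, so check the allowlist too.
  if (!KNOWN_COURTS.has(court)) return null;
  return { year: Number(yearText), court, caseNumber: Number(numberText) };
}

function isUsDocketNumber(input: string): boolean {
  return US_DOCKET.test(input);
}
```

Combining the pattern with the court allowlist is one way to avoid a loose pattern that merely happens to work: `[2023] XXXX 15` has the right shape but names no known court, so it is rejected.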

Server-Side Validation Is Mandatory

The OWASP Application Security Verification Standard (ASVS) Level 1 requires:

All input validation is performed server-side
Input validation failures result in request rejection (not silent sanitisation)
Structured data is strongly typed and validated against schemas
Unstructured data is sanitised with allowlists of permitted characters

Why not just sanitise? Sanitisation (modifying input to make it safe) is riskier than rejection because:

You might miss an encoding your sanitiser doesn't handle
Modified input might have unintended meaning in business context
Silent sanitisation masks attacks — you should log and alert on validation failures

The safe pattern: validate first (reject if invalid), then sanitise where necessary for specific output contexts (HTML encoding, SQL parameterisation).
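One way to picture that pattern is a free-text matter title in which characters such as `&` and `<` are legitimate. The sketch below rejects invalid input outright, stores valid input unmodified, and encodes only at the point of HTML output. The function names and the 200-character limit are assumptions, not part of any standard.

```typescript
const MAX_TITLE_LENGTH = 200; // hypothetical limit

// Validate first: throw on invalid input instead of quietly repairing it.
function acceptMatterTitle(input: string): string {
  if (input.length === 0 || input.length > MAX_TITLE_LENGTH) {
    throw new Error("matter title length out of range");
  }
  if (/[\u0000-\u001f\u007f]/.test(input)) {
    throw new Error("matter title contains control characters");
  }
  return input; // stored exactly as received, never silently modified
}

// Encode later, for one specific output context (HTML text and attribute values).
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const storedTitle: string = acceptMatterTitle("Tan & Lim v <Redacted>");
const renderedTitle: string = escapeHtml(storedTitle);
```

Because the stored value is never altered, the same title can be encoded differently for HTML, JSON, or a CSV export without accumulating damage from earlier clean-up passes.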

Schema Validation with ORMs

Modern legal SaaS typically uses an ORM (Object-Relational Mapper, a library that lets you write database queries in your programming language instead of raw SQL, the standard language for querying databases) such as Prisma, SQLAlchemy, or TypeORM. ORMs provide schema-level validation as a byproduct of their type system.

Prisma's schema validation, for example, enforces types at the application boundary.

Best practice: Use Zod (TypeScript) or Pydantic (Python) as a validation layer between your API endpoints and your ORM. These schema validation libraries catch malformed input before it reaches your business logic:

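import { z } from 'zod';

const CaseInput = z.object({
  // Bracketed year, court code, case number
  citation: z.string().regex(/^\[\d{4}\]\s+[A-Z]{2,6}\s+\d{1,5}$/),
  // Strict allowlist of known court codes
  courtCode: z.enum(['SGCA', 'SGHC', 'SGDC', 'UKSC', 'EWCA']),
  // ISO 8601 date-time string
  filedAt: z.string().datetime(),
});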

Common Validation Failures in Legal Tech

Failure Mode	Consequence	Fix
Trusting client-side validation	Injection attacks bypass browser checks	Duplicate all validation server-side
Validating before decoding	Encoded payloads bypass checks	Canonicalise first, validate second
Type coercion assumptions	`"0" == false` in JS, bypasses boolean checks	Use strict type checking
Missing length limits	Buffer overflows, DoS via megabyte-sized inputs	Enforce max lengths on all string inputs
Unicode normalisation gaps	Homoglyph attacks, normalisation-based bypasses	Normalise to NFC before validation
Allowing null bytes	Null byte injection truncates strings in C-based systems	Reject null bytes unconditionally
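Several of the fixes in the table can be combined in one small validator for a free-text field. This is a sketch under assumptions: `validateCaseNote`, `parseStrictBoolean`, and the 10,000-character limit are hypothetical names and values chosen for illustration.

```typescript
const MAX_NOTE_LENGTH = 10000; // hypothetical limit, in UTF-16 code units

type NoteResult = { ok: true; value: string } | { ok: false; reason: string };

function validateCaseNote(input: unknown): NoteResult {
  // Strict type check: no coercion of numbers, arrays, or objects into strings.
  if (typeof input !== "string") return { ok: false, reason: "expected a string" };
  // Reject null bytes outright rather than stripping them.
  if (input.includes("\u0000")) return { ok: false, reason: "null byte in input" };
  // Normalise to NFC first so the length check sees the canonical form.
  const normalised = input.normalize("NFC");
  if (normalised.length === 0 || normalised.length > MAX_NOTE_LENGTH) {
    return { ok: false, reason: "length out of range" };
  }
  return { ok: true, value: normalised };
}

// Strict boolean parsing: only a real boolean passes; "0", 0, and "" do not.
function parseStrictBoolean(input: unknown): boolean | null {
  return typeof input === "boolean" ? input : null;
}
```

In practice a request body size limit at the HTTP layer would sit in front of this, so that an oversized payload is refused before it is ever parsed into a string.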

PortSwigger's research on encoding-based bypasses demonstrates that sophisticated attackers routinely exploit the gap between what validators check and what interpreters execute. Double-encoding, Unicode escapes, and mixed encoding schemes bypass denylists consistently.
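A simple guard against double-encoding is to decode once and then refuse anything that still contains percent-escapes. The sketch below assumes URL-encoded input and a hypothetical function name. It is deliberately strict, so a field that legitimately holds text like `%20` would need a different rule.

```typescript
// Decodes once; returns null if the input is malformed or still looks encoded.
function decodeRejectingDoubleEncoding(raw: string): string | null {
  let decoded: string;
  try {
    decoded = decodeURIComponent(raw);
  } catch {
    return null; // malformed percent-encoding
  }
  // Escape sequences that survive one round of decoding mean the input
  // was encoded more than once, which ordinary clients have no reason to do.
  if (/%[0-9A-Fa-f]{2}/.test(decoded)) return null;
  return decoded;
}

// "%253C" decodes to "%3C", which a later decoding step would turn into "<".
const doubleEncoded: string | null = decodeRejectingDoubleEncoding("%253Cscript%253E");
// Single encoding decodes cleanly and is handed on to format validation.
const singleEncoded: string | null = decodeRejectingDoubleEncoding("SGCA%2015");
```

Rejecting here, instead of decoding repeatedly until the value stops changing, keeps the validator and the interpreter looking at the same form of the data.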

Validation as Logging Trigger

OWASP's Logging Cheat Sheet recommends logging all input validation failures. In legal SaaS, a sudden spike in validation failures on a case search endpoint could indicate:

An attacker probing for injection vulnerabilities
A compromised account running automated exfiltration scripts
A fuzzing tool mapping your attack surface

Log the failure, the input (sanitised for the log itself), the source IP, and the authenticated user. Feed this into your SIEM (Security Information and Event Management — the system that aggregates and analyses security logs). NIST SP 800-92 provides guidance on log management for security monitoring.
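A structured log entry along those lines might look like the sketch below. The field names and the 200-character preview limit are assumptions, not a SIEM standard. The escaping step matters because raw input written to a log can contain line breaks that forge extra entries.

```typescript
interface ValidationFailureEvent {
  event: "input_validation_failure";
  timestamp: string; // ISO 8601
  endpoint: string;
  reason: string;
  inputPreview: string; // truncated and escaped, never the raw value
  sourceIp: string;
  userId: string | null; // null for unauthenticated requests
}

// Truncate, then replace control characters (including CR and LF) with \xNN escapes.
function sanitiseForLog(input: string, maxLength: number = 200): string {
  return input
    .slice(0, maxLength)
    .replace(/[\u0000-\u001f\u007f]/g, (ch) => "\\x" + ("0" + ch.charCodeAt(0).toString(16)).slice(-2));
}

function buildValidationFailureEvent(
  endpoint: string,
  reason: string,
  rawInput: string,
  sourceIp: string,
  userId: string | null,
  now: Date = new Date(),
): ValidationFailureEvent {
  return {
    event: "input_validation_failure",
    timestamp: now.toISOString(),
    endpoint,
    reason,
    inputPreview: sanitiseForLog(rawInput),
    sourceIp,
    userId,
  };
}
```

Serialising the event with `JSON.stringify` gives one line per failure, which makes it straightforward to count failures per endpoint or per user and alert on a spike.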

Conclusion

Input validation is where security begins — the point where untrusted data from the outside world meets your application's internal logic. Allowlist over denylist. Server-side over client-side. Reject over sanitise. Log over ignore. Combined with schema validation from ORMs and runtime validators like Zod or Pydantic, you create multiple layers of defence before data reaches your business logic.

Next episode: SQL Injection and ORM Safety — what happens when input validation fails and untrusted data reaches your database queries.

Alice: Welcome back to Security for Legal SaaS. I’m Alice.

Dan: And I’m Dan. Episode 7 — Input Validation and Sanitisation. Alice, last episode we covered how web apps communicate. Today we’re at the front gate. Data comes in from the outside world. How do we decide what gets through?

Alice: This is the single most impactful security practice you can implement. Every injection attack — SQL injection, cross-site scripting, command injection — starts with invalid input that your application processed without checking. Input validation is asking: does this data look like what I expected?

Dan: So give me the fundamental decision. How do you define “valid”?

Alice: Two approaches. An allowlist says “only these specific patterns are permitted.” A denylist says “block these known-bad patterns, allow everything else.” OWASP is unequivocal: use allowlists. Always.

Dan: Why is the denylist approach so dangerous?

Alice: Because the set of malicious inputs is infinite. You block script tags? The attacker uses event handlers. You block angle brackets? They use Unicode equivalents. You block single quotes? They use double encoding. Every WAF bypass technique in existence exploits denylist gaps. But an allowlist that says “a Singapore case citation must match the pattern bracket, four digits, bracket, space, two to six uppercase letters, space, one to five digits” — that rejects everything that isn’t a valid citation. Including every possible attack payload.

Dan: That’s elegant for structured data. But what about free-text fields? A lawyer’s case notes, for instance.

Alice: Free text is harder, which is why validation alone isn’t sufficient. For free-text, you validate what you can — character encoding must be UTF-8, length must be under your maximum, no null bytes. Then you rely on output encoding and parameterised queries to handle the content safely. Validation is your first layer, not your only layer.

Dan: Where does validation need to happen? I’ve seen debates about client-side versus server-side.

Alice: There’s no debate. Server-side is mandatory. Client-side is optional UX polish. An attacker can disable JavaScript, modify the DOM, or bypass the browser entirely by calling your API directly with curl. If your only validation is in the browser, you have no validation.

Dan: So why do client-side at all?

Alice: User experience. Immediate feedback when a lawyer types an invalid case number. Fewer wasted round-trips to the server. But it’s never security — it’s convenience. The server must independently validate everything, regardless of what the client claims to have checked.

Dan: Legal SaaS has some distinctive input types. Case citations, court codes, statute references. How do you handle those?

Alice: Domain-specific allowlists. A Singapore case citation follows the pattern bracket-year-bracket, court code, case number. You can define that as a regex and reject anything that doesn’t match. Court identifiers — SGCA, SGHC, UKSC, EWCA — are a finite set. That’s a strict allowlist, an enum. Statute references have known patterns per jurisdiction. The key insight is that legal data is often highly structured, which makes it easier to validate strictly than general web input.

Dan: What about the tooling layer? I hear a lot about Zod, Pydantic, schema validators.

Alice: These are runtime schema validators that sit between your API endpoint and your business logic. Zod for TypeScript, Pydantic for Python. You define a schema — this field is a string matching this regex, that field is an enum of these values, this date must be in the past — and the library validates incoming data against it automatically. If validation fails, the request is rejected before your business logic ever sees it.

Dan: And ORMs add another layer?

Alice: Yes. Your ORM — Object-Relational Mapper, the library that generates database queries from your code — its schema defines type constraints, length limits, foreign key relationships, and enum restrictions at the database layer. Between Zod at the API boundary and Prisma at the database boundary, malformed data has to pass through two independently defined validation layers. That’s defence in depth.

Dan: One thing I want to clarify — validation versus sanitisation. They’re different?

Alice: Critically different. Validation checks whether input conforms to expectations and rejects it if it doesn’t. Sanitisation modifies input to make it safe — stripping dangerous characters, encoding special characters. The safer pattern is validate first, reject if invalid. Only sanitise when you have a legitimate business reason to accept input that contains characters meaningful in another context.

Dan: Why is sanitisation riskier than rejection?

Alice: Three reasons. First, your sanitiser might miss an encoding it doesn’t handle. Second, modified input might have unintended meaning — you strip characters from a case citation and now it references the wrong case. Third, silent sanitisation masks attacks. If an input field is receiving SQL injection attempts, you want to know about it. Logging a validation rejection gives you visibility. Silently cleaning the input hides the attack.

Dan: That connects to monitoring, right? Validation failures as security signals?

Alice: Absolutely. A sudden spike in validation failures on your case search endpoint could mean an attacker is probing for injection vulnerabilities. Or a compromised account is running automated scripts. Log every validation failure — the endpoint, the rejected input sanitised for the log, the source IP, the authenticated user. Feed that into your SIEM — your Security Information and Event Management system. OWASP recommends this explicitly in their logging cheat sheet.

Dan: What about the ordering problem? I’ve seen validators that check input before it’s decoded.

Alice: Classic mistake. If you validate before URL-decoding, an attacker sends %3Cscript%3E — your validator sees percent signs, not angle brackets, and passes it through. Then your application decodes it, and now you have a script tag in your data. The rule is canonicalise first — decode, normalise Unicode to NFC, resolve any encoding layers — then validate. Always validate the final form of the data, not an intermediate representation.

Dan: Any other common mistakes in the legal tech context?

Alice: Missing length limits is huge. Without them, an attacker can submit a 50-megabyte string as a case note and crash your parser or exhaust server memory. Type coercion in JavaScript — where the string “0” equals false — leads to bypass conditions in boolean checks. And null byte injection — inserting a zero byte that some C-based libraries interpret as a string terminator, causing truncation in file paths or database queries. Reject null bytes unconditionally in any web input.

Dan: So the summary: allowlist over denylist, server-side is mandatory, validate before your business logic sees data, log failures as security signals, and canonicalise before you validate.

Alice: And remember — validation stops the obviously wrong. The next episodes cover what happens when subtly wrong data gets through. Parameterised queries, output encoding, content security policies — those are your deeper defences.

Dan: Next episode — SQL Injection and ORM Safety. What happens when untrusted input reaches your database queries. The granddaddy of injection attacks.

Alice: Until then, I’m Alice.

Dan: And I’m Dan.

Alice: Security for Legal SaaS is a series written with AI assistance. Alice and Dan are AI-generated voices — no professional advice here, just education.
