Today’s Lesson
Security for Legal SaaS — Episode 7: Input Validation and Sanitisation
The Gatekeeper Problem
Every piece of data that enters your legal SaaS application arrives through a gate. A form field, an API (Application Programming Interface — the way software components communicate with each other) parameter, a file upload, a webhook payload. Input validation is the practice of ensuring that data conforms to expected formats, types, and ranges before your application processes it.
Get this wrong, and you open the door to the injection attacks that follow: SQL injection, XSS, command injection, path traversal. The OWASP Top 10 (2021) lists injection as A03, and most injection vulnerabilities trace back to insufficient input validation.
Key principle: Input validation is your first line of defence, but not your only one. OWASP advises treating validation as defence-in-depth — layer it with parameterised queries, output encoding, and architectural controls.
Allowlists vs Denylists
The most fundamental decision in validation design: do you define what's allowed, or what's forbidden?
| Approach | Definition | Example |
|---|---|---|
| Allowlist (positive validation) | Only explicitly permitted patterns pass | Court ID must match [A-Z]{2}\d{4}/\d{4} |
| Denylist (negative validation) | Block known-bad patterns, allow everything else | Strip <script> tags from input |
OWASP's Input Validation Cheat Sheet recommends allowlisting and describes relying on denylists to catch dangerous characters as a common mistake. Denylists fail because:
- Attackers find bypasses you didn't anticipate (encoding, Unicode normalisation, case variations)
- The set of malicious inputs is effectively unbounded; the set of valid inputs is small and well defined
- Many WAF bypass techniques work by exploiting denylist gaps
Legal SaaS example: A case reference field that strips angle brackets (< and >) to prevent XSS can still be bypassed when the value is rendered inside an HTML attribute: a payload such as " onmouseover="alert(1) closes the attribute and adds an event handler without using a single angle bracket. An allowlist that requires the pattern [A-Z]{2}\s?\d{1,5}/\d{4} rejects that payload because quotes, equals signs, and parentheses can never match; only well-formed case references pass.
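The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, not a production validator: the function names are hypothetical, and the case reference pattern is the one quoted above, anchored with ^ and $ so that nothing can trail a valid prefix.

```typescript
// Case reference pattern from the example above, anchored at both ends.
const CASE_REF = /^[A-Z]{2}\s?\d{1,5}\/\d{4}$/;

// Denylist: remove the characters we thought of and pass everything else on.
function stripAngleBrackets(input: string): string {
  return input.replace(/[<>]/g, "");
}

// Allowlist: accept only values that match the expected shape.
function isValidCaseRef(input: string): boolean {
  return CASE_REF.test(input);
}

// A payload aimed at an HTML attribute context; it contains no angle brackets.
const payload = '" onmouseover="alert(1)';

const afterDenylist = stripAngleBrackets(payload);        // passes through unchanged
const allowlistVerdict = isValidCaseRef(payload);         // false
const legitimateVerdict = isValidCaseRef("HC 1234/2023"); // true
```

The denylist has nothing to remove, so the payload survives intact, while the allowlist never needs to know what an attack looks like.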
Boundary Validation
Where should validation happen? The answer is at every trust boundary — but the mandatory boundary is the server.
The Validation Stack
| Layer | Purpose | Enforces Security? |
|---|---|---|
| Client-side (browser) | UX feedback, reduce round-trips | No — attacker bypasses trivially |
| API gateway / middleware | Schema validation, rate limiting | Partially — coarse-grained |
| Application layer (server) | Business rule validation | Yes — primary enforcement point |
| Database layer | Type constraints, foreign keys | Yes — last-resort constraint |
Client-side validation is never a security control. An attacker can disable JavaScript, modify the DOM (Document Object Model — the browser’s internal representation of a webpage that JavaScript can read and modify), or bypass the browser entirely with direct API calls. Every validation the client performs must be repeated on the server, independently.
Boundary validation means validating data every time it crosses a trust boundary — not just at initial input. A value that was safe when received might become dangerous after transformation. If you URL-decode user input and then use it in a database query, you must validate after decoding, not before.
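As a rough sketch of why the order matters, compare a check that runs on the encoded form with one that runs on the decoded form. The filename allowlist and both function names are assumptions made for illustration.

```typescript
// Hypothetical allowlist for a document name: letters, digits, _ and -, ending in .pdf
const SAFE_FILENAME = /^[A-Za-z0-9_-]{1,64}\.pdf$/;

// Flawed order: the check inspects the encoded form, but the decoded form is what gets used.
function checkThenDecode(raw: string): string | null {
  if (raw.includes("../")) return null; // looks for traversal in the wrong representation
  return decodeURIComponent(raw);
}

// Safer order: decode to the form the application will use, then validate that form.
function decodeThenValidate(raw: string): string | null {
  let decoded: string;
  try {
    decoded = decodeURIComponent(raw);
  } catch (_err) {
    return null; // malformed percent-encoding is rejected outright
  }
  return SAFE_FILENAME.test(decoded) ? decoded : null;
}

const encodedTraversal = "..%2F..%2Fsecrets.pdf";
const flawedResult = checkThenDecode(encodedTraversal);    // "../../secrets.pdf"
const safeResult = decodeThenValidate(encodedTraversal);   // null
const normalResult = decodeThenValidate("brief_2023.pdf"); // "brief_2023.pdf"
```

If a value is decoded more than once on its way through the system, the same rule applies at each step: validate the representation that the next component will actually interpret.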
Legal-Specific Input Patterns
Legal SaaS handles distinctive input types that demand domain-specific validation rules:
| Input Type | Expected Format | Validation Approach |
|---|---|---|
| Singapore case citation | [YYYY] SGCA 15 | Regex: \[\d{4}\]\s+[A-Z]{2,6}\s+\d{1,5} |
| UK neutral citation | [YYYY] UKSC 42 | Regex: \[\d{4}\]\s+[A-Z]{2,6}\s+\d{1,5} |
| US case docket | 1:23-cv-04521 | Regex: \d{1,2}:\d{2}-[a-z]{2,3}-\d{4,6} |
| Court identifier | SGCA, SGHC, UKSC | Allowlist of known court codes |
| Statute reference | s 34(1)(a) | Pattern with known section formats |
| Date fields | ISO 8601 or jurisdiction-specific | Strict date parsing, range validation |
| Client matter number | Firm-specific format | Configurable regex per firm tenant |
CWE-20 (Improper Input Validation) is the root weakness behind dozens of more specific vulnerabilities. For legal identifiers, always validate against the jurisdiction's actual format specification — not a loose pattern that happens to work for your test data.
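One way to combine the patterns from the table is to parse a citation into typed parts and then check the court against a known list. The sketch below assumes a bracketed four-digit year at the start of the citation, as the regex requires; the court list is illustrative only and the function names are hypothetical.

```typescript
// Illustrative court list; a real deployment would load this from configuration.
const KNOWN_COURTS = new Set(["SGCA", "SGHC", "UKSC"]);

// Anchored versions of the table's patterns, with capture groups for the parts.
const NEUTRAL_CITATION = /^\[(\d{4})\]\s+([A-Z]{2,6})\s+(\d{1,5})$/;
const US_DOCKET = /^\d{1,2}:\d{2}-[a-z]{2,3}-\d{4,6}$/;

interface Citation {
  year: number;
  court: string;
  number: number;
}

// Returns a typed citation, or null if the shape or the court code is not recognised.
function parseNeutralCitation(input: string): Citation | null {
  const match = NEUTRAL_CITATION.exec(input);
  if (match === null) return null;
  const [, year = "", court = "", num = ""] = match;
  if (!KNOWN_COURTS.has(court)) return null; // shape is right but the court is unknown
  return { year: Number(year), court, number: Number(num) };
}

function isValidUsDocket(input: string): boolean {
  return US_DOCKET.test(input);
}

const parsed = parseNeutralCitation("[2023] SGCA 15");      // year 2023, court SGCA, number 15
const unknownCourt = parseNeutralCitation("[2023] ABCD 15"); // null
const docketOk = isValidUsDocket("1:23-cv-04521");           // true
```

Checking the court code separately from the regex shows why a loose pattern is not enough: "[2023] ABCD 15" has the right shape but names a court that does not exist in the list.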
Server-Side Validation Is Mandatory
The OWASP Application Security Verification Standard (ASVS) sets out input validation requirements that can be summarised as:
- All input validation is performed server-side
- Input validation failures result in request rejection (not silent sanitisation)
- Structured data is strongly typed and validated against schemas
- Unstructured data is sanitised with allowlists of permitted characters
Why not just sanitise? Sanitisation (modifying input to make it safe) is riskier than rejection because:
- You might miss an encoding your sanitiser doesn't handle
- Modified input might have unintended meaning in business context
- Silent sanitisation masks attacks — you should log and alert on validation failures
The safe pattern: validate first (reject if invalid), then sanitise where necessary for specific output contexts (HTML encoding, SQL parameterisation).
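A minimal sketch of that two-step pattern follows. The field (a free-text matter note), its 500-character limit, and the function names are all assumptions chosen for illustration; the point is that validation either accepts the input untouched or rejects it, and encoding happens later, at the point of output.

```typescript
type Validation =
  | { ok: true; value: string }
  | { ok: false; reason: string };

// Step 1: validate and reject. The input is never silently modified.
function validateMatterNote(input: string): Validation {
  if (input.length === 0 || input.length > 500) {
    return { ok: false, reason: "length out of range" };
  }
  // Control characters other than tab, LF and CR have no place in a note.
  if (/[\u0000-\u0008\u000b\u000c\u000e-\u001f\u007f]/.test(input)) {
    return { ok: false, reason: "control characters present" };
  }
  return { ok: true, value: input };
}

// Step 2: encode for the specific output context, here HTML text and attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const note = "Damages < $5,000 per Smith & Co";
const accepted = validateMatterNote(note);
const rendered = accepted.ok ? escapeHtml(accepted.value) : "";
const rejected = validateMatterNote("note\u0000with null byte");
```

The note is legitimate business text even though it contains < and &, so stripping those characters at input time would corrupt it. Storing it as received and encoding it at render time keeps both the data and the page intact.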
Schema Validation with ORMs
Modern legal SaaS typically uses an ORM (Object-Relational Mapper — a library that lets you write database queries in your programming language instead of raw SQL, the standard language for querying databases) like Prisma, SQLAlchemy, or TypeORM. ORMs provide schema-level validation as a byproduct of their type system:
Prisma, for example, checks the types of the values passed to its client against the schema and rejects a mismatched write before any query is sent to the database.
Best practice: Use Zod (TypeScript) or Pydantic (Python) as a validation layer between your API endpoints and your ORM. These schema validation libraries catch malformed input before it reaches your business logic:
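```typescript
import { z } from "zod";

// Schema for an incoming case record; anything that fails to parse is rejected.
const CaseInput = z.object({
  // Neutral citation, anchored with ^ and $ so no extra text can ride along
  citation: z.string().regex(/^\[\d{4}\]\s+[A-Z]{2,6}\s+\d{1,5}$/),
  courtCode: z.enum(['SGCA', 'SGHC', 'SGDC', 'UKSC', 'EWCA']), // allowlist of court codes
  filedAt: z.string().datetime(), // ISO 8601 date-time string
});
```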
Common Validation Failures in Legal Tech
| Failure Mode | Consequence | Fix |
|---|---|---|
| Trusting client-side validation | Injection attacks bypass browser checks | Duplicate all validation server-side |
| Validating before decoding | Encoded payloads bypass checks | Canonicalise first, validate second |
| Type coercion assumptions | "0" == false is true in JS, so loose comparisons bypass boolean checks | Use strict equality (===) and explicit type checks |
| Missing length limits | Buffer overflows, DoS via megabyte-sized inputs | Enforce max lengths on all string inputs |
| Unicode normalisation gaps | Homoglyph attacks, normalisation-based bypasses | Normalise to NFC before validation |
| Allowing null bytes | Null byte injection truncates strings in C-based systems | Reject null bytes unconditionally |
PortSwigger's research on encoding-based bypasses demonstrates that sophisticated attackers routinely exploit the gap between what validators check and what interpreters execute. Double-encoding, Unicode escapes, and mixed encoding schemes bypass denylists consistently.
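Several of the fixes in the table can be applied in one canonicalisation step that runs before any pattern check. The sketch below is one possible arrangement, assuming a party name field with a hypothetical 200-character limit; the function name is illustrative.

```typescript
const MAX_PARTY_NAME_LENGTH = 200; // hypothetical limit for this field

// Returns the canonical form of the input, or null if it must be rejected.
function canonicalisePartyName(input: unknown): string | null {
  // Strict type check: numbers, booleans and objects are not coerced into strings.
  if (typeof input !== "string") return null;
  // Null bytes are rejected outright rather than stripped.
  if (input.includes("\u0000")) return null;
  // Normalise to NFC so equivalent Unicode sequences compare equal in later checks.
  const normalised = input.normalize("NFC").trim();
  // Enforce the length limit on the normalised form.
  if (normalised.length === 0 || normalised.length > MAX_PARTY_NAME_LENGTH) return null;
  return normalised;
}

const fromNumber = canonicalisePartyName(0);                     // null
const withNullByte = canonicalisePartyName("brief.pdf\u0000.exe"); // null
const composed = canonicalisePartyName("Cafe\u0301 Ltd");        // "Caf\u00e9 Ltd"
const tooLong = canonicalisePartyName("x".repeat(201));          // null
```

Pattern and business rule checks would then run on the returned value, never on the raw input.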
Validation as Logging Trigger
OWASP's Logging Cheat Sheet recommends logging all input validation failures. In legal SaaS, a sudden spike in validation failures on a case search endpoint could indicate:
- An attacker probing for injection vulnerabilities
- A compromised account running automated exfiltration scripts
- A fuzzing tool mapping your attack surface
Log the failure, the input (sanitised for the log itself), the source IP, and the authenticated user. Feed this into your SIEM (Security Information and Event Management — the system that aggregates and analyses security logs). NIST SP 800-92 provides guidance on log management for security monitoring.
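A structured log entry for a validation failure might be built along these lines. The field names, the 200-character cap, and the helper names are assumptions for illustration, not a SIEM schema; the rejected input is escaped before logging so that embedded line breaks cannot forge extra log lines.

```typescript
interface ValidationFailureLog {
  event: "input_validation_failure";
  timestamp: string;
  field: string;
  reason: string;
  inputSample: string;
  sourceIp: string;
  userId: string | null;
}

// Cap the length and escape control characters so the input is safe to write to a log.
function sanitiseForLog(input: string, maxLength = 200): string {
  return input
    .slice(0, maxLength)
    .replace(/[\u0000-\u001f\u007f]/g, (c) => "\\x" + c.charCodeAt(0).toString(16).padStart(2, "0"));
}

function buildValidationFailureLog(
  field: string,
  reason: string,
  rawInput: string,
  sourceIp: string,
  userId: string | null,
  now: Date = new Date(),
): ValidationFailureLog {
  return {
    event: "input_validation_failure",
    timestamp: now.toISOString(),
    field,
    reason,
    inputSample: sanitiseForLog(rawInput),
    sourceIp,
    userId,
  };
}

const entry = buildValidationFailureLog(
  "citation",
  "pattern mismatch",
  "x\r\nFAKE LOG LINE",
  "203.0.113.7", // documentation-range address used as a placeholder
  "user-123",
  new Date(0),
);
const logLine = JSON.stringify(entry); // one JSON object per line, ready for ingestion
```

Emitting one JSON object per failure keeps the entries easy to count per endpoint, which is what makes a sudden spike visible.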
Conclusion
Input validation is where security begins — the point where untrusted data from the outside world meets your application's internal logic. Allowlist over denylist. Server-side over client-side. Reject over sanitise. Log over ignore. Combined with schema validation from ORMs and runtime validators like Zod or Pydantic, you create multiple layers of defence before data reaches your business logic.
Next episode: SQL Injection and ORM Safety — what happens when input validation fails and untrusted data reaches your database queries.