Security for Legal SaaS — Episode 10: File Upload Security
Every Upload Is a Payload
File uploads are among the most dangerous features in any web application. A file is an opaque blob of bytes that your server must receive, store, and potentially process. OWASP identifies unrestricted file upload as a critical web application vulnerability because it can lead to remote code execution, cross-site scripting, denial of service, and complete system compromise.
For legal SaaS, file uploads are unavoidable. Lawyers upload contracts, pleadings, evidence, exhibits, expert reports. The document management system is the product. You cannot eliminate file uploads — you must make them safe.
Key principle: A file's extension and Content-Type header are claims made by the uploader. Neither can be trusted. Validation must examine the file's actual content — its magic bytes, structure, and embedded elements.
Dangerous File Types
Direct Execution Threats
| File Type | Danger | Attack Scenario |
|---|---|---|
| .php, .asp, .jsp | Server-side execution | Upload web shell, access via URL, execute arbitrary commands |
| .html, .svg | Client-side execution | Stored XSS when served from same origin |
| .exe, .bat, .ps1 | OS execution | Social engineering: “Download this exhibit” |
| .jar, .class | Java execution | Bypass server-side restrictions via deserialization |
Polyglot Files
A polyglot is a file that is simultaneously valid as multiple file types. Researchers have demonstrated files that are valid JPEGs and valid JavaScript, valid PDFs and valid ZIP archives, and valid images that contain hidden HTML.
Attack: Upload a polyglot JPEG/JavaScript file. The server validates it as an image because the magic bytes match JPEG. If the file is later served with an incorrect Content-Type, or without X-Content-Type-Options: nosniff, and an attacker can get a page to load it as a script, the browser may interpret it as JavaScript and execute the attacker’s code.
Archive Bombs (Zip Bombs)
A zip bomb is a compressed file that expands to an enormous size — gigabytes or terabytes. The classic 42.zip is 42KB compressed but expands to 4.5 petabytes across nested archives.
Legal SaaS impact: If your system automatically extracts uploaded archives (common for bulk evidence uploads), a zip bomb exhausts disk space and memory, causing denial of service. Nested archives amplify the effect exponentially.
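One way to guard automatic extraction is to count decompressed bytes as they are produced and abort once a budget is exceeded. The sketch below is a minimal illustration using Python's standard `zipfile` module; the function name `check_archive`, the `ArchiveRejected` exception, and all four limits are assumptions chosen for the example, not values from any standard. It buffers each member in memory (bounded by the budget), so a production version would likely stream to quarantined disk instead.

```python
import io
import zipfile
from typing import Optional

# All limits below are illustrative assumptions; tune them for your workload.
MAX_TOTAL_BYTES = 200 * 1024 * 1024  # absolute cap on extracted bytes
MAX_RATIO = 100                      # extracted size vs. uploaded size
MAX_MEMBERS = 1000                   # entries allowed per archive
MAX_DEPTH = 2                        # levels of archive nesting


class ArchiveRejected(Exception):
    """Raised when an archive exceeds one of the configured limits."""


def check_archive(data: bytes, depth: int = 1, total: int = 0,
                  limit: Optional[int] = None) -> int:
    """Return the total decompressed size, or raise ArchiveRejected."""
    if depth > MAX_DEPTH:
        raise ArchiveRejected("archive nesting too deep")
    if limit is None:
        # The byte budget is fixed once, from the size of the original upload.
        limit = min(MAX_TOTAL_BYTES, MAX_RATIO * len(data))
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = archive.infolist()
        if len(members) > MAX_MEMBERS:
            raise ArchiveRejected("too many entries")
        for info in members:
            if info.is_dir():
                continue
            extracted = bytearray()
            with archive.open(info) as member:
                while True:
                    # Count bytes as they are produced instead of trusting
                    # the sizes declared in the archive headers.
                    chunk = member.read(64 * 1024)
                    if not chunk:
                        break
                    total += len(chunk)
                    if total > limit:
                        raise ArchiveRejected("expansion limit exceeded")
                    extracted += chunk
            # Nested archives share the same byte budget and add one level.
            if zipfile.is_zipfile(io.BytesIO(extracted)):
                total = check_archive(bytes(extracted), depth + 1, total, limit)
    return total
```

Because the budget is shared across nesting levels, a chain of small archives cannot multiply its way past the limit. Note that Office documents are themselves ZIP containers, so this sketch counts a DOCX inside an archive as one level of nesting.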
Script-Containing Documents
| Format | Script Capability | Example |
|---|---|---|
| PDF | JavaScript (Acrobat), Launch actions | Auto-execute on open; phishing overlays |
| DOCX | VBA macros, ActiveX controls, DDE | Remote code execution via macro-enabled documents |
| XLSX | Macros, external data connections, DDE | Data exfiltration via external cell references |
| SVG | Inline JavaScript, event handlers | XSS when rendered in browser |
| HTML/MHTML | Full JavaScript execution | Complete web application in a single file |
Magic Byte Validation
Most binary file formats begin with a characteristic byte sequence, known as magic bytes or a file signature. Validating these is more reliable than checking file extensions:
| Format | Magic Bytes (hex) | Offset |
|---|---|---|
| PDF | 25 50 44 46 (%PDF) | 0 |
| DOCX/XLSX/PPTX | 50 4B 03 04 (ZIP header) | 0 |
| JPEG | FF D8 FF | 0 |
| PNG | 89 50 4E 47 0D 0A 1A 0A | 0 |
| TIFF | 49 49 2A 00 (little-endian) or 4D 4D 00 2A (big-endian) | 0 |
| GIF | 47 49 46 38 (GIF8) | 0 |
    import magic  # python-magic library

    def validate_file_type(file_bytes: bytes, expected_mime: str) -> bool:
        """Validate file content matches expected MIME type using libmagic."""
        detected = magic.from_buffer(file_bytes[:2048], mime=True)
        return detected == expected_mime
Warning: Magic byte validation alone is insufficient for security. Polyglot files have valid magic bytes for their cover format. A polyglot JPEG has valid JPEG magic bytes while also containing executable JavaScript. Magic bytes confirm format but do not guarantee safety. You must also validate internal structure and strip dangerous elements.
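If libmagic is not available, the signature table above can be checked directly. The following is a minimal standard-library sketch; the names `SIGNATURES`, `EXTENSION_TO_MIME`, `sniff_signature`, and `extension_matches_content` are hypothetical, and the extension mapping is an assumption for illustration. As the warning above says, a match confirms only the cover format, and DOCX/XLSX/PPTX cannot be told apart at this level because they share the ZIP header.

```python
import os
from typing import Optional

# (signature, MIME label) pairs copied from the magic byte table; all at offset 0.
SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),   # DOCX/XLSX/PPTX are ZIP containers
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"II*\x00", "image/tiff"),           # little-endian TIFF
    (b"MM\x00*", "image/tiff"),           # big-endian TIFF
    (b"GIF8", "image/gif"),
]

# Assumed allowlist: which signature each accepted extension must carry.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".docx": "application/zip",
    ".xlsx": "application/zip",
    ".jpg": "image/jpeg",
    ".png": "image/png",
    ".tiff": "image/tiff",
}


def sniff_signature(head: bytes) -> Optional[str]:
    """Return the MIME label whose signature starts the file, if any."""
    for signature, mime in SIGNATURES:
        if head.startswith(signature):
            return mime
    return None


def extension_matches_content(filename: str, head: bytes) -> bool:
    """True only if the extension is allowlisted and the header agrees with it."""
    suffix = os.path.splitext(filename)[1].lower()
    expected = EXTENSION_TO_MIME.get(suffix)
    return expected is not None and sniff_signature(head) == expected
```

For example, `extension_matches_content("exhibit.jpg", b"<?php system($_GET['c']);")` returns `False` because the content does not begin with the JPEG signature, while a file named `shell.php` is rejected regardless of content because its extension is not allowlisted.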
Sandboxed Processing
Files must be processed in isolation — never in your application’s main execution context. OWASP recommends processing uploads in sandboxed environments:
Processing Architecture
Upload → Quarantine Storage → Sandbox Worker → Validated Storage → Application
| Stage | Control | Purpose |
|---|---|---|
| Quarantine | Separate storage, no direct URL access | Prevent execution before validation |
| Sandbox | Container/VM with no network, limited filesystem | Contain exploitation of parser vulnerabilities |
| Validation | Magic bytes + structure check + virus scan | Confirm file is what it claims |
| Transformation | Re-encode images, flatten PDFs, strip macros | Remove embedded threats |
| Validated storage | Immutable object storage with random names | Prevent path traversal and overwrite |
Why Sandboxing Matters
File parsing libraries have vulnerabilities. ImageMagick’s “ImageTragick” (CVE-2016-3714) allowed remote code execution via crafted images, so any server passing user-uploaded images to an unpatched ImageMagick was exposed. Ghostscript has had multiple RCE vulnerabilities in PDF processing. LibreOffice has had command injection via document conversion.
If your processing runs in a sandboxed container with no network access and minimal filesystem permissions, exploitation of a parser vulnerability is contained rather than granting the attacker your entire server.
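As a rough illustration of what "no network access and minimal filesystem permissions" can look like, the sketch below assembles a `docker run` command for a one-shot sanitiser worker. It assumes Docker is the container runtime; the image name, mount points, and resource limits are hypothetical, and the helper names `build_sandbox_command` and `run_in_sandbox` are invented for this example.

```python
import subprocess
from typing import List


def build_sandbox_command(image: str, input_dir: str, output_dir: str,
                          worker_args: List[str]) -> List[str]:
    """Build a docker invocation for a locked-down, single-use worker."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                       # no network access at all
        "--read-only",                             # root filesystem is read-only
        "--cap-drop", "ALL",                       # drop every Linux capability
        "--security-opt", "no-new-privileges",     # block privilege escalation
        "--memory", "512m",                        # assumed memory ceiling
        "--pids-limit", "64",                      # assumed process ceiling
        "--user", "65534:65534",                   # run as an unprivileged user
        "--tmpfs", "/tmp:rw,noexec,nosuid,size=64m",
        "-v", f"{input_dir}:/in:ro",               # quarantined file, read-only
        "-v", f"{output_dir}:/out:rw",             # only place results can land
        image, *worker_args,
    ]


def run_in_sandbox(command: List[str], timeout_seconds: int = 60) -> subprocess.CompletedProcess:
    """Run the worker; a timeout or non-zero exit raises and the file stays quarantined."""
    return subprocess.run(command, check=True, capture_output=True,
                          timeout=timeout_seconds)
```

A call might look like `build_sandbox_command("legal-sanitiser:latest", "/quarantine/abc", "/validated/abc", ["sanitise", "/in/upload.pdf"])`, where the image name and paths are placeholders. Passing the command as a list, rather than a shell string, keeps attacker-influenced values out of shell parsing.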
Storage Isolation
Never Store in Web-Accessible Directories
If uploaded files are stored in a directory served by your web server (e.g., /var/www/uploads/), an attacker who uploads a PHP file can access it via URL and achieve remote code execution. Store files outside the web root and serve them through a controlled endpoint that sets appropriate headers.
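A minimal sketch of the storage side, assuming local disk rather than object storage: the file is written under a server-generated identifier in a directory outside the web root, and retrieval accepts only identifiers of that shape. The function names `store_upload` and `resolve_upload` and the two-character sharding scheme are assumptions for illustration.

```python
import os
import uuid
from pathlib import Path


def store_upload(data: bytes, storage_root: Path) -> str:
    """Write data under a random name and return the generated file ID."""
    file_id = uuid.uuid4().hex  # 32 lowercase hex characters, never user input
    target = storage_root / file_id[:2] / file_id
    target.parent.mkdir(parents=True, exist_ok=True)
    # O_EXCL refuses to overwrite an existing file; mode 0o600 sets no execute bits.
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return file_id


def resolve_upload(file_id: str, storage_root: Path) -> Path:
    """Map a file ID back to its path, rejecting anything we did not generate."""
    if len(file_id) != 32 or any(c not in "0123456789abcdef" for c in file_id):
        raise ValueError("invalid file id")
    return storage_root / file_id[:2] / file_id
```

The original filename would be kept as metadata in the database and used only for display. Since `resolve_upload` accepts nothing but 32 hex characters, a request for `../../etc/passwd` fails validation before any path is built.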
Storage Security Requirements
| Requirement | Implementation |
|---|---|
| Random filenames | UUID or content-hash; never user-supplied names |
| No directory listing | Disable index pages on storage endpoints |
| Separate origin | Serve from files.legalapp.com, not app.legalapp.com |
| Content-Disposition: attachment | Forces download instead of inline rendering |
| X-Content-Type-Options: nosniff | Prevents MIME sniffing |
| No execution permissions | Storage directory has no execute bits |
| Access control on retrieval | Verify user authorisation before serving each file |
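The header requirements in the table can be gathered into one helper that a download endpoint calls after its authorisation check. This is a framework-neutral sketch: `download_headers` is a hypothetical name, and the generic `application/octet-stream` type and the `Content-Security-Policy` value are extra hardening assumptions that do not appear in the table.

```python
import re
from urllib.parse import quote


def download_headers(display_name: str) -> dict:
    """Build response headers that force a download and disable MIME sniffing."""
    # ASCII-only fallback name: anything outside a safe set becomes "_",
    # which also removes quotes and CR/LF that could break the header.
    fallback = re.sub(r"[^A-Za-z0-9._-]", "_", display_name) or "download"
    # Percent-encoded form of the original name for clients that support filename*.
    encoded = quote(display_name, safe="")
    disposition = "attachment; filename=\"{}\"; filename*=UTF-8''{}".format(
        fallback, encoded
    )
    return {
        "Content-Type": "application/octet-stream",   # assumption: never echo the uploader's type
        "Content-Disposition": disposition,
        "X-Content-Type-Options": "nosniff",
        # Assumption: extra defence if a browser renders the response anyway.
        "Content-Security-Policy": "default-src 'none'; sandbox",
    }
```

The display name comes from stored metadata and is used only in this header, never in a filesystem path.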
Object Storage (S3-Pattern)
Cloud object storage (AWS S3, GCS, Azure Blob) provides inherent isolation — files cannot be “executed” by the storage service. Use pre-signed URLs with short expiry for authorised access:
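    import boto3

    s3_client = boto3.client("s3")

    # Generate a pre-signed URL that expires in 5 minutes.
    # matter_id and file_id are supplied by the caller after an authorisation check.
    url = s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'legal-docs', 'Key': f'matters/{matter_id}/{file_id}'},
        ExpiresIn=300
    )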
Legal Document Format Attack Vectors
PDF Attacks
PDFs are the most common document format in legal practice and one of the most complex file formats in existence. The PDF specification (ISO 32000) supports:
- JavaScript — auto-execute on open, form validation, dynamic content
- Launch actions — open external applications
- Embedded files — store executables within the PDF
- External references — phone-home when opened (tracking pixel equivalent)
- Form submission — send form data to external URLs
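Mitigation: Process uploaded PDFs through a sanitisation pipeline that strips JavaScript, actions, and embedded files. Tools like QPDF can rewrite and linearise PDFs as one step in that pipeline, but confirm that the tool you choose actually removes active content rather than only restructuring the file. For preview, render to images using a sandboxed PDF renderer.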
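Before or after sanitisation, a cheap triage step is to scan the raw bytes for the PDF name tokens associated with the features listed above. The sketch below is a heuristic only: the function name `find_active_content` and the token list are assumptions, and the scan will miss tokens hidden inside compressed object streams or written with hex-escaped names. It is useful for flagging files and for checking sanitiser output, not as a replacement for sanitising.

```python
import re
from typing import Dict

# PDF name tokens tied to active content; an illustrative, non-exhaustive list.
RISKY_PDF_NAMES = [
    b"/JavaScript", b"/JS",      # embedded scripts
    b"/OpenAction", b"/AA",      # actions triggered automatically
    b"/Launch",                  # launch an external application
    b"/EmbeddedFile",            # attached files
    b"/SubmitForm", b"/URI",     # data sent to, or fetched from, external URLs
]


def find_active_content(pdf_bytes: bytes) -> Dict[str, int]:
    """Count occurrences of each risky name token in the raw PDF bytes."""
    counts = {}
    for name in RISKY_PDF_NAMES:
        # The lookahead stops /JS from matching a longer name that merely starts with it.
        pattern = re.compile(re.escape(name) + rb"(?![A-Za-z0-9])")
        hits = len(pattern.findall(pdf_bytes))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts
```

An empty result does not prove a file is safe. A non-empty result on a file that has already been through the sanitiser is a strong signal that the pipeline missed something.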
DOCX Attack Vectors
DOCX files are ZIP archives containing XML. Attack vectors include:
- VBA macros — arbitrary code execution (macro-enabled .docm)
- External template references — document loads remote template on open (NTLM hash theft)
- DDE (Dynamic Data Exchange) — execute commands without macros
- Embedded OLE objects — nested malicious files
- XML External Entity (XXE) — if your parser resolves external entities in DOCX XML
Microsoft’s documentation on macro security acknowledges that macro-based malware remains one of the most common initial access vectors.
Mitigation: Strip macros from all uploaded DOCX files. Reject .docm and .xlsm extensions entirely. Parse DOCX XML with a non-resolving XML parser (disable external entities). Block external template references.
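Because a DOCX is a ZIP package, several of these vectors can be detected by inspecting part names and relationship files before any full parsing happens. The sketch below uses only the standard library; `docx_findings` and `_external_templates` are hypothetical names, the `DDEAUTO` check is a simple byte heuristic that will not catch every DDE field, and the function assumes size and archive limits have already been enforced. A file that is not a ZIP at all raises `zipfile.BadZipFile`, which the caller should treat as a rejection.

```python
import io
import zipfile
from typing import List
from xml.etree import ElementTree


def docx_findings(data: bytes) -> List[str]:
    """Return human-readable reasons to reject or sanitise a DOCX upload."""
    findings = []
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        names = package.namelist()
        if "[Content_Types].xml" not in names or "word/document.xml" not in names:
            findings.append("missing required DOCX parts")
        for name in names:
            lowered = name.lower()
            if lowered.endswith("vbaproject.bin"):
                findings.append("VBA macro project: " + name)
            if lowered.startswith("word/embeddings/"):
                findings.append("embedded OLE object: " + name)
            if not lowered.endswith((".xml", ".rels")):
                continue
            content = package.read(name)
            if b"<!DOCTYPE" in content or b"<!ENTITY" in content:
                findings.append("DTD or entity declaration: " + name)
                continue  # never parse XML that declares entities
            if b"DDEAUTO" in content:
                findings.append("DDE field: " + name)
            if lowered.endswith(".rels"):
                findings.extend(_external_templates(name, content))
    return findings


def _external_templates(name: str, content: bytes) -> List[str]:
    """Flag relationships that attach a template from an external target."""
    try:
        root = ElementTree.fromstring(content)
    except ElementTree.ParseError:
        return ["malformed relationships part: " + name]
    return [
        "external template reference: " + name
        for rel in root.iter()
        if rel.get("TargetMode") == "External"
        and rel.get("Type", "").endswith("/attachedTemplate")
    ]
```

An empty list means none of these specific indicators were found, not that the document is safe. The stripping and re-saving steps described above still apply.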
TIFF Attack Vectors
TIFF files — common in legal scanning workflows — have historically been vulnerable to buffer overflow attacks in parsing libraries (libtiff). TIFF supports multiple pages, compression schemes, and metadata formats that expand the attack surface.
Mitigation: Process TIFFs in a sandboxed environment. Re-encode to a simpler format (PNG) for storage if multi-page support isn’t needed. Keep libtiff updated.
Upload Validation Checklist
| Check | Implementation |
|---|---|
| File size limit | Reject before full upload (Content-Length check + streaming limit) |
| Extension allowlist | Only .pdf, .docx, .xlsx, .tif/.tiff, .png, .jpg/.jpeg |
| Magic byte validation | Confirm header bytes match claimed type |
| MIME type verification | Use libmagic, not user-supplied Content-Type |
| Virus scan | ClamAV or cloud AV API on quarantined file |
| Archive depth limit | Max 2 levels of nesting for ZIP/RAR |
| Archive expansion limit | Abort if extracted size exceeds 100x compressed size |
| Macro stripping | Remove VBA/DDE from Office documents |
| PDF sanitisation | Strip JavaScript, actions, embedded files |
| Image re-encoding | Decode and re-encode images to strip embedded payloads |
| Filename sanitisation | Replace with UUID; never use user-supplied filename in filesystem paths |
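The first row of the checklist combines two checks: reject early on the declared Content-Length, then enforce the same limit while streaming, since the declared length is also just a claim. A minimal sketch, assuming the request body is exposed as a file-like object; `read_limited` and `UploadTooLarge` are hypothetical names.

```python
import io
from typing import BinaryIO, Optional


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured size limit."""


def read_limited(stream: BinaryIO, max_bytes: int,
                 declared_length: Optional[int] = None,
                 chunk_size: int = 64 * 1024) -> bytes:
    """Read an upload body, aborting as soon as it exceeds max_bytes."""
    # Cheap early rejection based on the client's own claim.
    if declared_length is not None and declared_length > max_bytes:
        raise UploadTooLarge("declared Content-Length exceeds limit")
    received = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        received += chunk
        # Enforce the limit on bytes actually received, whatever was declared.
        if len(received) > max_bytes:
            raise UploadTooLarge("body exceeded limit while streaming")
    return bytes(received)
```

In a real deployment the same limit would usually also be set at the reverse proxy, so oversized bodies are dropped before they reach application code.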
Conclusion
File uploads combine the highest risk with the highest business necessity in legal SaaS. Every document from a client or opposing party is potentially adversarial content crossing a trust boundary. The defence is layered: validate format, scan for malware, process in sandboxes, store in isolation, serve with restrictive headers, and never trust the client’s claims about what the file contains. Your document management system is only as secure as your file upload pipeline.
Next episode: Webhook Security and SSRF — when your server makes requests on behalf of external services, and attackers exploit that trust.