Episode 10 · Module 3 · App Security

File Upload Security

18 May 2026 · 10:57 · Security for Legal SaaS


Every uploaded file is an opaque blob of bytes crossing a trust boundary. Alice and Dan cover polyglot files, zip bombs, magic byte validation, sandboxed processing pipelines, storage isolation, and the specific attack vectors in PDF, DOCX, and TIFF formats that make legal SaaS — where the document management system is the product — uniquely exposed.

Today’s Lesson

Security for Legal SaaS — Episode 10: File Upload Security

Every Upload Is a Payload

File uploads are among the most dangerous features a web application can expose. A file is an opaque blob of bytes that your server must receive, store, and potentially process. OWASP identifies unrestricted file upload as a critical web application vulnerability because it can lead to remote code execution, cross-site scripting, denial of service, and complete system compromise.

For legal SaaS, file uploads are unavoidable. Lawyers upload contracts, pleadings, evidence, exhibits, expert reports. The document management system is the product. You cannot eliminate file uploads — you must make them safe.

Key principle: A file's extension and Content-Type header are claims made by the uploader. Neither can be trusted. Validation must examine the file's actual content — its magic bytes, structure, and embedded elements.

Dangerous File Types

Direct Execution Threats

File Type	Danger	Attack Scenario
`.php`, `.asp`, `.jsp`	Server-side execution	Upload web shell, access via URL, execute arbitrary commands
`.html`, `.svg`	Client-side execution	Stored XSS when served from same origin
`.exe`, `.bat`, `.ps1`	OS execution	Social engineering: “Download this exhibit”
`.jar`, `.class`	Java execution	Bypass server-side restrictions via deserialization

Polyglot Files

A polyglot is a file that is simultaneously valid as multiple file types. Researchers have demonstrated files that are valid JPEGs and valid JavaScript, valid PDFs and valid ZIP archives, and valid images that contain hidden HTML.

Attack: Upload a polyglot JPEG/JavaScript file. The server validates it as an image because the magic bytes match JPEG. If the file is later served with an incorrect Content-Type, or without X-Content-Type-Options: nosniff, and a page loads it as a script, the browser may interpret it as JavaScript and execute the attacker’s code.

Archive Bombs (Zip Bombs)

A zip bomb is a compressed file that expands to an enormous size — gigabytes or terabytes. The classic 42.zip is 42KB compressed but expands to 4.5 petabytes across nested archives.

Legal SaaS impact: If your system automatically extracts uploaded archives (common for bulk evidence uploads), a zip bomb exhausts disk space and memory, causing denial of service. Nested archives amplify the effect exponentially.
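One way to guard an extraction step is to inspect the archive's declared sizes before unpacking anything. The sketch below is a minimal illustration using Python's standard zipfile module. The function name check_archive and the three limits are hypothetical values chosen for the example, not recommendations from the episode. The declared sizes come from the archive's own headers, so a production extractor should also count bytes as it actually decompresses.

```python
import io
import zipfile

# Hypothetical limits; tune them to your own ingestion workload.
MAX_TOTAL_BYTES = 500 * 1024 * 1024
MAX_RATIO = 100
MAX_DEPTH = 2


def check_archive(data: bytes, depth: int = 0) -> int:
    """Return the total declared uncompressed size, or raise ValueError."""
    if depth > MAX_DEPTH:
        raise ValueError("archive nesting too deep")
    total = 0
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            # Compare declared sizes before any decompression happens.
            if info.file_size > MAX_RATIO * max(info.compress_size, 1):
                raise ValueError(f"suspicious compression ratio: {info.filename}")
            total += info.file_size
            if total > MAX_TOTAL_BYTES:
                raise ValueError("archive expands beyond the size limit")
            if info.filename.lower().endswith(".zip"):
                # Only read the nested archive after its declared size passed.
                total += check_archive(archive.read(info), depth + 1)
                if total > MAX_TOTAL_BYTES:
                    raise ValueError("archive expands beyond the size limit")
    return total
```

Rejecting on the declared ratio means a heavily compressed but legitimate file is also refused, which is usually an acceptable trade-off for an evidence upload path.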

Script-Containing Documents

Format	Script Capability	Example
PDF	JavaScript (Acrobat), Launch actions	Auto-execute on open; phishing overlays
DOCX	VBA macros, ActiveX controls, DDE	Remote code execution via macro-enabled documents
XLSX	Macros, external data connections, DDE	Data exfiltration via external cell references
SVG	Inline JavaScript, event handlers	XSS when rendered in browser
HTML/MHTML	Full JavaScript execution	Complete web application in a single file

Magic Byte Validation

Most binary file formats have a characteristic byte sequence at or near the beginning, known as magic bytes or a file signature. Validating these is more reliable than checking file extensions:

Format	Magic Bytes (hex)	Offset
PDF	`25 50 44 46` (`%PDF`)	0
DOCX/XLSX/PPTX	`50 4B 03 04` (ZIP header)	0
JPEG	`FF D8 FF`	0
PNG	`89 50 4E 47 0D 0A 1A 0A`	0
TIFF	`49 49 2A 00` (little-endian) or `4D 4D 00 2A` (big-endian)	0
GIF	`47 49 46 38` (`GIF8`)	0

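import magic  # python-magic, a wrapper around libmagic

def validate_file_type(file_bytes: bytes, expected_mime: str) -> bool:
    """Return True if libmagic's content-based detection matches expected_mime."""
    # Only the first 2 KiB are inspected. ZIP-based formats such as DOCX
    # may be reported as a generic type when the buffer is this small.
    detected = magic.from_buffer(file_bytes[:2048], mime=True)
    return detected == expected_mime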

Warning: Magic byte validation alone is insufficient for security. Polyglot files have valid magic bytes for their cover format. A polyglot JPEG has valid JPEG magic bytes while also containing executable JavaScript. Magic bytes confirm format but do not guarantee safety. You must also validate internal structure and strip dangerous elements.
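To make the warning concrete, the sketch below checks the signatures from the table above using only the standard library. The names SIGNATURES and sniff_signature are hypothetical. Note that the check only looks at the first few bytes, which is exactly why a file with a JPEG header followed by script content still passes.

```python
from typing import Dict, List, Optional

# Signatures taken from the table above; all sit at offset 0.
SIGNATURES: Dict[str, List[bytes]] = {
    "application/pdf": [b"%PDF"],
    "application/zip": [b"PK\x03\x04"],  # DOCX/XLSX/PPTX containers
    "image/jpeg": [b"\xff\xd8\xff"],
    "image/png": [b"\x89PNG\r\n\x1a\n"],
    "image/tiff": [b"II*\x00", b"MM\x00*"],
    "image/gif": [b"GIF8"],
}


def sniff_signature(file_bytes: bytes) -> Optional[str]:
    """Return the MIME type whose signature matches the header, else None."""
    for mime, prefixes in SIGNATURES.items():
        if any(file_bytes.startswith(prefix) for prefix in prefixes):
            return mime
    return None


# A JPEG header followed by script text still looks like a JPEG here.
suspicious = b"\xff\xd8\xff\xe0" + b"/* ... */ alert(1)"
print(sniff_signature(suspicious))
```

A result of image/jpeg for that input says nothing about what follows the header, so the signature check is a filter for obvious mismatches rather than a safety verdict.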

Sandboxed Processing

Files must be processed in isolation — never in your application’s main execution context. OWASP recommends processing uploads in sandboxed environments:

Processing Architecture

Upload → Quarantine Storage → Sandbox Worker → Validated Storage → Application

Stage	Control	Purpose
Quarantine	Separate storage, no direct URL access	Prevent execution before validation
Sandbox	Container/VM with no network, limited filesystem	Contain exploitation of parser vulnerabilities
Validation	Magic bytes + structure check + virus scan	Confirm file is what it claims
Transformation	Re-encode images, flatten PDFs, strip macros	Remove embedded threats
Validated storage	Immutable object storage with random names	Prevent path traversal and overwrite

Why Sandboxing Matters

File parsing libraries have vulnerabilities. ImageMagick’s “ImageTragick” (CVE-2016-3714) allowed remote code execution via crafted images, exposing servers that passed user-uploaded images to an unpatched ImageMagick. Ghostscript has had multiple RCE vulnerabilities in PDF processing. LibreOffice has had command injection via document conversion.

If your processing runs in a sandboxed container with no network access and minimal filesystem permissions, exploitation of a parser vulnerability is contained rather than granting the attacker your entire server.
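As an illustration of what "no network access and minimal filesystem permissions" can look like in practice, the sketch below builds a docker run command line for a worker. It assumes Docker is the container runtime. The function name sandbox_command, the image name, the resource limits, and the mount path are all hypothetical, and the function only assembles the argument list without running anything.

```python
from typing import List


def sandbox_command(image: str, job_dir: str, tool_args: List[str]) -> List[str]:
    """Build a docker run argv for processing one quarantined file."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access
        "--read-only",                          # root filesystem is read-only
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--memory", "512m",                     # hypothetical memory cap
        "--pids-limit", "64",                   # hypothetical process cap
        "--user", "65534:65534",                # run as an unprivileged user
        "--volume", f"{job_dir}:/work:rw",      # the only writable path
        "--workdir", "/work",
        image,
        *tool_args,
    ]


cmd = sandbox_command("pdf-sanitiser:latest", "/srv/quarantine/job-1",
                      ["sanitise", "input.pdf", "output.pdf"])
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run with a timeout, so that a hung parser is killed rather than holding a worker indefinitely.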

Storage Isolation

Never Store in Web-Accessible Directories

If uploaded files are stored in a directory served by your web server (e.g., /var/www/uploads/), an attacker who uploads a PHP file can access it via URL and achieve remote code execution. Store files outside the web root and serve them through a controlled endpoint that sets appropriate headers.

Storage Security Requirements

Requirement	Implementation
Random filenames	UUID or content-hash; never user-supplied names
No directory listing	Disable index pages on storage endpoints
Separate origin	Serve from `files.legalapp.com`, not `app.legalapp.com`
Content-Disposition: attachment	Forces download instead of inline rendering
X-Content-Type-Options: nosniff	Prevents MIME sniffing
No execution permissions	Storage directory has no execute bits
Access control on retrieval	Verify user authorisation before serving each file
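Two of the requirements above, random names and restrictive response headers, can be sketched with the standard library alone. The function names new_storage_key and download_headers are hypothetical, and the key layout simply mirrors a matters/<matter>/<file> pattern. The sketch assumes the matter identifier is generated by the server, and that the authorisation check happens before these helpers are called.

```python
import uuid
from typing import Dict
from urllib.parse import quote


def new_storage_key(matter_id: str) -> str:
    """Build a storage key from a random UUID, never from the uploaded name."""
    return f"matters/{matter_id}/{uuid.uuid4().hex}"


def download_headers(detected_mime: str, display_name: str) -> Dict[str, str]:
    """Headers for serving a stored file as a download.

    detected_mime should come from server-side detection. display_name is
    the original filename, used only as a percent-encoded label.
    """
    safe_name = quote(display_name, safe="")
    return {
        "Content-Type": detected_mime,
        "Content-Disposition": f"attachment; filename*=UTF-8''{safe_name}",
        "X-Content-Type-Options": "nosniff",
    }


print(new_storage_key("m-123"))
print(download_headers("application/pdf", "Exhibit A (final).pdf"))
```

Percent-encoding the display name keeps quotes and line breaks in a user-supplied filename from altering the header.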

Object Storage (S3-Pattern)

Cloud object storage (AWS S3, GCS, Azure Blob) provides inherent isolation — files cannot be “executed” by the storage service. Use pre-signed URLs with short expiry for authorised access:

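import boto3

s3_client = boto3.client('s3')

# matter_id and file_id are assumed to be server-generated identifiers
# Generate pre-signed URL, valid for 5 minutes
url = s3_client.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'legal-docs', 'Key': f'matters/{matter_id}/{file_id}'},
    ExpiresIn=300
)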

Legal Document Format Attack Vectors

PDF Attacks

PDFs are the most common document format in legal practice and one of the most complex file formats in existence. The PDF specification (ISO 32000) supports:

- JavaScript — auto-execute on open, form validation, dynamic content

- Launch actions — open external applications

- Embedded files — store executables within the PDF

- External references — phone-home when opened (tracking pixel equivalent)

- Form submission — send form data to external URLs

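Mitigation: Process uploaded PDFs through a sanitisation pipeline that strips JavaScript, actions, and embedded files. Tools like QPDF can rewrite and normalise a PDF's structure, but linearisation is a web-delivery optimisation rather than a sanitisation step, so verify that the pipeline actually removes the elements listed above. For preview, render to images using a sandboxed PDF renderer.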
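Before or after sanitisation, a quick triage pass can count the PDF name tokens associated with the features listed above. The sketch below is a heuristic only, and the names RISKY_NAMES and count_risky_pdf_names are hypothetical. PDF names can be hex-escaped or hidden inside compressed object streams, so a zero count is not proof that a file is clean. It is better suited to confirming that a sanitiser removed what it was supposed to.

```python
import re
from typing import Dict

# PDF name tokens for scripts, auto-run actions, attachments, and form posts.
RISKY_NAMES = [
    b"/JavaScript", b"/JS", b"/Launch", b"/OpenAction", b"/AA",
    b"/EmbeddedFile", b"/EmbeddedFiles", b"/SubmitForm",
]


def count_risky_pdf_names(pdf_bytes: bytes) -> Dict[str, int]:
    """Count occurrences of each risky name in the raw bytes of a PDF."""
    counts: Dict[str, int] = {}
    for name in RISKY_NAMES:
        # The lookahead stops /AA from matching a longer name such as /AAPL.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, pdf_bytes))
    return counts


sample = (b"%PDF-1.7\n1 0 obj<</OpenAction 2 0 R>>endobj\n"
          b"2 0 obj<</S/JavaScript/JS(app.alert(1))>>endobj\n")
flagged = {k: v for k, v in count_risky_pdf_names(sample).items() if v}
print(flagged)
```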

DOCX Attack Vectors

DOCX files are ZIP archives containing XML. Attack vectors include:

- VBA macros — arbitrary code execution (macro-enabled .docm)

- External template references — document loads remote template on open (NTLM hash theft)

- DDE (Dynamic Data Exchange) — execute commands without macros

- Embedded OLE objects — nested malicious files

- XML External Entity (XXE) — if your parser resolves external entities in DOCX XML

Microsoft’s documentation on macro security acknowledges that macro-based malware remains one of the most common initial access vectors.

Mitigation: Strip macros from all uploaded DOCX files. Reject .docm and .xlsm extensions entirely. Parse DOCX XML with a non-resolving XML parser (disable external entities). Block external template references.
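Because a DOCX file is a ZIP package, several of the vectors above can be detected by looking at part names and relationship files. The sketch below is detection only, not stripping, and the name docx_findings is hypothetical. It scans relationship parts as raw bytes instead of parsing them, so no XML entities are ever resolved. Ordinary hyperlinks are also stored as external relationships, so a real check would inspect the relationship type before rejecting a file. The sketch assumes archive size limits have already been enforced.

```python
import io
import re
import zipfile
from typing import List

# Matches TargetMode="External" with either quote style.
EXTERNAL_TARGET = re.compile(rb"TargetMode\s*=\s*[\"']External[\"']")


def docx_findings(docx_bytes: bytes) -> List[str]:
    """List macro projects, embedded objects, and external relationships."""
    findings: List[str] = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        for info in package.infolist():
            name = info.filename
            lowered = name.lower()
            if lowered.endswith("vbaproject.bin"):
                findings.append(f"macro project: {name}")
            elif "/embeddings/" in lowered:
                findings.append(f"embedded object: {name}")
            elif lowered.endswith(".rels"):
                if EXTERNAL_TARGET.search(package.read(info)):
                    findings.append(f"external relationship: {name}")
    return findings
```

An empty list means none of these markers were found. Any finding can route the file to stripping or rejection inside the sandbox worker.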

TIFF Attack Vectors

TIFF files are common in legal scanning workflows, and the libraries that parse them (notably libtiff) have a history of buffer overflow vulnerabilities. TIFF supports multiple pages, compression schemes, and metadata formats, all of which expand the attack surface.

Mitigation: Process TIFFs in a sandboxed environment. Re-encode to a simpler format (PNG) for storage if multi-page support isn’t needed. Keep libtiff updated.

Upload Validation Checklist

Check	Implementation
File size limit	Reject before full upload (Content-Length check + streaming limit)
Extension allowlist	Only `.pdf`, `.docx`, `.xlsx`, `.tiff`, `.png`, `.jpg`
Magic byte validation	Confirm header bytes match claimed type
MIME type verification	Use `libmagic`, not user-supplied Content-Type
Virus scan	ClamAV or cloud AV API on quarantined file
Archive depth limit	Max 2 levels of nesting for ZIP/RAR
Archive expansion limit	Abort if extracted size exceeds 100x compressed size
Macro stripping	Remove VBA/DDE from Office documents
PDF sanitisation	Strip JavaScript, actions, embedded files
Image re-encoding	Decode and re-encode images to strip embedded payloads
Filename sanitisation	Replace with UUID; never use user-supplied filename in filesystem paths
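The first item in the checklist, a streaming size limit, matters because a Content-Length header is itself a claim made by the client. A minimal sketch of the streaming half follows. The name read_limited and the 50 MiB default are hypothetical, and the stream is assumed to be any binary file-like object your framework exposes for the request body.

```python
import io
from typing import BinaryIO, List

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # hypothetical limit
CHUNK_SIZE = 64 * 1024


def read_limited(stream: BinaryIO, limit: int = MAX_UPLOAD_BYTES) -> bytes:
    """Read a stream in chunks, raising ValueError once the limit is exceeded."""
    chunks: List[bytes] = []
    received = 0
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        received += len(chunk)
        if received > limit:
            # Stop reading immediately instead of buffering the whole body.
            raise ValueError("upload exceeds size limit")
        chunks.append(chunk)
    return b"".join(chunks)


body = read_limited(io.BytesIO(b"small upload"), limit=1024)
print(len(body))
```

In a real pipeline the chunks would more likely be written straight to quarantine storage than held in memory, with the same running count deciding when to abort.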

Conclusion

File uploads combine the highest risk with the highest business necessity in legal SaaS. Every document from a client or opposing party is potentially adversarial content crossing a trust boundary. The defence is layered: validate format, scan for malware, process in sandboxes, store in isolation, serve with restrictive headers, and never trust the client’s claims about what the file contains. Your document management system is only as secure as your file upload pipeline.

Next episode: Webhook Security and SSRF — when your server makes requests on behalf of external services, and attackers exploit that trust.

Alice: Welcome back to Security for Legal SaaS. I’m Alice.

Dan: And I’m Dan. Episode 10 — File Upload Security. Alice, legal tech is fundamentally a document management business. Lawyers upload files constantly. Contracts, pleadings, exhibits, evidence packages. So what’s the threat here?

Alice: Every uploaded file is an opaque blob of bytes crossing a trust boundary into your system. The file claims to be a PDF. The Content-Type header says application/pdf. A MIME type — Multipurpose Internet Mail Extensions type — is just a label declaring the file’s format. The extension says .pdf. But none of that is verified until you examine the actual bytes. A file is whatever its content makes it — not whatever the uploader claims it is.

Dan: So the extension and MIME type are essentially just labels the attacker controls?

Alice: Exactly. An attacker can rename a PHP web shell to evidence.pdf, set the Content-Type to application/pdf, and upload it. If your server stores it in a web-accessible directory and serves it based on extension alone, they navigate to the URL and the web server executes it as PHP. They now have remote code execution — the ability to run arbitrary commands on your server as if they were sitting at the keyboard.

Dan: That sounds almost too simple. Surely modern applications don’t make that mistake?

Alice: The basic version is less common now. But the attack has evolved. Polyglot files — files that are simultaneously valid as multiple types. A file that has valid JPEG magic bytes at the start, passes image validation, but also contains executable JavaScript. Served without the right headers, the browser may interpret it as a script instead of an image. PortSwigger demonstrated bypassing Content Security Policy — the browser-enforced rules about what scripts can run, which we’ll cover in depth in Episode 12 — using polyglot JPEGs.

Dan: And then there are archive bombs?

Alice: Zip bombs. A 42-kilobyte ZIP file that expands to 4.5 petabytes across nested layers. If your legal SaaS automatically extracts uploaded archives — which is common for bulk evidence ingestion — that expansion exhausts disk space, memory, or both. Your entire platform goes down. Denial of service from a single 42KB upload.

Dan: OK, so what’s the validation approach? You said magic bytes?

Alice: Magic bytes are the first line of defence. Every file format has a characteristic byte sequence at the beginning. PDF starts with percent-P-D-F. JPEG starts with FF-D8-FF. PNG has an 8-byte signature. DOCX files are actually ZIP archives, so they start with PK-03-04. You read the first few bytes and confirm they match the claimed format. The python-magic library wraps libmagic, which identifies files by content rather than extension.

Dan: But you said polyglots have valid magic bytes for their cover format. So magic bytes alone aren’t enough.

Alice: Correct. Magic bytes confirm format identity but not safety. A polyglot JPEG has valid JPEG magic bytes. A PDF with embedded JavaScript has valid PDF magic bytes. You need deeper structural validation. For images — decode and re-encode. If you receive an image, run it through your image library, decode it to raw pixels, then encode it as a new JPEG or PNG. Any embedded payloads in the original file’s metadata or trailing data are destroyed in the re-encoding. What you store is a clean image you generated, not the attacker’s file.

Dan: That makes sense. What about documents? You can’t re-encode a contract PDF without potentially losing content.

Alice: For PDFs, you sanitise. Strip JavaScript — legal documents have no legitimate need for executable code. Strip launch actions, embedded files, form submission URLs. QPDF can linearise and clean PDFs. For DOCX files — strip VBA macros unconditionally. Reject macro-enabled formats like .docm and .xlsm at the allowlist level. Parse the internal XML with external entities disabled to prevent XXE — XML External Entity — attacks, where the parser is tricked into loading files or resources it shouldn’t. Remove external template references that could phone home or steal NTLM hashes.

Dan: That’s a lot of processing. How do you do it safely? What if the parsing library itself has vulnerabilities?

Alice: Which it will. ImageMagick had ImageTragick in 2016 — remote code execution via crafted images. Ghostscript had a remote code execution flaw in 2023 via PDF processing. LibTIFF has had buffer overflows. The answer is sandboxed processing. Every uploaded file goes through a pipeline: quarantine storage first — the file sits where it can’t be accessed via URL. Then a worker process in an isolated container — no network access, limited filesystem, minimal permissions — performs all validation and transformation. If the processing triggers a vulnerability, the attacker gets control of an empty sandbox with no access to your application or data.

Dan: Containers specifically? Or VMs?

Alice: Either works. The principle is isolation. A container with no network access, a read-only filesystem except for its working directory, dropped capabilities, and seccomp filtering is sufficient for most threat models. If you’re processing files from nation-state adversaries, which some law firms do, Firecracker microVMs or a user-space kernel sandbox such as gVisor provide stronger boundaries.

Dan: Once a file passes validation, how do you store it?

Alice: Three rules. First — random filenames. Generate a UUID — a universally unique identifier, a long random string — or content hash. Never use the user-supplied filename in your filesystem path. User-supplied filenames can contain path traversal sequences — dot-dot-slash — that overwrite system files. Second — store outside the web root. Files should never be directly accessible via URL. Serve them through a controlled endpoint that checks authorisation before each retrieval. Third — serve from a different origin. If your app lives at app.legalapp.com, serve files from files.legalapp.com. Same-origin policy prevents any script in a served file from accessing your application’s cookies or DOM.

Dan: What about the response headers when serving files?

Alice: Critical. Content-Disposition: attachment forces the browser to download rather than render inline — so even if the file contains HTML or scripts, it won’t execute in the browser context. X-Content-Type-Options: nosniff prevents the browser from guessing a file type different from the Content-Type you set. And always set an explicit, correct Content-Type based on your server-side detection, not the stored original claim.

Dan: Let’s talk about legal-specific formats. PDFs are the obvious one.

Alice: PDFs are probably the most complex file format in common use. The specification — ISO 32000 — supports JavaScript execution, launching external applications, embedded files, form submission to external URLs, and encrypted payloads. A PDF from opposing counsel could track when and where it was opened via an external reference — a digital tracking pixel. It could exploit a reader vulnerability for code execution. It could contain a phishing overlay that mimics your firm’s login page.

Dan: And DOCX files?

Alice: DOCX is a ZIP archive containing XML files. VBA macros — Visual Basic for Applications, small programs embedded inside documents — are the classic vector — arbitrary code execution if the user enables macros. But there are subtler attacks. External template references — the document silently loads a template from a remote server when opened, sending the user’s NTLM hash — a Windows authentication credential — in the authentication handshake. DDE fields — Dynamic Data Exchange — can execute system commands without macros. Even the XML itself can contain XXE payloads if your parser resolves external entities.

Dan: And TIFF? Those are common in legal scanning workflows.

Alice: Multi-page TIFFs are standard output from document scanners used in legal firms. The libtiff library has had recurring buffer overflow vulnerabilities — where malformed data overruns a program’s memory boundaries, letting attackers hijack execution. Malformed TIFF headers crash the parser in exploitable ways. Process them in your sandbox, keep the library updated, and consider re-encoding to PNG for storage if you don’t need the multi-page container format.

Dan: So the complete picture: allowlist permitted extensions, validate magic bytes, scan for malware, process in an isolated sandbox, strip dangerous elements format by format, store with random names outside the web root, serve from a separate origin with restrictive headers, and check authorisation on every retrieval.

Alice: Every step matters. Skip the sandbox and a parser vulnerability gives the attacker your server. Skip the separate origin and a polyglot gives them XSS. Skip the authorisation check on retrieval and any authenticated user can download any document. File upload security is the entire pipeline — not any single check.

Dan: Next episode — Webhook Security and SSRF. What happens when your server makes outbound requests on behalf of external services, and attackers exploit that trust relationship.

Alice: Until then, I’m Alice.

Dan: And I’m Dan.

Alice: Security for Legal SaaS is a series written with AI assistance. Alice and Dan are AI-generated voices — no professional advice here, just education.
