Security for Legal SaaS

Episode 10 · Module 3 · App Security

File Upload Security

18 May 2026 · 10:57 · Security for Legal SaaS

Every uploaded file is an opaque blob of bytes crossing a trust boundary. Alice and Dan cover polyglot files, zip bombs, magic byte validation, sandboxed processing pipelines, storage isolation, and the specific attack vectors in PDF, DOCX, and TIFF formats that make legal SaaS — where the document management system is the product — uniquely exposed.

Today’s Lesson

Every Upload Is a Payload

File uploads are among the most dangerous features in any web application. A file is an opaque blob of bytes that your server must receive, store, and potentially process. OWASP identifies unrestricted file upload as a critical web application vulnerability because it can lead to remote code execution, cross-site scripting, denial of service, and complete system compromise.

For legal SaaS, file uploads are unavoidable. Lawyers upload contracts, pleadings, evidence, exhibits, expert reports. The document management system is the product. You cannot eliminate file uploads — you must make them safe.

Key principle: A file's extension and Content-Type header are claims made by the uploader. Neither can be trusted. Validation must examine the file's actual content — its magic bytes, structure, and embedded elements.

Dangerous File Types

Direct Execution Threats

| File Type | Danger | Attack Scenario |
|---|---|---|
| .php, .asp, .jsp | Server-side execution | Upload web shell, access via URL, execute arbitrary commands |
| .html, .svg | Client-side execution | Stored XSS when served from same origin |
| .exe, .bat, .ps1 | OS execution | Social engineering: “Download this exhibit” |
| .jar, .class | Java execution | Bypass server-side restrictions via deserialization |

Polyglot Files

A polyglot is a file that is simultaneously valid as multiple file types. Researchers have demonstrated files that are valid JPEGs and valid JavaScript, valid PDFs and valid ZIP archives, and valid images that contain hidden HTML.

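Attack: Upload a polyglot JPEG/JavaScript file. The server validates it as an image because the magic bytes match JPEG. If the attacker can then get a page to load the file as a script, for example through an injected script tag pointing at the upload URL, the browser may execute it as JavaScript, which also sidesteps a Content Security Policy that trusts the upload origin. Serving the file with the correct image Content-Type and X-Content-Type-Options: nosniff makes browsers refuse to run it as a script.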

Archive Bombs (Zip Bombs)

A zip bomb is a compressed file that expands to an enormous size: gigabytes, terabytes, or more. The classic 42.zip is about 42 KB compressed but expands to roughly 4.5 petabytes across nested archives.

Legal SaaS impact: If your system automatically extracts uploaded archives (common for bulk evidence uploads), a zip bomb exhausts disk space and memory, causing denial of service. Nested archives amplify the effect exponentially.
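
A minimal sketch of an expansion guard, using only Python's standard zipfile module. The limits and the names check_archive and ArchiveRejected are illustrative assumptions, not values from this lesson. It counts bytes while decompressing instead of trusting the sizes declared in the archive, and it does not recurse into nested archives; a real pipeline would add a depth counter for that.

```python
import io
import zipfile

MAX_TOTAL_BYTES = 200 * 1024 * 1024   # hypothetical cap on total extracted size
MAX_RATIO = 100                        # hypothetical cap on expansion ratio
MAX_ENTRIES = 10_000                   # hypothetical cap on number of members


class ArchiveRejected(Exception):
    """Raised when an archive exceeds one of the configured limits."""


def check_archive(data: bytes) -> int:
    """Decompress every member in 64 KiB chunks and return the total byte count.

    Bytes are counted while streaming rather than read from the archive
    headers, because header fields are supplied by the uploader.
    """
    total = 0
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = archive.infolist()
        if len(members) > MAX_ENTRIES:
            raise ArchiveRejected("too many entries")
        for info in members:
            with archive.open(info) as member:
                while chunk := member.read(64 * 1024):
                    total += len(chunk)
                    if total > MAX_TOTAL_BYTES:
                        raise ArchiveRejected("extracted size limit exceeded")
                    if total > MAX_RATIO * max(len(data), 1):
                        raise ArchiveRejected("expansion ratio limit exceeded")
    return total
```

Because the check aborts as soon as a limit is crossed, a bomb is rejected after a few chunks rather than after it has filled the disk.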

Script-Containing Documents

| Format | Script Capability | Example |
|---|---|---|
| PDF | JavaScript (Acrobat), Launch actions | Auto-execute on open; phishing overlays |
| DOCX | VBA macros, ActiveX controls, DDE | Remote code execution via macro-enabled documents |
| XLSX | Macros, external data connections, DDE | Data exfiltration via external cell references |
| SVG | Inline JavaScript, event handlers | XSS when rendered in browser |
| HTML/MHTML | Full JavaScript execution | Complete web application in a single file |

Magic Byte Validation

Most binary file formats begin with a characteristic byte sequence, known as magic bytes or a file signature. Checking these is more reliable than checking file extensions:

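| Format | Magic Bytes (hex) | Offset |
|---|---|---|
| PDF | 25 50 44 46 (%PDF) | 0 |
| DOCX/XLSX/PPTX | 50 4B 03 04 (ZIP header) | 0 |
| JPEG | FF D8 FF | 0 |
| PNG | 89 50 4E 47 0D 0A 1A 0A | 0 |
| TIFF | 49 49 2A 00 (little-endian) or 4D 4D 00 2A (big-endian) | 0 |
| GIF | 47 49 46 38 (GIF8) | 0 |

import magic  # python-magic, a wrapper around libmagic

def validate_file_type(file_bytes: bytes, expected_mime: str) -> bool:
    """Validate that file content matches the expected MIME type using libmagic."""
    # python-magic's documentation recommends passing at least the first 2048 bytes.
    detected = magic.from_buffer(file_bytes[:2048], mime=True)
    # Compare against the type the application expects, never the client's Content-Type.
    return detected == expected_mime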

Warning: Magic byte validation alone is insufficient for security. Polyglot files have valid magic bytes for their cover format. A polyglot JPEG has valid JPEG magic bytes while also containing executable JavaScript. Magic bytes confirm format but do not guarantee safety. You must also validate internal structure and strip dangerous elements.
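
To make the warning concrete, here is a small standard-library sketch that checks a buffer against the signatures listed in this section. The names SIGNATURES and matches_signature, and the toy payload, are assumptions for illustration; the payload is not a working polyglot, only bytes that start like a JPEG and continue with script text.

```python
SIGNATURES = {
    "application/pdf": [b"%PDF"],
    "application/zip": [b"PK\x03\x04"],          # DOCX/XLSX/PPTX containers
    "image/jpeg": [b"\xff\xd8\xff"],
    "image/png": [b"\x89PNG\r\n\x1a\n"],
    "image/tiff": [b"II*\x00", b"MM\x00*"],      # little-endian, big-endian
    "image/gif": [b"GIF8"],
}


def matches_signature(data: bytes, mime: str) -> bool:
    """Return True if data starts with a known signature for the given MIME type."""
    return any(data.startswith(sig) for sig in SIGNATURES.get(mime, []))


# Toy example: a JPEG start-of-image marker followed by script text.
fake_polyglot = b"\xff\xd8\xff\xe0" + b"=1;alert(1)"

# The header check passes even though the rest of the file is not image data.
header_ok = matches_signature(fake_polyglot, "image/jpeg")
```

The check only looks at the first few bytes, which is why it has to be followed by structural validation and re-encoding.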

Sandboxed Processing

Files must be processed in isolation — never in your application’s main execution context. OWASP recommends processing uploads in sandboxed environments:

Processing Architecture

Upload → Quarantine Storage → Sandbox Worker → Validated Storage → Application

| Stage | Control | Purpose |
|---|---|---|
| Quarantine | Separate storage, no direct URL access | Prevent execution before validation |
| Sandbox | Container/VM with no network, limited filesystem | Contain exploitation of parser vulnerabilities |
| Validation | Magic bytes + structure check + virus scan | Confirm file is what it claims |
| Transformation | Re-encode images, flatten PDFs, strip macros | Remove embedded threats |
| Validated storage | Immutable object storage with random names | Prevent path traversal and overwrite |

Why Sandboxing Matters

File parsing libraries have vulnerabilities. ImageMagick’s “ImageTragick” (CVE-2016-3714) allowed remote code execution via crafted images on unpatched servers that passed user-uploaded images to ImageMagick. Ghostscript has had multiple RCE vulnerabilities in PostScript and PDF processing. LibreOffice has also had code execution vulnerabilities triggered by crafted documents.

If your processing runs in a sandboxed container with no network access and minimal filesystem permissions, exploitation of a parser vulnerability is contained rather than granting the attacker your entire server.
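
As a rough sketch of the worker side, the helper below launches a parser command in a throwaway directory with a timeout and a stripped-down environment. The name run_isolated and the environment allowlist are assumptions, and the sketch targets POSIX systems. It only narrows what the child process inherits; the network and filesystem isolation described above still has to come from the surrounding container or VM.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical allowlist: only these variables are passed to the child process.
ENV_ALLOWLIST = ("PATH", "LANG", "LC_ALL", "LD_LIBRARY_PATH")


def run_isolated(argv: list[str], timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run a command in a scratch directory without the parent's secrets.

    This is NOT a sandbox by itself; it assumes a container or VM around it.
    """
    env = {k: os.environ[k] for k in ENV_ALLOWLIST if k in os.environ}
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            argv,
            cwd=scratch,            # scratch directory is deleted afterwards
            env=env,                # no inherited API keys or database passwords
            capture_output=True,
            timeout=timeout_s,      # raises subprocess.TimeoutExpired on a hang
            check=False,
        )
```

A worker might call run_isolated with the converter's command line and treat a timeout or non-zero exit code as a reason to keep the file in quarantine.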

Storage Isolation

Never Store in Web-Accessible Directories

If uploaded files are stored in a directory served by your web server (e.g., /var/www/uploads/), an attacker who uploads a PHP file can access it via URL and achieve remote code execution. Store files outside the web root and serve them through a controlled endpoint that sets appropriate headers.

Storage Security Requirements

| Requirement | Implementation |
|---|---|
| Random filenames | UUID or content-hash; never user-supplied names |
| No directory listing | Disable index pages on storage endpoints |
| Separate origin | Serve from files.legalapp.com, not app.legalapp.com |
| Content-Disposition: attachment | Forces download instead of inline rendering |
| X-Content-Type-Options: nosniff | Prevents MIME sniffing |
| No execution permissions | Storage directory has no execute bits |
| Access control on retrieval | Verify user authorisation before serving each file |
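
Two of these requirements can be sketched in a few lines: generating a storage key that ignores the uploaded filename, and building the response headers for a download endpoint. The function names are hypothetical, and matter_id is assumed to come from the application's own records rather than from the request.

```python
import uuid


def new_storage_key(matter_id: str) -> str:
    """Build an object key from a random UUID, ignoring the uploaded filename."""
    return f"matters/{matter_id}/{uuid.uuid4().hex}"


def download_headers(validated_mime: str) -> dict[str, str]:
    """Response headers for serving a stored file as a download."""
    return {
        "Content-Type": validated_mime,            # from server-side detection
        "Content-Disposition": "attachment",       # download, never render inline
        "X-Content-Type-Options": "nosniff",       # disable MIME sniffing
    }
```

The authorisation check from the last table row belongs in the endpoint itself, before either helper is called.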

Object Storage (S3-Pattern)

Cloud object storage (AWS S3, GCS, Azure Blob) provides inherent isolation — files cannot be “executed” by the storage service. Use pre-signed URLs with short expiry for authorised access:

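import boto3

s3_client = boto3.client("s3")

def presigned_download_url(matter_id: str, file_id: str) -> str:
    """Return a pre-signed URL that expires in 5 minutes; authorise the user first."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": "legal-docs", "Key": f"matters/{matter_id}/{file_id}"},
        ExpiresIn=300,
    )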

Legal Document Format Attack Vectors

PDF Attacks

PDFs are the most common document format in legal practice and one of the most complex file formats in existence. The PDF specification (ISO 32000) supports:

- JavaScript — auto-execute on open, form validation, dynamic content

- Launch actions — open external applications

- Embedded files — store executables within the PDF

- External references — phone-home when opened (tracking pixel equivalent)

- Form submission — send form data to external URLs

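Mitigation: Process uploaded PDFs through a sanitisation pipeline that strips JavaScript, actions, and embedded files. Tools like QPDF can rewrite and normalise a PDF's structure, which makes such objects easier to inspect and remove, but linearising a file does not by itself strip active content. For preview, render to images using a sandboxed PDF renderer.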
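
Before sanitisation, a quarantine worker can do a cheap triage pass for the features listed above. The sketch below is a naive byte scan with a hypothetical function name. It will miss names hidden inside compressed object streams or written with hex escapes, and it can flag harmless files, so it is a way to prioritise files, not a substitute for stripping active content.

```python
RISKY_PDF_NAMES = (
    b"/JavaScript", b"/JS", b"/Launch", b"/OpenAction",
    b"/AA", b"/EmbeddedFile", b"/SubmitForm",
)


def find_risky_pdf_names(data: bytes) -> list[str]:
    """Return the risky PDF name tokens that appear verbatim in the raw bytes."""
    return [name.decode("ascii") for name in RISKY_PDF_NAMES if name in data]


sample = (
    b"%PDF-1.7\n1 0 obj\n"
    b"<< /OpenAction << /S /JavaScript /JS (app.alert(1)) >> >>\nendobj\n"
)
flags = find_risky_pdf_names(sample)
```

An empty result means only that nothing was found in plain sight; the file should still go through the sanitisation step.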

DOCX Attack Vectors

DOCX files are ZIP archives containing XML. Attack vectors include:

- VBA macros — arbitrary code execution (macro-enabled .docm)

- External template references — document loads remote template on open (NTLM hash theft)

- DDE (Dynamic Data Exchange) — execute commands without macros

- Embedded OLE objects — nested malicious files

- XML External Entity (XXE) — if your parser resolves external entities in DOCX XML

Microsoft’s documentation on macro security acknowledges that macro-based malware remains one of the most common initial access vectors.

Mitigation: Strip macros from all uploaded DOCX files. Reject .docm and .xlsm extensions entirely. Parse DOCX XML with a non-resolving XML parser (disable external entities). Block external template references.
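
The checks in this mitigation can be approximated without parsing any XML at all, which avoids the XXE question during triage. The sketch below, with the hypothetical name inspect_docx, opens the container with the standard zipfile module and looks for a VBA project, DTD or entity declarations, and external relationships. It assumes archive size limits have already been enforced. The byte matching is deliberately simple, so it also flags ordinary external hyperlinks and can be evaded by unusual quoting or encodings.

```python
import io
import zipfile


def inspect_docx(data: bytes) -> list[str]:
    """Return a list of findings for a DOCX container without parsing its XML."""
    findings = []
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        for info in package.infolist():
            name = info.filename
            lower = name.lower()
            if lower.endswith("vbaproject.bin"):
                findings.append(f"macro project: {name}")
            if lower.endswith((".xml", ".rels")):
                part = package.read(info)
                if b"<!DOCTYPE" in part or b"<!ENTITY" in part:
                    findings.append(f"DTD or entity declaration: {name}")
                if lower.endswith(".rels") and b'TargetMode="External"' in part:
                    findings.append(f"external relationship: {name}")
    return findings
```

Any finding is a reason to reject the file or route it to the stripping step; an empty list is not proof that the document is safe.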

TIFF Attack Vectors

TIFF files — common in legal scanning workflows — have historically been vulnerable to buffer overflow attacks in parsing libraries (libtiff). TIFF supports multiple pages, compression schemes, and metadata formats that expand the attack surface.

Mitigation: Process TIFFs in a sandboxed environment. Re-encode to a simpler format (PNG) for storage if multi-page support isn’t needed. Keep libtiff updated.

Upload Validation Checklist

| Check | Implementation |
|---|---|
| File size limit | Reject before full upload (Content-Length check + streaming limit) |
| Extension allowlist | Only .pdf, .docx, .xlsx, .tiff, .png, .jpg |
| Magic byte validation | Confirm header bytes match claimed type |
| MIME type verification | Use libmagic, not user-supplied Content-Type |
| Virus scan | ClamAV or cloud AV API on quarantined file |
| Archive depth limit | Max 2 levels of nesting for ZIP/RAR |
| Archive expansion limit | Abort if extracted size exceeds 100x compressed size |
| Macro stripping | Remove VBA/DDE from Office documents |
| PDF sanitisation | Strip JavaScript, actions, embedded files |
| Image re-encoding | Decode and re-encode images to strip embedded payloads |
| Filename sanitisation | Replace with UUID; never use user-supplied filename in filesystem paths |
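
The first two rows of the checklist can be sketched with the standard library alone. The 50 MB limit and the function names are assumptions for illustration. The extension check looks only at the final suffix, which is acceptable here because the stored object is renamed to a UUID and the content is validated separately.

```python
import io
from pathlib import PurePosixPath
from typing import BinaryIO

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".tiff", ".png", ".jpg"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # hypothetical limit


def extension_allowed(client_filename: str) -> bool:
    """Check the final suffix, case-insensitively, against the allowlist."""
    return PurePosixPath(client_filename).suffix.lower() in ALLOWED_EXTENSIONS


def read_limited(stream: BinaryIO, limit: int = MAX_UPLOAD_BYTES) -> bytes:
    """Read the upload in chunks and stop as soon as it exceeds the limit."""
    chunks, total = [], 0
    while chunk := stream.read(64 * 1024):
        total += len(chunk)
        if total > limit:
            raise ValueError("upload exceeds size limit")
        chunks.append(chunk)
    return b"".join(chunks)
```

Counting bytes while reading matters because the Content-Length header is another claim made by the client and may be missing or wrong.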

Conclusion

File uploads combine the highest risk with the highest business necessity in legal SaaS. Every document from a client or opposing party is potentially adversarial content crossing a trust boundary. The defence is layered: validate format, scan for malware, process in sandboxes, store in isolation, serve with restrictive headers, and never trust the client’s claims about what the file contains. Your document management system is only as secure as your file upload pipeline.

Next episode: Webhook Security and SSRF — when your server makes requests on behalf of external services, and attackers exploit that trust.

Sources & references

  1. OWASP, "Unrestricted File Upload." Vulnerability classification and impacts
  2. OWASP, "File Upload Cheat Sheet." Comprehensive upload security guidance
  3. PortSwigger Research, "Bypassing CSP using polyglot JPEGs"
  4. Wikipedia, "Zip bomb." Archive decompression attacks
  5. Wikipedia, "List of file signatures." Magic bytes reference
  6. ImageTragick, CVE-2016-3714. Remote code execution via image processing
  7. NVD, CVE-2023-36664. Ghostscript RCE via crafted PDF files
  8. AWS, "Amazon S3 Security Best Practices"
  9. ISO 32000-2:2020, "Document management — Portable document format"
  10. QPDF, "PDF transformation and inspection tool"
  11. Microsoft, "Macro malware." Macro-based attack vectors
  12. NVD, CVE-2023-1916. LibTIFF buffer overflow vulnerability