Episode 39 · Module 8 · AI Security

Redaction Pipelines for Cloud AI

19 May 2026 · 8:57 · Security for Legal SaaS

In Episode 38, we built the inference gateway that controls access to cloud AI providers. This episode addresses what flows through that gateway: the actual content of your prompts. Before you send a contract, a legal memo, or a case summary to a cloud LLM, you should ask a simple question — does the AI provider actually need to see your client's name, their social security number, the dollar amount of the deal, or the names of the parties in litigation?

Does the Cloud Need to See Your Client's Name?

Usually, the answer is no. A cloud LLM can summarise a contract, extract obligations, or review clause language without knowing that the parties are Acme Corp and Baker Industries. It just needs the structure and the text. Everything else is a leak waiting to happen.

Why Redaction Matters for Legal AI

When you send a prompt to OpenAI, Anthropic, Google, or any other cloud LLM provider, that data leaves your network and enters the provider's infrastructure. Even with zero-data-retention agreements, the data traverses networks you don't control, is processed on hardware you don't own, and is subject to the provider's security posture — not yours.¹

For legal data, this creates specific risks:

Risk	Description	Legal Consequence
Privilege waiver	Privileged content shared with a third-party service	Potential waiver of attorney-client privilege (see EP36, Heppner)
Confidentiality breach	Client PII or deal terms accessible to the provider	Violation of ABA Model Rule 1.6(c), data protection regulations
Cross-contamination	Provider uses your data to improve models (if not contractually prohibited)	Client data influences outputs for other users
Subpoena exposure	Provider served with subpoena for data they hold	Your client's data produced in litigation involving the provider, not your client

ABA Formal Opinion 512 specifically requires lawyers to understand "how GAI uses data" and to implement "adequate safeguards to ensure that data processed by GAI is secure."² Technical redaction is one of the most direct safeguards available.

Named Entity Recognition: The Core Technology

Named Entity Recognition — NER — is a natural language processing technique that identifies and classifies specific entities in text: person names, organisation names, locations, dates, financial amounts, identification numbers, and other structured data. NER is the engine that powers automated redaction.

Modern NER systems use multiple approaches in combination:³

Approach	How It Works	Strengths	Weaknesses
Rule-based (regex)	Pattern matching for structured data (SSNs, phone numbers, emails)	High precision on structured formats	Cannot handle unstructured names or context-dependent entities
Statistical NER models	Machine learning models trained on labelled text (spaCy, Flair)	Good general-purpose entity detection	May miss domain-specific entities (case numbers, matter IDs)
Transformer-based NER	Deep learning models (BERT, GLiNER) fine-tuned for entity recognition	Highest accuracy, context-aware	Computationally expensive; requires GPU for real-time use
Hybrid	Combines rules, statistical models, and transformers	Best overall performance	More complex to maintain and tune
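
As a minimal sketch of the rule-based row above, the following uses Python's standard `re` module to find structured identifiers. The patterns, the `Span` type, and the `find_structured` function are illustrative assumptions rather than any library's API, and the patterns are deliberately simplified (US-style SSNs and a loose email shape).

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    text: str


# Simplified, illustrative patterns; real deployments need broader coverage.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_structured(text: str) -> list[Span]:
    """Return every pattern match as a typed span, ordered by position."""
    spans = [
        Span(entity_type, m.start(), m.end(), m.group())
        for entity_type, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda s: s.start)


sample = "Contact jane.doe@example.com regarding SSN 123-45-6789."
for span in find_structured(sample):
    print(span.entity_type, span.text)
```

A sentence such as "John Smith signed the lease." produces no match here, which is the weakness the table attributes to pattern matching: it cannot see unstructured names.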

A recent study on financial documents reported that a hybrid NER approach achieved 94.7% precision, 89.4% recall, and an F1 score of 91.1%.⁴ Legal documents present similar challenges — a mix of structured identifiers (case numbers, statute references) and unstructured entities (party names, judge names) that require both pattern matching and contextual understanding.
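
To make the quoted figures concrete, the short calculation below derives two numbers from them: the share of entities a detector with that recall would miss, and the harmonic mean of the quoted precision and recall. The harmonic mean is the usual single-set definition of F1; computed this way it comes out near 92.0%, slightly above the quoted 91.1%. That gap may reflect averaging across entity types in the study, which is an assumption here and not something the source states. The 200-entity document is hypothetical.

```python
precision = 0.947  # quoted precision
recall = 0.894     # quoted recall

# Share of true entities that go undetected at this recall.
miss_rate = 1 - recall

# F1 as the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Expected misses in a hypothetical document with 200 sensitive entities.
expected_misses = round(200 * miss_rate)

print(f"miss rate: {miss_rate:.1%}")
print(f"harmonic-mean F1: {f1:.1%}")
print(f"expected misses per 200 entities: {expected_misses}")
```

A miss rate of roughly one entity in ten is the practical reason automated detection is treated as a first pass.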

Microsoft Presidio: An Open-Source Reference

Microsoft Presidio is the most widely adopted open-source PII redaction framework. It combines NER models (spaCy by default), regular expressions, and checksum validation to detect over 30 entity types including names, addresses, Social Security numbers, financial account numbers, and email addresses.⁵

Presidio's architecture has two components:

Analyzer: Detects PII entities in text, returning their type, location, and confidence score
Anonymizer: Replaces, masks, hashes, or encrypts detected entities based on configurable rules

Important caveat: Presidio's documentation explicitly states: "Because it is using automated detection mechanisms, there is no guarantee that Presidio will find all sensitive information."⁵ For legal applications, this means Presidio (or any automated tool) should be a first pass, not the only pass. Defence in depth applies to redaction too.
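
The two-stage split can be illustrated without the library itself. The sketch below is not Presidio's API; it is a standard-library approximation of the same idea, with hypothetical names (`Finding`, `analyze`, `anonymize`, `OPERATORS`) and a single toy recogniser carrying a made-up confidence score. It shows detection returning type, location, and score, and anonymisation applying an operator only at or above a confidence threshold.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    entity_type: str
    start: int
    end: int
    score: float  # detector confidence between 0 and 1


def analyze(text: str) -> list[Finding]:
    """Toy analyzer: one regex recogniser with a fixed, assumed score."""
    ssn = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    return [Finding("US_SSN", m.start(), m.end(), 0.85) for m in ssn.finditer(text)]


# Operators: each takes the matched text and returns its replacement.
OPERATORS = {
    "replace": lambda value: "[REDACTED]",
    "mask": lambda value: "*" * (len(value) - 4) + value[-4:],
    "hash": lambda value: hashlib.sha256(value.encode()).hexdigest()[:12],
}


def anonymize(text: str, findings: list[Finding], operator: str, min_score: float = 0.5) -> str:
    """Apply the chosen operator to each finding at or above min_score."""
    apply = OPERATORS[operator]
    # Work right to left so earlier offsets stay valid after each substitution.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        if f.score >= min_score:
            text = text[: f.start] + apply(text[f.start : f.end]) + text[f.end :]
    return text


doc = "Claimant SSN: 123-45-6789."
print(anonymize(doc, analyze(doc), "mask"))
```

Raising `min_score` above the recogniser's score leaves the text untouched, which is how a threshold set too high turns into a silent miss. An unkeyed hash of a low-entropy value such as an SSN can be brute-forced, so a keyed hash would be the safer choice in practice.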

Configurable Redaction Levels

Not all prompts require the same level of redaction. A pipeline should support multiple levels depending on the data sensitivity and the task:

Level	What's Redacted	Example	Use Case
Full redaction	All PII replaced with type tags	"The agreement between [ORG_1] and [ORG_2] dated [DATE_1]..."	Sending to cloud AI for clause analysis
Partial redaction	Names and identifiers redacted; dates and amounts preserved	"The agreement between [ORG_1] and [ORG_2] dated January 15, 2026 for $2.5M..."	Financial analysis where amounts matter
Pseudonymised	PII replaced with consistent fake values	"The agreement between Maple Corp and Pine Industries dated March 3, 2025..."	When the AI needs realistic-looking data for better results
None	No redaction (local model only)	Original text sent as-is	Self-hosted model within your network perimeter
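
One way to express these levels is a policy table listing which entity types each level redacts. The sketch below assumes entities have already been detected and that each appears once in the detected list. `RedactionLevel`, `POLICY`, `redact`, and the type labels are hypothetical names, and pseudonymisation is left out because it needs a replacement generator rather than a tag.

```python
from enum import Enum


class RedactionLevel(Enum):
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


# Entity types redacted at each level (labels are illustrative).
POLICY = {
    RedactionLevel.FULL: {"ORG", "PERSON", "DATE", "AMOUNT"},
    RedactionLevel.PARTIAL: {"ORG", "PERSON"},
    RedactionLevel.NONE: set(),
}


def redact(text: str, entities: list[tuple[str, str]], level: RedactionLevel) -> str:
    """Replace each (value, type) pair with a numbered tag if the level covers its type."""
    counters: dict[str, int] = {}
    for value, entity_type in entities:
        if entity_type not in POLICY[level]:
            continue
        counters[entity_type] = counters.get(entity_type, 0) + 1
        text = text.replace(value, f"[{entity_type}_{counters[entity_type]}]")
    return text


clause = "The agreement between Acme Corp and Baker Industries dated January 15, 2026 for $2.5M"
detected = [
    ("Acme Corp", "ORG"),
    ("Baker Industries", "ORG"),
    ("January 15, 2026", "DATE"),
    ("$2.5M", "AMOUNT"),
]
for level in RedactionLevel:
    print(level.value, "->", redact(clause, detected, level))
```

Keeping the policy as data rather than branching logic makes it straightforward to review which types leave the network at each level.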

Pseudonymisation deserves special attention. Some AI tasks produce better results when the input looks like real data rather than placeholder tags. Replacing "[ORG_1]" with a consistent fake name like "Maple Corp" throughout the document preserves the readability that helps the LLM understand context — while ensuring the actual client name never reaches the cloud.⁶
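
A session-scoped pseudonym table is one simple way to get that consistency. The sketch below is illustrative only: the `Pseudonymiser` class and the pool of fake names are hypothetical, and a real pool would need to be large enough not to run out and vetted so that no fake name matches a real client.

```python
class Pseudonymiser:
    """Assigns each real organisation name one fake name for the whole session."""

    def __init__(self, pool: list[str]) -> None:
        self._unused = list(pool)
        self._assigned: dict[str, str] = {}

    def pseudonym_for(self, real_name: str) -> str:
        # Reuse the earlier assignment so an entity always maps to the same fake name.
        if real_name not in self._assigned:
            if not self._unused:
                raise RuntimeError("pseudonym pool exhausted")
            self._assigned[real_name] = self._unused.pop(0)
        return self._assigned[real_name]

    def apply(self, text: str, real_names: list[str]) -> str:
        # Assign in the order given, then replace longer names first so a
        # short name never clobbers part of a longer one.
        mapping = {name: self.pseudonym_for(name) for name in real_names}
        for name in sorted(mapping, key=len, reverse=True):
            text = text.replace(name, mapping[name])
        return text


session = Pseudonymiser(["Maple Corp", "Pine Industries", "Cedar Holdings"])
first = session.apply("Acme Corp will indemnify Baker Industries.", ["Acme Corp", "Baker Industries"])
second = session.apply("Baker Industries may terminate on notice.", ["Baker Industries"])
print(first)
print(second)
```

Because the table lives on the session object, the second prompt reuses the pseudonym chosen in the first, and the table itself becomes part of the redaction map that must stay inside your perimeter.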

Round-Trip Reconstruction: Redact, Send, Re-Hydrate

Redaction is only half the pipeline. When the AI returns its response, you need to map the redacted entities back to their real values — a process called re-hydration or reconstruction.

The Pipeline

1. Original document enters pipeline
2. NER identifies entities → stored in a REDACTION MAP (entity → placeholder)
3. Document redacted using the map
4. Redacted document sent to cloud LLM
5. LLM response received (contains placeholders)
6. Redaction map applied in reverse → real entities restored
7. Reconstructed response delivered to the user

The redaction map is the sensitive artefact. It must be stored securely — encrypted at rest, access-controlled, and never sent to the cloud provider. If the map is compromised, every redacted prompt can be reconstructed.⁷
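
The seven steps can be sketched end to end in a few functions. Everything here is illustrative: `redact_with_map`, `rehydrate`, and `fake_llm` are hypothetical names, entity detection is assumed to have happened already, and `fake_llm` merely stands in for the cloud call so the example runs offline. The map in this sketch goes from placeholder to real value, which is the direction re-hydration needs.

```python
def redact_with_map(text: str, entities: list[tuple[str, str]]) -> tuple[str, dict[str, str]]:
    """Replace each (value, type) entity with a tag and return the redaction map."""
    redaction_map: dict[str, str] = {}  # placeholder -> real value; stays in your perimeter
    counters: dict[str, int] = {}
    for value, entity_type in entities:
        counters[entity_type] = counters.get(entity_type, 0) + 1
        placeholder = f"[{entity_type}_{counters[entity_type]}]"
        redaction_map[placeholder] = value
        text = text.replace(value, placeholder)
    return text, redaction_map


def rehydrate(response: str, redaction_map: dict[str, str]) -> str:
    """Restore real values; placeholders not in the map pass through unchanged."""
    for placeholder, value in redaction_map.items():
        response = response.replace(placeholder, value)
    return response


def fake_llm(prompt: str) -> str:
    """Stand-in for the cloud provider call (an assumption for this example)."""
    return "Summary: [ORG_1] must pay [ORG_2] within 30 days."


original = "Acme Corp shall pay Baker Industries within 30 days."
redacted, mapping = redact_with_map(original, [("Acme Corp", "ORG"), ("Baker Industries", "ORG")])
answer = rehydrate(fake_llm(redacted), mapping)
print(redacted)
print(answer)
```

Only `redacted` crosses the network boundary in this flow; `mapping` is created and consumed locally, which is why its storage and access controls matter so much.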

Handling Edge Cases

Edge Case	Problem	Solution
Entity in AI response not in original	LLM generates new entity references	Map only applies to known entities; new references pass through unchanged
Partial entity matches	"John" appears both as a name and in "St. John's Court"	Context-aware NER; maintain an exclusion list for legal terminology
Cross-reference consistency	Same entity must get the same placeholder across a multi-document prompt	Use deterministic mapping (e.g., hash-based) with a session-scoped map
Nested entities	"John Smith of Baker & McKenzie LLP" contains both a person and an organisation	Redact from longest match first; maintain entity hierarchy
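
Two of these solutions lend themselves to a short sketch: deterministic, hash-based placeholders for cross-reference consistency, and longest-match-first resolution for overlapping spans. The names `placeholder_for` and `resolve_overlaps` and the session key are hypothetical. The keyed hash uses HMAC-SHA256 from Python's standard library, so placeholders cannot be recomputed from guessed entity names without the key.

```python
import hashlib
import hmac


def placeholder_for(entity_type: str, value: str, session_key: bytes) -> str:
    """Same entity and key always yield the same tag, across documents in a session."""
    normalised = " ".join(value.lower().split())
    digest = hmac.new(session_key, normalised.encode(), hashlib.sha256).hexdigest()
    return f"[{entity_type}_{digest[:8]}]"


def resolve_overlaps(spans: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Keep the longest span wherever candidates overlap; return survivors in text order."""
    kept: list[tuple[int, int, str]] = []
    for span in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        start, end, _ = span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(span)
    return sorted(kept)


key = b"per-session secret"  # hypothetical; generate randomly per session in practice
print(placeholder_for("ORG", "Baker & McKenzie LLP", key))

text = "Hearing at St. John's Court before John Smith."
court = text.index("St. John's Court")
inner = text.index("John")  # the first "John" sits inside the court name
smith = text.index("John Smith")
candidates = [
    (court, court + len("St. John's Court"), "LOCATION"),
    (inner, inner + len("John"), "PERSON"),
    (smith, smith + len("John Smith"), "PERSON"),
]
print(resolve_overlaps(candidates))
```

The short "John" candidate is discarded because it falls inside the longer court-name span, while "John Smith" survives. Truncating the digest to eight hex characters keeps tags readable but allows rare collisions, so a longer prefix may be preferable for large sessions.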

Contractual Protections vs. Technical Protections

Enterprise agreements with AI providers often include zero-data-retention clauses, prohibitions on using customer data for model training, and SOC 2 / ISO 27001 compliance attestations. These are important but insufficient on their own.⁸

Protection Type	What It Covers	What It Doesn't Cover
Contractual (DPA, BAA)	Provider's obligations regarding data handling, breach notification, liability	Data in transit, provider-side vulnerabilities, subpoena exposure
Technical (redaction)	Prevents sensitive data from reaching the provider at all	Entities missed by NER, context that reveals sensitive information without named entities

The strongest position is both: a contractual agreement that restricts data use, combined with technical redaction that ensures sensitive data never reaches the provider regardless of contract compliance. If the contract is breached, your data wasn't there to be exposed.

Building a Legal-Specific NER Model

Generic NER models miss legal-specific entities. Consider extending your redaction pipeline to detect:

Matter numbers and case identifiers (e.g., "Case No. 2025-CV-04521")
Court names and judge names that reveal jurisdiction and strategy
Law firm names that reveal representation relationships
Privilege designations ("Attorney-Client Privileged", "Work Product")
IOLTA account numbers and trust account references
Bar numbers and attorney registration identifiers

Tools like GLiNER support zero-shot NER — you define entity types without retraining the model — making it practical to add legal-specific entity types without a machine learning team.⁹
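
Where identifiers follow a predictable format, pattern-based recognisers are a cheap complement to a zero-shot model. The sketch below is not GLiNER; it uses Python's `re` module with illustrative patterns. The case-number pattern is modelled only on the example format above and the bar-number pattern is an assumed format, so both would need adapting to the jurisdictions you handle.

```python
import re

LEGAL_PATTERNS = {
    # Matches the example format "Case No. 2025-CV-04521"; other courts differ.
    "CASE_NUMBER": re.compile(r"\bCase No\.\s*\d{4}-[A-Z]{2}-\d{4,6}\b"),
    # Common privilege banners, matched case-insensitively.
    "PRIVILEGE_MARK": re.compile(
        r"\b(?:attorney[- ]client privileged|work product)\b", re.IGNORECASE
    ),
    # Assumed format: "Bar No." followed by 5 to 8 digits.
    "BAR_NUMBER": re.compile(r"\bBar No\.\s*\d{5,8}\b"),
}


def find_legal_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = [
        (m.start(), entity_type, m.group())
        for entity_type, pattern in LEGAL_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return [(entity_type, matched) for _, entity_type, matched in sorted(hits)]


memo = "ATTORNEY-CLIENT PRIVILEGED. Re: Case No. 2025-CV-04521, filed by counsel (Bar No. 123456)."
for entity_type, matched in find_legal_entities(memo):
    print(entity_type, "=>", matched)
```

Judge names, court names, and law firm names have no fixed format, so they still depend on a model-based recogniser.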

Practical Architecture

For a legal SaaS platform using cloud AI:

Inference gateway (Episode 38) intercepts every outbound prompt
NER pipeline scans the prompt, generates a redaction map, and redacts the text
Redacted prompt is forwarded to the cloud LLM provider
Response is received and re-hydrated using the redaction map
Redaction map is stored encrypted, access-controlled, and retained per your data retention policy
Audit log records what was redacted, when, and which provider received the redacted version

This pipeline adds latency, typically 100-500 ms for NER processing depending on document length and model complexity. For most legal workflows (contract review, research, drafting), that added delay is unlikely to matter.
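
The audit step in the architecture above benefits from a concrete shape, because the log must describe what was redacted without repeating the redacted values. The sketch below is a hypothetical record builder: `build_audit_record` and its field names are assumptions, not part of any gateway product. It stores entity counts by type, a hash of the redacted prompt, the provider name, and a UTC timestamp.

```python
import hashlib
import json
from collections import Counter
from datetime import datetime, timezone


def build_audit_record(redacted_prompt: str, redaction_map: dict[str, str], provider: str) -> dict:
    """Summarise one outbound request without including any real entity values."""
    # Placeholders look like "[ORG_1]"; the type is the part before the last underscore.
    types = Counter(tag.strip("[]").rsplit("_", 1)[0] for tag in redaction_map)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "provider": provider,
        "entity_counts": dict(types),
        "redacted_prompt_sha256": hashlib.sha256(redacted_prompt.encode()).hexdigest(),
    }


record = build_audit_record(
    "[ORG_1] shall pay [ORG_2] [AMOUNT_1].",
    {"[ORG_1]": "Acme Corp", "[ORG_2]": "Baker Industries", "[AMOUNT_1]": "$2.5 million"},
    provider="example-provider",
)
print(json.dumps(record, indent=2))
```

Only the keys of the redaction map are read, so the audit log can be stored and queried with looser access controls than the map itself.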

What's Next

Episode 40 addresses Local vs. Cloud AI — Security Boundaries — the decision framework for when to send data to the cloud (with redaction) versus when to keep everything on premises by running the model locally.

Sources & Further Reading

1. Prediction Guard, The Complete Guide to PII Detection and Redaction Tools for AI Pipelines in Regulated Industries.
2. ABA, Formal Opinion 512: Generative Artificial Intelligence Tools (July 2024).
3. Elastic Observability Labs, Using NLP and Pattern Matching to Detect, Assess, and Redact PII in Logs.
4. Nature Scientific Reports, A Hybrid Rule-Based NLP and Machine Learning Approach for PII Detection and Anonymization in Financial Documents.
5. Microsoft, Presidio: Data Protection and De-Identification SDK.
6. Justee, Redact PII from Legal Documents for AI.
7. Protecto, Comparing Best NER Models for PII Identification.
8. Spellbook, Most Private AI for Lawyers: Why Zero Data Retention Wins in 2026.
9. DEV Community, The Next Generation of Privacy: Using Docling & GLiNER's Advanced NER.
10. Tonic.ai, Named Entity Recognition for Data Compliance Automation.
11. Microsoft Learn, Transparency Note: Named Entity Recognition Feature of Azure AI Language.

Alice: Welcome back to Security for Legal SaaS. I'm Alice.

Dan: And I'm Dan. Episode 39: redaction pipelines for cloud AI. Alice, last episode we talked about protecting the API keys that connect to cloud AI. This time we're talking about what actually goes through that connection.

Alice: Right. Here's the question every legal AI builder should ask before sending a prompt to OpenAI or Anthropic or Google: does the cloud provider actually need to see your client's name? The dollar amount of the deal? The names of the parties in litigation? Usually, the answer is no. The AI can summarise a contract, extract obligations, review clause language without knowing it's about Acme Corp and Baker Industries. It just needs the structure and the text. Everything identifiable is a leak waiting to happen.

Dan: Mm. So you strip that stuff out before sending it?

Alice: Exactly. That's a redaction pipeline. Before any prompt leaves your network, it passes through a system that identifies sensitive information — client names, organisation names, dates, financial amounts, case numbers, identification numbers — and replaces them with placeholders. The prompt that reaches the cloud says "The agreement between ORG_1 and ORG_2 dated DATE_1" instead of the real names and dates. The AI processes it, sends back its analysis, and then your system maps the placeholders back to the real values. Your user sees the full result. The cloud provider never saw the sensitive data.

Dan: Right. And the technology that identifies those entities in the text — that's NER? Named Entity Recognition?

Alice: NER, yes — it's a natural language processing technique that identifies and classifies specific entities in text. Person names, organisations, locations, dates, financial figures, identification numbers. Modern NER systems combine multiple approaches. Pattern matching with regular expressions — that's great for structured data like Social Security numbers or email addresses, anything with a predictable format. Statistical models trained on labelled text — those handle names and organisations. And then deep learning transformer models — those are the most accurate because they understand context.

Dan: Hmm. How accurate are we talking?

Alice: A recent study on financial documents — which have similar complexity to legal text — reported a hybrid approach achieving about 95% precision and 89% recall. That's good, but notice what it means: roughly one in ten sensitive entities might be missed. For legal AI, that's why automated redaction should be a first pass, not the only pass. Defence in depth applies to redaction the same way it applies to everything else we've discussed in this series.

Dan: Yeah. What's the go-to tool for building this?

Alice: Microsoft Presidio is the most widely adopted open-source option. It combines NER models, regular expressions, and checksum validation to detect over 30 entity types. It has two components — an Analyzer that finds the entities and scores its confidence, and an Anonymizer that replaces, masks, or encrypts them based on rules you configure. It uses spaCy's NER model by default, but you can swap in more powerful models. And to their credit, the documentation is honest — it says explicitly that automated detection cannot guarantee it will find all sensitive information.

Dan: Mm. So what does the actual pipeline look like end to end?

Alice: Five steps. First, the original document enters the pipeline — let's say a contract a lawyer wants summarised. Second, the NER system scans it and identifies every entity — "John Smith" is a person, "Acme Corporation" is an organisation, "$2.5 million" is a financial amount. Each entity gets mapped to a placeholder — John Smith becomes PERSON_1, Acme Corporation becomes ORG_1. That mapping is stored in what we call a redaction map. Third, the redacted document is sent to the cloud LLM. Fourth, the response comes back with placeholders in it. Fifth, the redaction map is applied in reverse — the placeholders are replaced with the real values, and the user sees the complete result.

Dan: Mm-hmm. And that redaction map — that's now the sensitive thing you need to protect?

Alice: Exactly. The redaction map is the Rosetta Stone. If someone gets the map, they can reconstruct every redacted prompt. So it has to be encrypted at rest, access-controlled, and stored according to your data retention policy. Never send the map to the cloud. Never store it in the same system as the redacted prompts. It stays within your perimeter.

Dan: Right. There's something I want to ask about the redaction levels. Does every prompt need full redaction? What about cases where the AI needs to see dates or dollar amounts to do its job properly?

Alice: Good question. A well-designed pipeline supports multiple levels. Full redaction replaces everything identifiable — names, dates, amounts, all of it. That's appropriate when you're asking the AI to analyse clause structure or identify obligation types. Partial redaction keeps dates and amounts but strips names and identifiers — useful when the AI needs to do financial analysis. And then there's pseudonymisation, which is interesting. Instead of replacing "Acme Corporation" with a placeholder tag, you replace it with a fake but realistic name like "Maple Corp." The AI gets data that looks real, which can actually improve the quality of the response, but the actual client identity never reaches the cloud.

Dan: Hmm. That pseudonymisation approach is clever. The AI doesn't know the difference between a fake name and a real one.

Alice: And the results are often better than with placeholder tags, because the model was trained on text that contains real names, not tags like ORG_1. The consistency matters too — "Maple Corp" needs to be "Maple Corp" throughout the entire document and across related prompts in the same session. Deterministic mapping handles this — the same input entity always produces the same pseudonym within a session.

Dan: Yeah. What about legal-specific entities that a general NER model might miss?

Alice: This is where generic tools fall short. A general NER model catches names and addresses, but it won't recognise a matter number like "Case No. 2025-CV-04521" as sensitive. It won't flag judge names that reveal which court your case is in. It won't catch IOLTA trust account numbers or bar registration numbers. For legal AI, you need to extend the NER pipeline with legal-specific entity types. Newer tools like GLiNER support zero-shot entity recognition: you define the entity types you want without retraining the model. You tell it "find matter numbers, court names, privilege designations" and it tries to identify them from those labels and the surrounding context.

Dan: Mm. Now, I have to ask — some firms sign data processing agreements with AI providers that say "we won't train on your data, we won't retain your prompts." Isn't that enough?

Alice: It's necessary but not sufficient. Contractual protections cover the provider's obligations. But they don't protect data in transit. They don't protect against a vulnerability on the provider's side. They don't protect you if the provider is served with a subpoena for data they processed. The strongest position is both — a contract that restricts data use AND technical redaction that ensures the sensitive data was never there in the first place. If the contract is breached, your client's data wasn't exposed because it never left your network.

Dan: Right. Belt and suspenders.

Alice: For privileged legal data, it's the minimum standard. ABA Formal Opinion 512 requires lawyers to understand how AI tools use data and implement "adequate safeguards." A redaction pipeline running through your inference gateway is one of the most concrete safeguards you can point to.

Dan: Next episode — Local vs. Cloud AI. When does it make sense to skip the cloud entirely and run the model on your own hardware?

Alice: Until then, I'm Alice.

Dan: And I'm Dan.

Alice: Security for Legal SaaS is a series written with AI assistance. Alice and Dan are AI-generated voices — no professional advice here, just education.
