Today’s Lesson
Security for Legal SaaS — Episode 36: Model Inversion and Membership Inference
When Your AI Leaks What It Learned
In Episode 35, we explored how vector databases can leak tenant data through embedding similarity queries. This episode moves one layer deeper: what happens when the model itself becomes the leak?
Model inversion and membership inference are two classes of privacy attack that target machine learning models — not their infrastructure, not their APIs, but the mathematical patterns the model learned during training. For legal AI systems trained on privileged documents, these attacks raise a question that no other industry faces quite so sharply: if an attacker can prove a specific document was in the training data, has the attorney-client privilege over that document been compromised?
Model Inversion: Reconstructing What the Model Saw
A model inversion attack attempts to reconstruct training data by repeatedly querying a model and analysing its outputs. Imagine a law firm's AI that was trained to classify documents by matter type. An attacker who can query that classifier thousands of times, observing confidence scores, probability distributions, and output patterns, can gradually reconstruct characteristics of the documents the model was trained on [1].
An influential early demonstration came from Fredrikson et al. in 2015, who showed that a facial recognition model could be queried to reconstruct recognisable images of individuals in the training set [2]. For legal AI, the data at risk is not faces but privileged communications, litigation strategy memos, and client financial records.
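The query-and-observe loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-class linear classifier, its weights, the `query_model` and `invert` names, and the four abstract features are all hypothetical, and real attacks on real models are far more involved. The attacker only ever calls `query_model` and uses the returned confidence scores to hill-climb towards an input the model treats as a typical "contract".

```python
import math
import random

# Hypothetical stand-in for the victim model: a tiny linear classifier over
# four abstract features. The attacker never reads these weights directly.
_WEIGHTS = {
    "contract": [2.0, -1.0, 0.5, 0.0],
    "memo": [-1.0, 2.0, 0.0, 0.5],
}


def query_model(features):
    """Return per-class confidence scores (softmax), as a grey-box API might."""
    logits = {
        label: sum(w * x for w, x in zip(weights, features))
        for label, weights in _WEIGHTS.items()
    }
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def invert(target_label, n_features=4, steps=2000, step_size=0.1, seed=0):
    """Hill-climb an input (each feature kept in [0, 1]) towards the
    highest confidence the model will give for target_label."""
    rng = random.Random(seed)
    guess = [0.5] * n_features
    best = query_model(guess)[target_label]
    for _ in range(steps):
        i = rng.randrange(n_features)
        candidate = list(guess)
        nudged = candidate[i] + rng.choice((-step_size, step_size))
        candidate[i] = min(1.0, max(0.0, nudged))
        score = query_model(candidate)[target_label]
        if score > best:  # keep only changes that raise the confidence
            guess, best = candidate, score
    return guess, best


reconstruction, confidence = invert("contract")
```

The recovered vector is a class representative rather than any one document, which matches the "approximations of training data" framing: the more detailed the scores the API returns, the more signal each query provides.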
| Attack Type | What It Reveals | Legal Risk |
|---|---|---|
| Model inversion | Approximations of actual training data content | Privilege waiver if privileged content is reconstructable |
| Membership inference | Whether a specific document was used in training | Proves exposure of confidential material to the model |
| Attribute inference | Properties of training data (e.g., client names, matter types) | Reveals confidential client relationships |
Legal precedent alert: In United States v. Bradley Heppner (S.D.N.Y., February 2026), the court held that written exchanges with a publicly available AI platform are not protected by attorney-client privilege, partly because the AI provider's terms permitted data retention and use for model training [3]. If a model trained on privileged data can be queried to reveal that data, the privilege argument becomes even harder to sustain.
Membership Inference: Proving a Document Was in the Training Set
Where model inversion tries to reconstruct what the model learned, membership inference asks a simpler but equally dangerous question: was this specific document — this contract, this legal memo, this client communication — part of the training data?
The seminal work by Shokri et al. (2017) showed that machine learning models behave measurably differently on data they were trained on versus data they have never seen [4]. A model tends to be more "confident" on its training examples, returning higher probability scores for them. An attacker with access to both the target model and a sample of known training and non-training data can build a classifier that distinguishes between the two.
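The confidence gap described above can be sketched with a deliberately simplified threshold attack. This is not the shadow-model method from the Shokri et al. paper; it is a minimal variant that assumes the attacker already holds top-class confidence scores for a few known members and non-members. The function names and all numbers are made up for illustration.

```python
from statistics import mean


def fit_threshold(member_conf, nonmember_conf):
    """Pick the midpoint between the mean confidence on known training
    examples and the mean confidence on known non-training examples."""
    return (mean(member_conf) + mean(nonmember_conf)) / 2


def infer_membership(confidence, threshold):
    """Guess 'was in the training set' when the model is unusually confident."""
    return confidence >= threshold


# Illustrative (invented) top-class confidences returned by the target model.
known_members = [0.99, 0.97, 0.98, 0.95]
known_nonmembers = [0.71, 0.80, 0.64, 0.77]

threshold = fit_threshold(known_members, known_nonmembers)
guess = infer_membership(0.96, threshold)  # score observed for the document under test
```

Real attacks replace the single threshold with a trained attack classifier, but the dependency is the same: the attack needs confidence scores that differ between seen and unseen documents.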
For legal AI, the implications are severe:
- Privilege exposure: If an adversary can prove that a privileged memo was in a model's training data, they have evidence that the privileged information was disclosed to a third-party system, potentially waiving the privilege [3].
- Conflict discovery: A membership inference attack could reveal that a law firm's AI was trained on documents from a specific client, exposing confidential client relationships.
- Regulatory evidence: Under data protection laws like the GDPR, a membership inference result could serve as evidence that personal data was processed without adequate consent.
How Practical Are These Attacks?
The honest answer: it depends on the model architecture, the attacker's access level, and the defences in place.
| Factor | Lower Risk | Higher Risk |
|---|---|---|
| Model access | Black-box (API only, no confidence scores) | White-box (model weights available) or grey-box (confidence scores returned) |
| Training data size | Large, diverse datasets (individual documents less memorised) | Small, specialised datasets (legal corpora are often small and domain-specific) |
| Model complexity | Simple models (less capacity to memorise) | Large overparameterised models (GPT-scale models memorise more) |
| Output detail | Binary yes/no outputs | Full probability distributions or logits |
Legal AI systems sit in the higher-risk column on several factors. They are typically fine-tuned on small, specialised corpora of legal documents, which is exactly the scenario where memorisation is most likely. Research published in 2024 demonstrated that membership inference attacks against fine-tuned language models achieve significantly higher accuracy when the fine-tuning dataset is small and domain-specific [5].
OWASP LLM Top 10 (2025): The OWASP Top 10 for LLM Applications lists data and model poisoning (LLM04) and sensitive information disclosure (LLM02) among its top risks. Model inversion and membership inference are attack techniques that fall under LLM02: they extract sensitive information that the model inadvertently memorised during training [6].
Defences: Reducing What the Model Reveals
No single defence eliminates these risks entirely. The strategy is defence in depth — layering multiple controls to make attacks progressively harder.
1. Differential Privacy
Differential privacy adds carefully calibrated noise during training so that no single training example significantly influences the model's outputs. NIST SP 800-226 provides guidelines for evaluating differential privacy guarantees [7]. The trade-off: stronger privacy guarantees (a smaller epsilon) generally reduce model accuracy. For legal AI, this means calibrating the privacy budget (epsilon) to balance confidentiality protection against the model's usefulness for document classification or contract review.
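One common way to add this noise during training is the clip-and-noise step used in DP-SGD-style training. The sketch below assumes per-example gradients are already available as plain lists; `max_norm`, `noise_multiplier`, and the function names are illustrative choices. It does not compute the resulting epsilon, which requires a separate privacy accountant.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum them, add Gaussian noise scaled to
    the clipping bound, then average over the batch."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [value / len(clipped) for value in noisy]


rng = random.Random(42)
batch = [[3.0, 4.0], [0.3, 0.4]]  # made-up gradients for two documents
update = private_mean_gradient(batch, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any one document can move the model, and the noise hides what remains. A higher `noise_multiplier` gives stronger privacy at the cost of noisier updates.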
2. Output Perturbation
Instead of modifying training, output perturbation adds noise to the model's responses at inference time. Rather than returning exact confidence scores (e.g., "92.7% contract, 7.3% memo"), the system rounds or randomises outputs ("likely contract"). This reduces the signal available to membership inference attacks without retraining the model [8].
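A minimal sketch of both options, coarsening and randomising, might look like the following. The band cut-off of 0.6, the noise scale, and the function names are arbitrary assumptions for illustration, not recommended values.

```python
import random


def band_label(probabilities, likely_cutoff=0.6):
    """Replace exact scores with the top label and a coarse confidence band."""
    label, score = max(probabilities.items(), key=lambda item: item[1])
    band = "likely" if score >= likely_cutoff else "possibly"
    return f"{band} {label}"


def noisy_scores(probabilities, scale, rng):
    """Add uniform noise to each score, keep it positive, and renormalise
    so the perturbed scores still sum to 1."""
    noisy = {
        label: max(1e-6, score + rng.uniform(-scale, scale))
        for label, score in probabilities.items()
    }
    total = sum(noisy.values())
    return {label: value / total for label, value in noisy.items()}


exact = {"contract": 0.927, "memo": 0.073}
coarse = band_label(exact)
perturbed = noisy_scores(exact, scale=0.05, rng=random.Random(7))
```

Coarse bands remove most of the fine-grained signal at once, while noisy scores keep a numeric output for downstream tooling. Which one fits depends on what consumers of the API actually need.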
3. Access Controls on Model Endpoints
Rate-limit queries. Require authentication. Log every inference request. Monitor for the distinctive patterns of inference attacks: thousands of similar queries with slight variations, or systematic probing of confidence boundaries. These controls do not prevent the attack mathematically, but they make it operationally harder and detectable [9].
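As a rough sketch of the rate-limiting and logging controls, the class below keeps a sliding window of request times per authenticated client. The class name, the logger name, and the limits are hypothetical; a production system would more likely rely on its API gateway and persist the audit log.

```python
import logging
import time
from collections import defaultdict, deque

logger = logging.getLogger("inference_audit")  # hypothetical logger name


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock
        self._history = defaultdict(deque)

    def allow(self, client_id):
        now = self._clock()
        history = self._history[client_id]
        # Drop timestamps that have fallen out of the window.
        while history and now - history[0] >= self._window:
            history.popleft()
        if len(history) >= self._max:
            logger.warning("rate limit exceeded client=%s", client_id)
            return False
        history.append(now)
        logger.info("inference request client=%s", client_id)
        return True


limiter = SlidingWindowLimiter(max_requests=100, window_seconds=60)
allowed = limiter.allow("client-a")
```

The warning entries double as a detection signal: a client that repeatedly hits the limit with near-identical queries is worth investigating.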
4. Restricting Confidence Scores
Many membership inference attacks depend on observing the model's confidence scores. If your API returns only the top prediction without probabilities, the attack surface shrinks substantially. The NIST AI Risk Management Framework recommends minimising unnecessary information in model outputs as a privacy-preserving measure [10].
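In code, this can be as small as a response filter that sits between the model and the API layer. The function name and response shape below are assumptions for illustration.

```python
def to_public_response(probabilities):
    """Reduce the model's full score dictionary to the top label only.
    The scores themselves never leave the server."""
    top_label = max(probabilities, key=probabilities.get)
    return {"label": top_label}


internal_scores = {"contract": 0.927, "memo": 0.073}
response = to_public_response(internal_scores)
```

Placing the filter at a single choke point makes it harder for a new endpoint or debug flag to reintroduce scores by accident.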
5. Training Data Governance
The most fundamental defence: do not train models on data that should not be in them. Maintain a clear inventory of training data. Separate privileged and non-privileged corpora. Use synthetic or anonymised data where possible. Document every dataset used in training with provenance metadata — which we will cover in detail in Episode 43.
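A training data inventory can start as a simple record per dataset. The field names below are hypothetical, and the rule that anonymised copies of privileged material may be used for training is a policy assumption made for this sketch, not legal guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal provenance entry for one dataset (hypothetical fields)."""
    name: str
    source: str
    privileged: bool
    anonymised: bool = False


def split_by_privilege(records):
    """Return (trainable, held_back). Privileged data is held back unless
    it has been anonymised (assumed policy for this sketch)."""
    trainable, held_back = [], []
    for record in records:
        if record.privileged and not record.anonymised:
            held_back.append(record)
        else:
            trainable.append(record)
    return trainable, held_back


inventory = [
    DatasetRecord("public-filings", "court website", privileged=False),
    DatasetRecord("client-memos", "document management system", privileged=True),
    DatasetRecord("client-memos-anon", "anonymisation pipeline",
                  privileged=True, anonymised=True),
]
trainable, held_back = split_by_privilege(inventory)
```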
A Decision Framework for Legal AI Builders
| Question | If Yes | If No |
|---|---|---|
| Does the model process privileged client data? | Apply differential privacy; restrict output detail; document training data provenance | Standard access controls may suffice |
| Is the model fine-tuned on a small, client-specific corpus? | High memorisation risk — consider synthetic data or federated learning | Lower individual-document risk, but still apply output controls |
| Does the API return confidence scores or logits? | Restrict to top-k predictions without scores; add output noise | Reduced membership inference surface |
| Is the model accessible to external users? | Rate limiting, authentication, anomaly detection are mandatory | Internal-only access reduces but does not eliminate risk |
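The table above can also be encoded as a small checklist function, for example to drive a deployment review. This is only a restatement of the table under a hypothetical function name; the control names are shortened labels, not a standard taxonomy.

```python
def recommended_controls(privileged_data, small_client_corpus,
                         returns_scores, external_users):
    """Map the four yes/no questions to the controls suggested in the table."""
    controls = []
    if privileged_data:
        controls += ["differential privacy", "restricted output detail",
                     "training data provenance records"]
    else:
        controls.append("standard access controls")
    if small_client_corpus:
        controls.append("synthetic data or federated learning")
    else:
        controls.append("output controls")
    if returns_scores:
        controls += ["top-k predictions without scores", "output noise"]
    if external_users:
        controls += ["rate limiting", "authentication", "anomaly detection"]
    return controls


plan = recommended_controls(privileged_data=True, small_client_corpus=True,
                            returns_scores=False, external_users=True)
```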
What Hogan Lovells Gets Right
The law firm Hogan Lovells published a detailed analysis of model inversion and membership inference risks in the legal context, noting that "organisations should evaluate whether the AI model or system they are deploying has been tested for susceptibility to such attacks" and recommending that vendor contracts include specific provisions around training data governance and model security testing [11]. This is the right instinct: technical controls and contractual protections are complementary, not alternatives.
What's Next
Episode 37 tackles Governed Writes and Human-in-the-Loop — the principle that your AI can draft, analyse, and recommend, but it should never autonomously file a document, send a client communication, or modify a legal record without a human professional reviewing and approving the action.
Sources & Further Reading
1. WitnessAI, Model Inversion Attacks: How They Work, Risks, and Defenses.
2. Zhou et al., Model Inversion Attacks: A Survey of Approaches and Countermeasures (arXiv, November 2024).
3. K&L Gates, Generative AI Data, Attorney-Client Privilege, and the Work-Product Doctrine (February 2026).
4. Shokri et al., Membership Inference Attacks Against Machine Learning Models (IEEE S&P, 2017).
5. Colwell, The 2024-2025 MIA Landscape Reveals Relentless Evolution in Membership Inference Attack Sophistication.
6. OWASP, Top 10 for LLM Applications 2025.
7. NIST, SP 800-226: Guidelines for Evaluating Differential Privacy Guarantees.
8. PMC, Algorithms That Remember: Model Inversion Attacks and Data Protection Law.
9. Hogan Lovells, Model Inversion and Membership Inference: Understanding New AI Security Risks.
10. NIST, AI Risk Management Framework (AI RMF 1.0).
11. Hogan Lovells / JD Supra, Model Inversion and Membership Inference.
12. Springer Nature, Defending Against Attacks in Deep Learning with Differential Privacy.