Episode 36 · Module 8 · AI Security

Model Inversion and Membership Inference

19 May 2026 · 8:41 · Security for Legal SaaS

Today’s Lesson

Security for Legal SaaS — Episode 36: Model Inversion and Membership Inference

When Your AI Leaks What It Learned

In Episode 35, we explored how vector databases can leak tenant data through embedding similarity queries. This episode moves one layer deeper: what happens when the model itself becomes the leak?

Model inversion and membership inference are two classes of privacy attack that target machine learning models — not their infrastructure, not their APIs, but the mathematical patterns the model learned during training. For legal AI systems trained on privileged documents, these attacks raise a question that no other industry faces quite so sharply: if an attacker can prove a specific document was in the training data, has the attorney-client privilege over that document been compromised?

Model Inversion: Reconstructing What the Model Saw

A model inversion attack attempts to reconstruct training data by carefully querying a model and analysing its outputs. Think of it like this: imagine a law firm's AI was trained to classify documents by matter type. An attacker who can query that classifier thousands of times — observing confidence scores, probability distributions, and output patterns — can gradually reconstruct characteristics of the documents the model was trained on.¹

The best-known demonstration came from Fredrikson et al. in 2015, who showed that a facial recognition model could be queried to reconstruct recognisable images of individuals in the training set.² For legal AI, the data at risk is not faces but privileged communications, litigation strategy memos, and client financial records.
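To make the querying loop concrete, here is a minimal sketch under heavy assumptions. The target is a hypothetical toy classifier over a six-word vocabulary; `_WEIGHTS`, `query_confidence`, and `invert` are illustrative names, not part of any real system. The attacker uses a simple greedy search rather than the gradient-based optimisation found in published attacks. The point is only that confidence scores alone can reveal which features the model learned to associate with a class.

```python
import math

# Hypothetical toy target. The weights stand in for patterns learned from
# training documents; the attacker never reads this table directly.
_WEIGHTS = {"indemnify": 2.0, "hereby": 1.5, "warranty": 1.2,
            "party": 1.0, "weather": -1.0, "lunch": -2.0}


def query_confidence(words):
    """Black-box endpoint: returns P(contract) for a collection of words."""
    score = sum(_WEIGHTS.get(w, 0.0) for w in set(words))
    return 1.0 / (1.0 + math.exp(-score))


def invert(vocabulary, query, max_words=3):
    """Greedy inversion: keep adding the word that raises confidence most."""
    chosen = []
    best = query(chosen)
    for _ in range(max_words):
        trials = [(query(chosen + [w]), w) for w in vocabulary if w not in chosen]
        if not trials:
            break
        top_confidence, top_word = max(trials)
        if top_confidence <= best:
            break  # no remaining word improves the score
        chosen.append(top_word)
        best = top_confidence
    return chosen, best


# The attacker only needs a candidate vocabulary and query access.
reconstructed, confidence = invert(sorted(_WEIGHTS), query_confidence)
print(reconstructed, round(confidence, 3))
```

With these made-up weights the search recovers the three words the toy model weights most heavily towards "contract". A real attack needs far more queries and a far larger search space, which is why query volume is a useful detection signal.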

Attack Type	What It Reveals	Legal Risk
Model inversion	Approximations of actual training data content	Privilege waiver if privileged content is reconstructable
Membership inference	Whether a specific document was used in training	Proves exposure of confidential material to the model
Attribute inference	Properties of training data (e.g., client names, matter types)	Reveals confidential client relationships

Legal precedent alert: In United States v. Heppner (S.D.N.Y., February 2026), the court held that written exchanges with a publicly available AI platform are not protected by attorney-client privilege, partly because the AI provider's terms permitted data retention and use for model training.³ If a model trained on privileged data can be queried to reveal that data, the privilege argument becomes even harder to sustain.

Membership Inference: Proving a Document Was in the Training Set

Where model inversion tries to reconstruct what the model learned, membership inference asks a simpler but equally dangerous question: was this specific document — this contract, this legal memo, this client communication — part of the training data?

The seminal work by Shokri et al. (2017) showed that machine learning models behave measurably differently on data they were trained on versus data they have never seen.⁴ A model tends to be more "confident" on its training examples, returning higher probability scores. An attacker who can query the target model, and who can train "shadow" models on similar data where membership is known, can build a classifier that distinguishes training members from non-members.
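The confidence gap can be illustrated with a deliberately simplified sketch. This is a plain confidence-threshold variant, not the shadow-model attack classifier from the Shokri et al. paper, and every number below is made up for illustration. The function names `fit_threshold` and `infer_membership` are hypothetical.

```python
def fit_threshold(member_conf, nonmember_conf):
    """Pick the confidence cut-off that best separates known members
    from known non-members (for example, scores from a shadow model)."""
    candidates = sorted(set(member_conf) | set(nonmember_conf))
    total = len(member_conf) + len(nonmember_conf)
    best_threshold, best_accuracy = candidates[0], 0.0
    for t in candidates:
        correct = (sum(c >= t for c in member_conf)
                   + sum(c < t for c in nonmember_conf))
        accuracy = correct / total
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


def infer_membership(confidence, threshold):
    """Guess 'was in the training set' when confidence reaches the cut-off."""
    return confidence >= threshold


# Illustrative calibration scores, not measurements from any real model.
shadow_members = [0.99, 0.97, 0.95, 0.90, 0.72]
shadow_nonmembers = [0.91, 0.70, 0.66, 0.60, 0.55]

threshold, accuracy = fit_threshold(shadow_members, shadow_nonmembers)
print(threshold, accuracy)
```

The attacker then submits the document of interest to the target model and applies `infer_membership` to the returned confidence. The wider the gap between member and non-member scores, the more reliable the guess.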

For legal AI, the implications are severe:

Privilege exposure: If an adversary can prove that a privileged memo was in a model's training data, they have evidence that the privileged information was disclosed to a third-party system — potentially waiving the privilege.³
Conflict discovery: A membership inference attack could reveal that a law firm's AI was trained on documents from a specific client, exposing confidential client relationships.
Regulatory evidence: Under data protection laws like the GDPR, a membership inference result could serve as evidence that personal data was processed without a lawful basis such as consent.

How Practical Are These Attacks?

The honest answer: it depends on the model architecture, the attacker's access level, and the defences in place.

Factor	Lower Risk	Higher Risk
Model access	Black-box (API only, no confidence scores)	White-box (model weights available) or grey-box (confidence scores returned)
Training data size	Large, diverse datasets (individual documents less memorised)	Small, specialised datasets (legal corpora are often small and domain-specific)
Model complexity	Simple models (less capacity to memorise)	Large overparameterised models (GPT-scale models memorise more)
Output detail	Binary yes/no outputs	Full probability distributions or logits

Legal AI systems sit in the higher-risk column on several factors. They are typically fine-tuned on small, specialised corpora of legal documents — exactly the scenario where memorisation is most likely. Research published in 2024 demonstrated that membership inference attacks against fine-tuned language models achieve significantly higher accuracy when the fine-tuning dataset is small and domain-specific.⁵

OWASP LLM Top 10 (2025): The OWASP Top 10 for LLM Applications lists sensitive information disclosure (LLM02) and data and model poisoning (LLM04) among its top risks. Model inversion and membership inference are attack techniques that fall under LLM02: they extract sensitive information that the model inadvertently memorised during training.⁶

Defences: Reducing What the Model Reveals

No single defence eliminates these risks entirely. The strategy is defence in depth — layering multiple controls to make attacks progressively harder.

1. Differential Privacy

Differential privacy adds carefully calibrated mathematical noise during training, ensuring that no single training example significantly influences the model's outputs. NIST SP 800-226 provides guidelines for evaluating differential privacy guarantees.⁷ The tradeoff: stronger privacy guarantees reduce model accuracy. For legal AI, this means calibrating the privacy budget (epsilon, where smaller values mean stronger privacy and more noise) to balance confidentiality protection against the model's usefulness for document classification or contract review.
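One common way to add noise during training is the clip-and-noise step used in DP-SGD-style training. The sketch below shows only that single step, in plain Python, with hypothetical function names. It assumes per-example gradients are already available as lists of floats. It does not compute epsilon; relating `noise_multiplier` to a privacy budget requires a privacy accountant, which a maintained library would normally supply.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Clipping bounds how much any single document can move the model;
    the noise hides whatever influence remains.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    sigma = noise_multiplier * max_norm
    noisy_sum = [sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)
                 for i in range(dim)]
    return [s / len(clipped) for s in noisy_sum]


# Illustrative gradients for two training documents.
grads = [[3.0, 4.0], [0.0, 0.5]]
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0,
                               rng=random.Random(0))
print(update)
```

A larger `noise_multiplier` gives stronger protection and a noisier update, which is the accuracy tradeoff described above.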

2. Output Perturbation

Instead of modifying training, output perturbation adds noise to the model's responses at inference time. Rather than returning exact confidence scores (e.g., "92.7% contract, 7.3% memo"), the system rounds or randomises outputs ("likely contract"). This reduces the signal available to membership inference attacks without retraining the model.⁸
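A minimal sketch of the rounding approach, assuming the model returns a dictionary of class probabilities. The function name `describe` and the band cut-offs (0.8 and 0.5) are hypothetical choices for illustration, not recommended values. Randomised perturbation is not shown.

```python
def describe(probabilities):
    """Replace exact scores with a coarse verbal band for the top class."""
    label, p = max(probabilities.items(), key=lambda item: item[1])
    if p >= 0.8:
        return f"likely {label}"
    if p >= 0.5:
        return f"possibly {label}"
    return "unclear"


# Two different raw scores collapse to the same public answer.
print(describe({"contract": 0.927, "memo": 0.073}))
print(describe({"contract": 0.81, "memo": 0.19}))
```

Because 92.7% and 81% produce the same response, an attacker can no longer observe the small confidence differences that separate training members from non-members.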

3. Access Controls on Model Endpoints

Rate-limit queries. Require authentication. Log every inference request. Monitor for the distinctive patterns of inference attacks — thousands of similar queries with slight variations, systematic probing of confidence boundaries. These controls do not prevent the attack mathematically, but they make it operationally harder and detectable.⁹
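As a rough sketch of the rate-limiting part, the class below keeps a sliding window of request times per client. `QueryMonitor` and its parameters are hypothetical names; timestamps are passed in explicitly (in seconds) so the logic is easy to test. A production system would also persist the log and raise an alert on rejections, which this sketch leaves out.

```python
from collections import defaultdict, deque


class QueryMonitor:
    """Sliding-window limiter: at most max_queries per client per window."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> recent timestamps

    def record(self, client_id, timestamp):
        """Return True if the request is allowed, False if it is over limit."""
        window = self._history[client_id]
        # Drop requests that have aged out of the window.
        while window and timestamp - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_queries:
            return False  # candidate for an alert: sustained heavy querying
        window.append(timestamp)
        return True


monitor = QueryMonitor(max_queries=3, window_seconds=60)
results = [monitor.record("client-a", t) for t in (0, 1, 2, 3)]
print(results)
```

Limits are tracked per client, so one heavy user does not block others. The detection value comes from treating repeated rejections as a signal worth investigating.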

4. Restricting Confidence Scores

Many membership inference attacks depend on observing the model's confidence scores. If your API returns only the top prediction without probabilities, the attack surface shrinks substantially. The NIST AI Risk Management Framework recommends minimising unnecessary information in model outputs as a privacy-preserving measure.¹⁰
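In code, this can be as small as a response filter between the model and the API boundary. The sketch assumes a hypothetical raw output shape with `probabilities` and `logits` fields; real serving frameworks will differ.

```python
def public_response(raw_output):
    """Return only the top label; drop probabilities and logits entirely."""
    probabilities = raw_output["probabilities"]
    top_label = max(probabilities, key=probabilities.get)
    return {"label": top_label}


raw = {"probabilities": {"contract": 0.927, "memo": 0.073},
       "logits": [2.5, -0.04]}
print(public_response(raw))
```

Building the public response from scratch, rather than deleting fields from the raw output, avoids leaking any new field the model server might add later.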

5. Training Data Governance

The most fundamental defence: do not train models on data that should not be in them. Maintain a clear inventory of training data. Separate privileged and non-privileged corpora. Use synthetic or anonymised data where possible. Document every dataset used in training with provenance metadata — which we will cover in detail in Episode 43.
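A training data inventory can start as a simple manifest that is checked before any training run. The sketch below is one possible shape, with hypothetical field names (`document_id`, `source`, `privileged`, `approved_by`) and a hypothetical rule: exclude anything privileged or lacking a sign-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    document_id: str
    source: str        # where the document came from
    privileged: bool   # flagged as privileged during intake
    approved_by: str   # empty string means nobody signed off


def training_manifest(records):
    """Split records into IDs cleared for training and IDs held back."""
    included, excluded = [], []
    for record in records:
        if record.privileged or not record.approved_by:
            excluded.append(record.document_id)
        else:
            included.append(record.document_id)
    return included, excluded


records = [
    DatasetRecord("doc-001", "public court filings", False, "j.smith"),
    DatasetRecord("doc-002", "client matter files", True, "j.smith"),
    DatasetRecord("doc-003", "template library", False, ""),
]
included, excluded = training_manifest(records)
print(included, excluded)
```

Keeping the excluded list alongside the included one gives a record of what was deliberately held back and why.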

A Decision Framework for Legal AI Builders

Question	If Yes	If No
Does the model process privileged client data?	Apply differential privacy; restrict output detail; document training data provenance	Standard access controls may suffice
Is the model fine-tuned on a small, client-specific corpus?	High memorisation risk — consider synthetic data or federated learning	Lower individual-document risk, but still apply output controls
Does the API return confidence scores or logits?	Restrict to top-k predictions without scores; add output noise	Reduced membership inference surface
Is the model accessible to external users?	Rate limiting, authentication, anomaly detection are mandatory	Internal-only access reduces but does not eliminate risk
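The table above can also be expressed as a small checklist function, which is handy for design reviews. This is a direct encoding of the four questions; the function name and the control labels are illustrative shorthand for the table entries.

```python
def recommended_controls(privileged_data, small_client_corpus,
                         returns_scores, external_users):
    """Translate the four yes/no questions into a list of controls."""
    controls = []
    if privileged_data:
        controls += ["differential privacy", "restricted output detail",
                     "training data provenance records"]
    else:
        controls.append("standard access controls")
    if small_client_corpus:
        controls.append("synthetic data or federated learning")
    else:
        controls.append("output controls")
    if returns_scores:
        controls += ["predictions without scores", "output noise"]
    if external_users:
        controls += ["rate limiting", "authentication", "anomaly detection"]
    return controls


# Example: an internal classifier fine-tuned on privileged client memos.
print(recommended_controls(True, True, False, False))
```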

What Hogan Lovells Gets Right

The law firm Hogan Lovells published a detailed analysis of model inversion and membership inference risks in the legal context, noting that "organisations should evaluate whether the AI model or system they are deploying has been tested for susceptibility to such attacks" and recommending that vendor contracts include specific provisions around training data governance and model security testing.¹¹ This is the right instinct: technical controls and contractual protections are complementary, not alternatives.

What's Next

Episode 37 tackles Governed Writes and Human-in-the-Loop — the principle that your AI can draft, analyse, and recommend, but it should never autonomously file a document, send a client communication, or modify a legal record without a human professional reviewing and approving the action.

Sources & Further Reading

1. WitnessAI, Model Inversion Attacks: How They Work, Risks, and Defenses.
2. Zhou et al., Model Inversion Attacks: A Survey of Approaches and Countermeasures (arXiv, November 2024).
3. K&L Gates, Generative AI Data, Attorney-Client Privilege, and the Work-Product Doctrine (February 2026).
4. Shokri et al., Membership Inference Attacks Against Machine Learning Models (IEEE S&P, 2017).
5. Colwell, The 2024-2025 MIA Landscape Reveals Relentless Evolution in Membership Inference Attack Sophistication.
6. OWASP, Top 10 for LLM Applications 2025.
7. NIST, SP 800-226: Guidelines for Evaluating Differential Privacy Guarantees.
8. PMC, Algorithms That Remember: Model Inversion Attacks and Data Protection Law.
9. Hogan Lovells, Model Inversion and Membership Inference: Understanding New AI Security Risks.
10. NIST, AI Risk Management Framework (AI RMF 1.0).
11. Hogan Lovells / JD Supra, Model Inversion and Membership Inference.
12. Springer Nature, Defending Against Attacks in Deep Learning with Differential Privacy.

Alice: Welcome back to Security for Legal SaaS. I'm Alice.

Dan: And I'm Dan. Episode 36 — we're staying in Module 8, AI security. Last time we talked about embedding security and vector databases. Today we're going deeper into something that honestly sounds like science fiction — model inversion and membership inference. Alice, what are we actually talking about here?

Alice: We're talking about attacks that target the model itself. Not the server it runs on, not the API, not the database behind it — the actual mathematical patterns the model learned during training. Model inversion is when an attacker queries your AI over and over, carefully analysing the responses, and gradually reconstructs pieces of the data it was trained on. Membership inference is the simpler version — the attacker just wants to know whether a specific document was in the training data at all.

Dan: Mm. So for legal AI, that would mean — could someone figure out that a specific privileged memo was used to train a law firm's document classifier?

Alice: Exactly. And that's not just a technical problem — it's a privilege problem. There was a federal case earlier this year, United States v. Heppner in the Southern District of New York, where the court ruled that exchanges with a public AI platform aren't protected by attorney-client privilege. Part of the reasoning was that the AI provider's terms of service allowed them to retain and train on user data. Now imagine a model that was fine-tuned on privileged documents. If someone can prove through a membership inference attack that a specific privileged document was in that training data, you've got evidence that confidential information was disclosed to a third-party system. The privilege argument gets very difficult at that point.

Dan: Right. So how does model inversion actually work? Walk me through it like I'm not a machine learning researcher — because I'm definitely not.

Alice: Think of it this way. Imagine your firm built an AI that classifies documents — it looks at a document and tells you whether it's a contract, a legal memo, a court filing, whatever. That classifier learned patterns from the documents it was trained on. Now, an attacker starts feeding it carefully crafted inputs — thousands of them — and watching the confidence scores. The model might say "I'm 94% sure this is a contract." By methodically varying the inputs and watching how confidence changes, the attacker can reverse-engineer characteristics of the actual documents the model saw during training. It's like someone figuring out what recipe you used by tasting your cooking a thousand times and adjusting their guess each time.

Dan: Hmm. And membership inference is different?

Alice: It's actually simpler, which is what makes it scarier. Machine learning models behave differently on data they've seen before versus data they haven't. They tend to be more confident on their own training examples. A researcher named Shokri demonstrated this back in 2017. You build a second model, an "attack model", that learns to distinguish between "the target model saw this data" and "the target model didn't see this data," based on the confidence patterns. And it works surprisingly well, especially on smaller, specialised datasets.

Dan: And legal datasets are exactly that — small and specialised.

Alice: That's the problem. If you're training on millions of random internet documents, any individual document has a tiny footprint in the model. But if you fine-tune on a few thousand legal memos from a specific practice area? Each document leaves a much larger imprint. The model memorises more. The membership inference attack gets easier. Research from 2024 and 2025 confirms this — fine-tuned models on small domain-specific corpora are significantly more vulnerable than large general-purpose models.

Dan: Yeah. So what can you actually do about it? Can you prevent these attacks?

Alice: You can't eliminate them entirely, but you can make them much harder. The first defence is differential privacy — a technique where you add carefully calibrated noise during training. The noise is designed so that no single training example significantly changes the model's behaviour. NIST published guidelines for evaluating differential privacy, SP 800-226, because this is becoming a federal-level concern. The tradeoff is that stronger privacy reduces model accuracy. You're deliberately making the model slightly worse to protect the data.

Dan: Mm. That sounds like a hard sell to a partner who wants the most accurate AI possible.

Alice: It is. But the alternative is a model that memorises privileged documents and can be queried to reveal them. The second defence is simpler — restrict what the model tells the outside world. Most membership inference attacks depend on seeing detailed confidence scores. If your API returns just the top prediction — "this is a contract" — without saying "I'm 92.7% sure," you've cut off most of the signal the attacker needs. NIST's AI Risk Management Framework recommends minimising unnecessary detail in model outputs for exactly this reason.

Dan: Right. What about just controlling who can access the model in the first place?

Alice: That's the third layer — access controls. Rate-limit queries. Require authentication. Monitor for the patterns that look like inference attacks — someone sending thousands of slightly different queries in quick succession. These controls don't make the attack mathematically impossible, but they make it operationally much harder and, critically, detectable. If your logs show someone making ten thousand classification requests in an hour, that should trigger an alert.

Dan: Mm-hmm. And there's also just not putting sensitive data in the training set in the first place?

Alice: That's the most fundamental defence, and it's the one people skip. Maintain a clear inventory of what's in your training data. Separate privileged documents from general legal knowledge. Use synthetic or anonymised data where you can. If your contract classifier doesn't need to see real client names and dollar amounts to learn what a contract looks like, strip those out before training. And document everything — which datasets were used, who authorised their inclusion, what the provenance is. We'll cover that documentation piece in Episode 43 when we talk about provenance chains.

Dan: Hmm. I want to make sure we're being honest with the audience here. How practical are these attacks right now? Is this something a solo practitioner building a case management tool needs to worry about, or is this more of a concern for large firms with custom AI systems?

Alice: Fair question. If you're using a commercial AI service like a hosted LLM through an API, the attack surface is mostly on the provider's side — and the major providers are applying defences. Where it gets real is when you're fine-tuning models on your own data, deploying models internally, or building custom classifiers trained on client documents. That's where the training data is yours, the model is yours, and the risk is yours. And the legal AI space is increasingly moving in that direction — firms want models customised to their practice areas, their precedents, their document styles. Every custom fine-tuned model is a potential target.

Dan: Yeah. So the advice is — think about this before you fine-tune, not after.

Alice: Exactly. Apply differential privacy during training. Restrict confidence scores in your outputs. Rate-limit and monitor your endpoints. And most importantly, know what's in your training data and whether it should be there. If you wouldn't be comfortable explaining to a judge exactly which documents your model was trained on, you have a problem that no technical defence can solve.

Dan: Next episode — Governed Writes and Human-in-the-Loop. Why your AI should draft but never file.

Alice: Until then, I'm Alice.

Dan: And I'm Dan.

Alice: Security for Legal SaaS is a series written with AI assistance. Alice and Dan are AI-generated voices — no professional advice here, just education.
