Security for Legal SaaS

Episode 36 · Module 8 · AI Security

Model Inversion and Membership Inference

19 May 2026 · 8:41 · Security for Legal SaaS

Today’s Lesson

When Your AI Leaks What It Learned

In Episode 35, we explored how vector databases can leak tenant data through embedding similarity queries. This episode moves one layer deeper: what happens when the model itself becomes the leak?

Model inversion and membership inference are two classes of privacy attack that target machine learning models — not their infrastructure, not their APIs, but the mathematical patterns the model learned during training. For legal AI systems trained on privileged documents, these attacks raise a question that no other industry faces quite so sharply: if an attacker can prove a specific document was in the training data, has the attorney-client privilege over that document been compromised?

Model Inversion: Reconstructing What the Model Saw

A model inversion attack attempts to reconstruct training data by carefully querying a model and analysing its outputs. For example, suppose a law firm's AI was trained to classify documents by matter type. An attacker who can query that classifier thousands of times, observing confidence scores, probability distributions, and output patterns, can gradually reconstruct characteristics of the documents the model was trained on [1].

Fredrikson et al. demonstrated the technique against a facial recognition model in 2015, showing that repeated queries could reconstruct recognisable images of individuals in the training set [2]. For legal AI, the data at risk is not faces but privileged communications, litigation strategy memos, and client financial records.
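
To make the querying loop concrete, here is a minimal sketch of inversion against a toy black-box model. Everything in it is an assumption for illustration: `make_classifier`, `invert`, the four-number "document", and the distance-based confidence function are hypothetical stand-ins, not a real legal classifier and not the method from the cited papers. The attacker sees only a confidence score, yet hill-climbing on that score moves the guess towards the hidden training example.

```python
import math
import random


def make_classifier(hidden_example):
    """Toy black-box model (hypothetical): confidence rises as the input
    gets closer to a hidden training example the caller never sees."""
    def confidence(features):
        dist_sq = sum((a - b) ** 2 for a, b in zip(features, hidden_example))
        return math.exp(-dist_sq)
    return confidence


def invert(confidence, dim, steps=2000, step_size=0.05, seed=0):
    """Hill-climb on the confidence score using queries only."""
    rng = random.Random(seed)
    guess = [0.5] * dim
    best = confidence(guess)
    queries = 1
    for _ in range(steps):
        candidate = list(guess)
        i = rng.randrange(dim)
        candidate[i] += rng.choice((-step_size, step_size))
        score = confidence(candidate)
        queries += 1
        if score > best:  # keep only changes that raise confidence
            guess, best = candidate, score
    return guess, queries


secret = [0.9, 0.1, 0.4, 0.7]  # stands in for features of a training document
model = make_classifier(secret)
reconstruction, queries_used = invert(model, dim=len(secret))
print([round(v, 2) for v in reconstruction], queries_used)
```

Real attacks face far larger input spaces and noisier signals, but the dependency is the same: the more precise the score the model returns, the more each query tells the attacker.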

| Attack Type | What It Reveals | Legal Risk |
| --- | --- | --- |
| Model inversion | Approximations of actual training data content | Privilege waiver if privileged content is reconstructable |
| Membership inference | Whether a specific document was used in training | Proves exposure of confidential material to the model |
| Attribute inference | Properties of training data (e.g., client names, matter types) | Reveals confidential client relationships |

Legal precedent alert: In United States v. Bradley Heppner (S.D.N.Y., February 2026), the court held that written exchanges with a publicly available AI platform are not protected by attorney-client privilege, partly because the AI provider's terms permitted data retention and use for model training [3]. If a model trained on privileged data can be queried to reveal that data, the privilege argument becomes even harder to sustain.

Membership Inference: Proving a Document Was in the Training Set

Where model inversion tries to reconstruct what the model learned, membership inference asks a simpler but equally dangerous question: was this specific document — this contract, this legal memo, this client communication — part of the training data?

The seminal work by Shokri et al. (2017) showed that machine learning models behave measurably differently on data they were trained on versus data they have never seen [4]. A model tends to be more "confident", returning higher probability scores, on its training examples. An attacker with access to both the target model and a sample of known training and non-training data can build a classifier that distinguishes between the two.
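
The confidence gap can be illustrated with a deliberately simplified sketch. It replaces the trained attack classifier described above with a single confidence threshold, which is an assumption made to keep the example short. The function names and all scores are invented for illustration.

```python
def best_threshold(member_scores, non_member_scores):
    """Pick the confidence cut-off that best separates known training
    examples from known non-training examples."""
    candidates = sorted(set(member_scores) | set(non_member_scores))
    total = len(member_scores) + len(non_member_scores)
    best_t, best_acc = candidates[0], 0.0
    for t in candidates:
        correct = sum(s >= t for s in member_scores)
        correct += sum(s < t for s in non_member_scores)
        acc = correct / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def infer_membership(score, threshold):
    """Guess 'was in the training set' when confidence is at or above the cut-off."""
    return score >= threshold


# Hypothetical top-class confidences returned by the target model.
member_scores = [0.99, 0.97, 0.95, 0.98, 0.91]      # documents known to be in training
non_member_scores = [0.72, 0.88, 0.64, 0.93, 0.80]  # documents known not to be

threshold, accuracy = best_threshold(member_scores, non_member_scores)
print(threshold, accuracy)
print(infer_membership(0.96, threshold))  # a new document the attacker wants to test
```

Even this crude rule separates the two invented groups well, which is why the defences later in this episode focus on shrinking or hiding the confidence signal.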

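For legal AI, the implications are severe: a successful attack shows that a specific confidential document was exposed to the model, which bears directly on whether privilege over that document can still be claimed.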

How Practical Are These Attacks?

The honest answer: it depends on the model architecture, the attacker's access level, and the defences in place.

| Factor | Lower Risk | Higher Risk |
| --- | --- | --- |
| Model access | Black-box (API only, no confidence scores) | White-box (model weights available) or grey-box (confidence scores returned) |
| Training data size | Large, diverse datasets (individual documents less memorised) | Small, specialised datasets (legal corpora are often small and domain-specific) |
| Model complexity | Simple models (less capacity to memorise) | Large overparameterised models (GPT-scale models memorise more) |
| Output detail | Binary yes/no outputs | Full probability distributions or logits |

Legal AI systems sit in the higher-risk column on several factors. They are typically fine-tuned on small, specialised corpora of legal documents, which is exactly the scenario where memorisation is most likely. Research published in 2024 demonstrated that membership inference attacks against fine-tuned language models achieve significantly higher accuracy when the fine-tuning dataset is small and domain-specific [5].

OWASP LLM Top 10 (2025): The OWASP Top 10 for LLM Applications lists sensitive information disclosure (LLM02:2025) and data and model poisoning (LLM04:2025) among its top risks. Model inversion and membership inference are attack techniques that fall under LLM02: they extract sensitive information that the model inadvertently memorised during training [6].

Defences: Reducing What the Model Reveals

No single defence eliminates these risks entirely. The strategy is defence in depth — layering multiple controls to make attacks progressively harder.

1. Differential Privacy

Differential privacy adds carefully calibrated mathematical noise during training, so that no single training example significantly influences the model's outputs. NIST SP 800-226 provides guidelines for evaluating differential privacy guarantees [7]. The tradeoff: stronger privacy guarantees (a smaller privacy budget, epsilon) generally reduce model accuracy. For legal AI, this means calibrating epsilon to balance confidentiality protection against the model's usefulness for document classification or contract review.
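
One common way to add noise during training is to clip each example's gradient and then add Gaussian noise to the sum, in the style of DP-SGD. The sketch below shows only that single step, under stated assumptions: the function names, `max_norm`, and `noise_multiplier` are illustrative, and the privacy accounting that turns these settings into an epsilon value is omitted entirely. A production system would rely on a vetted library rather than hand-written code like this.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Clipping bounds how much any one document can move the model;
    the noise hides whatever influence remains.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-document gradients
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Raising `noise_multiplier` strengthens the protection and degrades the gradient signal, which is the accuracy tradeoff described above.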

2. Output Perturbation

Instead of modifying training, output perturbation adds noise to the model's responses at inference time. Rather than returning exact confidence scores (e.g., "92.7% contract, 7.3% memo"), the system rounds or randomises outputs ("likely contract"). This reduces the signal available to membership inference attacks without retraining the model [8].
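
A minimal sketch of this idea follows. The function name `coarsen`, the band cut-offs (0.8 and 0.5), and the uniform noise are assumptions chosen for illustration. Ad hoc noise like this carries no formal privacy guarantee; it only removes precision the caller does not need.

```python
import random


def coarsen(probabilities, noise_scale=0.0, rng=None):
    """Turn exact class probabilities into a label plus a coarse band."""
    rng = rng or random.Random()
    noisy = {
        label: p + rng.uniform(-noise_scale, noise_scale)
        for label, p in probabilities.items()
    }
    label = max(noisy, key=noisy.get)
    score = min(1.0, max(0.0, noisy[label]))
    if score >= 0.8:
        band = "likely"
    elif score >= 0.5:
        band = "possibly"
    else:
        band = "uncertain"
    return f"{band} {label}"


print(coarsen({"contract": 0.927, "memo": 0.073}))  # → likely contract
```

With `noise_scale` left at zero the function only rounds; a small positive value also randomises results near a band boundary, at the cost of occasionally changing the answer a legitimate user sees.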

3. Access Controls on Model Endpoints

Rate-limit queries. Require authentication. Log every inference request. Monitor for the distinctive patterns of inference attacks, such as thousands of similar queries with slight variations or systematic probing of confidence boundaries. These controls do not prevent the attack mathematically, but they make it operationally harder and detectable [9].
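
As a rough illustration of rate limiting combined with logging, the sketch below implements a per-client sliding-window limiter that records every decision. The class name `QueryMonitor`, its parameters, and the in-memory audit log are hypothetical; a real deployment would sit behind authentication and write to durable, tamper-resistant storage.

```python
from collections import defaultdict, deque


class QueryMonitor:
    """Sliding-window rate limiter that also records every request."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> timestamps of allowed queries
        self.audit_log = []                 # (timestamp, client_id, allowed)

    def allow(self, client_id, now):
        history = self._history[client_id]
        # Drop timestamps that have fallen out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        allowed = len(history) < self.max_queries
        if allowed:
            history.append(now)
        self.audit_log.append((now, client_id, allowed))
        return allowed


monitor = QueryMonitor(max_queries=3, window_seconds=60)
decisions = [monitor.allow("client-a", t) for t in (0, 1, 2, 3)]
print(decisions)  # → [True, True, True, False]
```

Denied requests stay in the audit log, so a burst of refusals for one client is itself a signal worth alerting on.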

4. Restricting Confidence Scores

Many membership inference attacks depend on observing the model's confidence scores. If your API returns only the top prediction without probabilities, the attack surface shrinks substantially. The NIST AI Risk Management Framework recommends minimising unnecessary information in model outputs as a privacy-preserving measure [10].
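
In practice this usually means building the public response from an explicit allow-list of fields rather than forwarding the model's raw output. The sketch below assumes a hypothetical internal prediction record with `request_id`, `probabilities`, and `logits` keys; none of these names come from a real API.

```python
def to_public_response(internal_prediction):
    """Build the API response from scratch so scores cannot leak by default."""
    probabilities = internal_prediction["probabilities"]
    top_label = max(probabilities, key=probabilities.get)
    return {
        "request_id": internal_prediction["request_id"],
        "label": top_label,
    }


internal = {
    "request_id": "req-001",
    "probabilities": {"contract": 0.927, "memo": 0.073},
    "logits": [2.54, 0.0],
}
print(to_public_response(internal))  # → {'request_id': 'req-001', 'label': 'contract'}
```

Constructing a new dictionary, instead of deleting keys from the internal one, means a field added to the model output later stays private unless someone deliberately exposes it.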

5. Training Data Governance

The most fundamental defence: do not train models on data that should not be in them. Maintain a clear inventory of training data. Separate privileged and non-privileged corpora. Use synthetic or anonymised data where possible. Document every dataset used in training with provenance metadata — which we will cover in detail in Episode 43.
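
A training data inventory can start as a simple structured record per dataset. The sketch below is one possible shape, with hypothetical field names and example entries, showing how a privileged flag lets the pipeline keep the two corpora apart before any training run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    source: str       # where the data came from (provenance)
    privileged: bool  # contains attorney-client privileged material
    synthetic: bool   # generated or anonymised rather than raw client data


def split_corpora(inventory):
    """Separate non-privileged datasets from privileged ones."""
    non_privileged = [d for d in inventory if not d.privileged]
    privileged = [d for d in inventory if d.privileged]
    return non_privileged, privileged


inventory = [
    DatasetRecord("public-filings", "court records", privileged=False, synthetic=False),
    DatasetRecord("client-memos", "matter files", privileged=True, synthetic=False),
    DatasetRecord("synthetic-contracts", "generated", privileged=False, synthetic=True),
]
trainable, held_back = split_corpora(inventory)
print([d.name for d in trainable])  # → ['public-filings', 'synthetic-contracts']
```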

A Decision Framework for Legal AI Builders

| Question | If Yes | If No |
| --- | --- | --- |
| Does the model process privileged client data? | Apply differential privacy; restrict output detail; document training data provenance | Standard access controls may suffice |
| Is the model fine-tuned on a small, client-specific corpus? | High memorisation risk: consider synthetic data or federated learning | Lower individual-document risk, but still apply output controls |
| Does the API return confidence scores or logits? | Restrict to top-k predictions without scores; add output noise | Reduced membership inference surface |
| Is the model accessible to external users? | Rate limiting, authentication, anomaly detection are mandatory | Internal-only access reduces but does not eliminate risk |
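
The same framework can be encoded as a small checklist function, which is useful in design reviews or deployment gates. The sketch below is one possible encoding; `ModelProfile` and `recommended_controls` are hypothetical names, and the returned strings simply restate the table.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    processes_privileged_data: bool
    small_client_specific_corpus: bool
    returns_confidence_scores: bool
    externally_accessible: bool


def recommended_controls(profile):
    """Map the four questions in the table to a list of controls."""
    controls = []
    if profile.processes_privileged_data:
        controls += [
            "apply differential privacy",
            "restrict output detail",
            "document training data provenance",
        ]
    else:
        controls.append("standard access controls may suffice")
    if profile.small_client_specific_corpus:
        controls.append("consider synthetic data or federated learning")
    else:
        controls.append("apply output controls")
    if profile.returns_confidence_scores:
        controls += ["return top-k predictions without scores", "add output noise"]
    if profile.externally_accessible:
        controls += ["rate limiting", "authentication", "anomaly detection"]
    return controls


profile = ModelProfile(True, True, True, False)
for control in recommended_controls(profile):
    print("-", control)
```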

What Hogan Lovells Gets Right

The law firm Hogan Lovells published a detailed analysis of model inversion and membership inference risks in the legal context, noting that "organisations should evaluate whether the AI model or system they are deploying has been tested for susceptibility to such attacks" and recommending that vendor contracts include specific provisions around training data governance and model security testing [11]. This is the right instinct: technical controls and contractual protections are complementary, not alternatives.

What's Next

Episode 37 tackles Governed Writes and Human-in-the-Loop — the principle that your AI can draft, analyse, and recommend, but it should never autonomously file a document, send a client communication, or modify a legal record without a human professional reviewing and approving the action.

Sources & Further Reading

  1. WitnessAI, Model Inversion Attacks: How They Work, Risks, and Defenses.
  2. Zhou et al., Model Inversion Attacks: A Survey of Approaches and Countermeasures (arXiv, November 2024).
  3. K&L Gates, Generative AI Data, Attorney-Client Privilege, and the Work-Product Doctrine (February 2026).
  4. Shokri et al., Membership Inference Attacks Against Machine Learning Models (IEEE S&P, 2017).
  5. Colwell, The 2024-2025 MIA Landscape Reveals Relentless Evolution in Membership Inference Attack Sophistication.
  6. OWASP, Top 10 for LLM Applications 2025.
  7. NIST, SP 800-226: Guidelines for Evaluating Differential Privacy Guarantees.
  8. PMC, Algorithms That Remember: Model Inversion Attacks and Data Protection Law.
  9. Hogan Lovells, Model Inversion and Membership Inference: Understanding New AI Security Risks.
  10. NIST, AI Risk Management Framework (AI RMF 1.0).
  11. Hogan Lovells / JD Supra, Model Inversion and Membership Inference.
  12. Springer Nature, Defending Against Attacks in Deep Learning with Differential Privacy.