Episode 40 · Module 8 · AI Security

Local vs. Cloud AI — Security Boundaries

19 May 2026 · 8:30 · Security for Legal SaaS


In Episode 39, we built a redaction pipeline to strip sensitive data before sending prompts to cloud AI. This episode asks the next logical question: what if you skip the cloud entirely and run the AI model on your own hardware?

Today’s Lesson

Your Data Never Leaves. But "Local" Has Its Own Risks.

Local AI deployment — running large language models on servers you own and control — is increasingly practical. Models like Llama, Mistral, Qwen, and Gemma can run on consumer-grade GPUs. For legal AI handling the most sensitive data — privileged communications, litigation strategy, government investigation materials — keeping everything within your network perimeter can eliminate entire categories of risk. But "local" introduces a different threat surface, and understanding the tradeoffs is the difference between genuine security and security theatre.

Cloud AI: What You Gain and What You Give Up

When you call a cloud LLM provider — OpenAI, Anthropic, Google, AWS Bedrock — your data traverses networks and is processed on infrastructure outside your control. Even with contractual protections (data processing agreements, zero-retention policies, SOC 2 attestations), several fundamental risks remain:¹

Cloud Risk	Description
Data in transit	Prompts and responses travel over the public internet; they are encrypted in transit, but they still cross networks you do not control
Provider-side breach	Your data is on someone else's servers; their breach is your breach
Subpoena exposure	A subpoena served on the provider could capture your data
Jurisdictional issues	Data may be processed in regions with different privacy laws
Model training risk	Without zero-retention agreements, data may be used to improve models
Third-party access	Provider employees with administrative access could theoretically access data

What you gain is substantial: state-of-the-art model quality, automatic scaling, zero infrastructure management, and access to the largest and most capable models that require hundreds of GPUs to run.

Local AI: What You Gain and What You Give Up

Running a model locally means your data never leaves your network. No prompts cross the internet. No third party processes your privileged communications. For compliance and privilege protection, this is the strongest possible posture.²

But the threat surface shifts — it doesn't disappear:

Local Risk	Description
You own the patching	Model vulnerabilities, OS patches, driver updates — all your responsibility
Physical security	The hardware running the model needs the same physical protection as your file servers
Model provenance	Who built the model? Was the model file tampered with between download and deployment?
Network exposure	A locally deployed model with a network-accessible API is an attack surface
Insider threats	Internal users with access to the model endpoint can extract data through the same techniques we covered in EP36 (model inversion, membership inference)
Capability gap	Smaller models make more errors; lower quality increases the risk of incorrect legal analysis
Operational burden	GPU procurement, cooling, power, monitoring, failover — all on you

Vitalik Buterin's local LLM setup (April 2026): Ethereum founder Vitalik Buterin published his personal local AI security setup, arguing that for sensitive personal data, "running the model locally means the data never leaves your machine" — but noting that local deployment requires ongoing vigilance around model integrity, network isolation, and access controls.³

The Hybrid Architecture

The most practical approach for legal AI is hybrid: route sensitive data to local models and general tasks to cloud AI. This combines the privacy of local inference with the capability of cloud models.

Data Category	Routing	Rationale
Privileged communications	Local only	Privilege waiver risk from cloud exposure
Litigation strategy memos	Local only	Highest-sensitivity work product
Client PII (names, financials)	Local, or cloud with redaction (EP39)	Confidentiality obligations
General legal research	Cloud (no client data in prompt)	Public information, no confidentiality risk
Document formatting/structure	Cloud (with redaction)	Low-sensitivity task, benefits from larger models
Contract template analysis	Cloud (anonymised)	Templates contain no client-specific data
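The routing table above could be expressed as a simple lookup that a gateway consults. This is a minimal sketch, not the episode's gateway: the category keys, the `Route` enum, and `route_for` are hypothetical names, and sending unknown categories to the local model is an assumption (fail closed).

```python
from enum import Enum


class Route(Enum):
    LOCAL_ONLY = "local only"
    LOCAL_OR_REDACTED_CLOUD = "local, or cloud with redaction"
    CLOUD_NO_CLIENT_DATA = "cloud, no client data in prompt"
    CLOUD_REDACTED = "cloud with redaction"
    CLOUD_ANONYMISED = "cloud, anonymised"


# Keys are hypothetical labels for the data categories in the table above.
ROUTING_TABLE = {
    "privileged_communication": Route.LOCAL_ONLY,
    "litigation_strategy_memo": Route.LOCAL_ONLY,
    "client_pii": Route.LOCAL_OR_REDACTED_CLOUD,
    "general_legal_research": Route.CLOUD_NO_CLIENT_DATA,
    "document_formatting": Route.CLOUD_REDACTED,
    "contract_template_analysis": Route.CLOUD_ANONYMISED,
}


def route_for(category: str) -> Route:
    """Return the route for a data category.

    Assumption: anything not in the table is treated as local only,
    so a misclassified or new category never reaches the cloud by default.
    """
    return ROUTING_TABLE.get(category, Route.LOCAL_ONLY)
```

Keeping the table as data rather than branching logic makes it easier to review and audit when a category's routing changes.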

Decision Framework

Ask four questions for each AI task:

1. Data sensitivity: Does the prompt contain privileged, confidential, or personally identifiable information?
2. Model capability: Does the task require a frontier model (GPT-4, Claude Opus), or can a smaller local model handle it?
3. Compliance mandate: Do applicable regulations (GDPR, HIPAA, data localisation laws) require data to stay within your jurisdiction?
4. Acceptable risk: If the data were exposed, what is the worst-case consequence?

If the answer to question 1 is "yes" and the answer to question 2 is "a local model can handle it," the choice is clear: run locally.⁴
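One way to make the four questions explicit is to record the answers per task and derive a recommendation from them. The sketch below is illustrative: `TaskAssessment` and `recommend` are hypothetical names, and only the first rule (sensitive data plus a capable local model means run locally) comes from the text. The other two outcomes are assumptions about how a firm might handle the remaining cases.

```python
from dataclasses import dataclass


@dataclass
class TaskAssessment:
    contains_sensitive_data: bool     # question 1
    local_model_sufficient: bool      # question 2
    must_stay_in_jurisdiction: bool   # question 3
    worst_case_severity: str          # question 4: "low", "medium" or "high"


def recommend(a: TaskAssessment) -> str:
    """Return "local", "needs review" or "cloud" for one AI task."""
    # Rule stated in the text: sensitive data + capable local model -> local.
    if a.contains_sensitive_data and a.local_model_sufficient:
        return "local"
    # Assumption: any remaining red flag is escalated to a human decision
    # (for example redaction before cloud, or a larger local model).
    if (
        a.contains_sensitive_data
        or a.must_stay_in_jurisdiction
        or a.worst_case_severity == "high"
    ):
        return "needs review"
    # Assumption: no sensitive data, no mandate, tolerable worst case.
    return "cloud"
```

For example, `recommend(TaskAssessment(True, True, False, "high"))` yields `"local"`, while a task with sensitive data that a local model cannot handle is flagged for review rather than silently sent to the cloud.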

Securing Local Inference

"Local" does not mean "automatically secure." A model running on your network needs the same security controls as any other service:

Network Isolation

The local model endpoint should not be accessible from the public internet. Deploy it in an isolated network segment — the same principle we covered in Episode 15 on network segmentation. Only your application servers should be able to reach the model API. Use mTLS (mutual TLS, from Episode 15) to authenticate clients connecting to the model endpoint.⁵
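As a rough illustration of the mTLS requirement, Python's standard `ssl` module can build a server-side context that refuses any client without a certificate signed by your internal CA. The function names and the file paths passed in are hypothetical; how the model server actually terminates TLS depends on the serving stack you use.

```python
import ssl


def require_client_certs(ctx: ssl.SSLContext) -> ssl.SSLContext:
    """Turn a server-side TLS context into a mutual-TLS context."""
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED on a server context means the handshake fails unless
    # the client presents a certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def build_server_context(cert_file: str, key_file: str, client_ca_file: str) -> ssl.SSLContext:
    """Build the context for the model endpoint (paths are placeholders)."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # The endpoint's own certificate and private key.
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    # Trust only the internal CA that issues application-server certificates.
    ctx.load_verify_locations(cafile=client_ca_file)
    return require_client_certs(ctx)
```

mTLS authenticates the caller but does not replace network segmentation; the firewall rule that only application servers can reach the endpoint still has to exist.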

Access Controls

Not every user in your organisation should have access to the local model. Apply role-based access controls. Log every inference request with the authenticated identity of the requester, the prompt content (encrypted), and the response. These logs feed into the audit trail we'll design in Episode 41.
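A minimal sketch of the role check and the per-request log record described above. The role names, the field names, and the `encrypt` callable are all hypothetical; in practice `encrypt` would wrap whatever vetted encryption your platform already uses, and the record layout would follow the audit design from Episode 41.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Optional

# Hypothetical roles: who may send inference requests to the local model.
ROLE_PERMISSIONS = {
    "litigation_associate": {"infer"},
    "paralegal": {"infer"},
    "it_admin": set(),  # operates the server, but may not query the model
}


def is_allowed(role: str, action: str = "infer") -> bool:
    """Unknown roles get no permissions."""
    return action in ROLE_PERMISSIONS.get(role, set())


def build_log_record(
    user_id: str,
    role: str,
    prompt: str,
    response: Optional[str],
    encrypt: Callable[[str], str],
) -> str:
    """Serialise one inference request as a JSON log line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,                   # authenticated identity
        "role": role,
        "allowed": is_allowed(role),
        "prompt_encrypted": encrypt(prompt),  # never stored in the clear
        "response": response,                 # None when the request was denied
    }
    return json.dumps(record)
```

Denied requests are logged too, since repeated denials for the same identity are themselves a signal worth reviewing.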

Model File Integrity

When you download a model from Hugging Face, Ollama, or any other source, you are trusting that the file has not been tampered with. Verify checksums. Use signed model files where available. Store models in a read-only filesystem. Monitor for unexpected changes to model files — a compromised model could exfiltrate data through its outputs or produce subtly incorrect legal analysis.⁶
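Checksum verification can be as small as the following sketch, which streams the model file through SHA-256 and compares the result with the digest published by the model's distributor. The function names are hypothetical, and the expected digest must come from a source you trust independently of the download itself.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so multi-gigabyte weights fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the published one."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

Running the same check on a schedule, against a digest recorded at deployment time, is one way to notice unexpected changes to a model file after it has been installed.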

Inference Endpoint Security

Even a local model API should have:

Authentication: Every request must prove the caller's identity
Rate limiting: Prevent abuse and detect anomalous query patterns (EP36 inference attacks)
Input validation: Reject prompts that exceed expected length or contain suspicious patterns
Output filtering: Scan responses for unintended data leakage before returning to the user
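Two of the controls in the list above, rate limiting and input validation, are sketched below. The limits (30 requests per 60 seconds, 8,000 characters) and all names are hypothetical placeholders to be tuned against your own traffic; authentication and output filtering are left out of this example.

```python
import time
from collections import defaultdict, deque
from typing import Optional

MAX_PROMPT_CHARS = 8_000   # hypothetical limit
MAX_REQUESTS = 30          # hypothetical limit per caller per window
WINDOW_SECONDS = 60.0


class SlidingWindowLimiter:
    """Allow at most max_requests per caller in any window_seconds span."""

    def __init__(self, max_requests: int = MAX_REQUESTS,
                 window_seconds: float = WINDOW_SECONDS) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # caller_id -> request timestamps

    def allow(self, caller_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[caller_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def validate_prompt(prompt: str) -> bool:
    """Reject empty prompts and prompts over the length limit."""
    return 0 < len(prompt) <= MAX_PROMPT_CHARS
```

Rejected requests are worth logging with the caller's identity, since a caller who repeatedly hits the limit is exactly the anomalous query pattern the limiter is meant to surface.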

The Capability-Security Tradeoff

The honest challenge with local AI: smaller models are less capable. A 7-billion-parameter model running on a single GPU will not match GPT-4 or Claude Opus on complex legal reasoning, nuanced contract analysis, or multi-jurisdictional research. The security benefit of local deployment must be weighed against the quality risk of using a less capable model for consequential legal work.⁷

Model Size	Typical Hardware	Rough Capability	Best For
7B parameters	Single consumer GPU (24GB VRAM)	Basic summarisation, simple Q&A	Document triage, simple classification
14-35B parameters	Single high-end GPU (48GB VRAM)	Competent drafting, clause analysis	Contract review, privilege screening
70B+ parameters	Multi-GPU or cloud	Near-frontier quality	Complex legal reasoning, research memos
Frontier (200B+)	Cloud only	State of the art	Everything; required for some tasks
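A back-of-the-envelope check on the hardware column: the memory needed just to hold a model's weights is roughly the parameter count times the bytes per parameter. The sketch below computes that figure only; it ignores the KV cache, activations, and runtime overhead, so real requirements are higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_param / 8


# Compare 16-bit weights with 4-bit quantised weights for the sizes in the table.
for size in (7, 14, 35, 70):
    print(
        f"{size}B: 16-bit {weight_memory_gb(size, 16):.1f} GB, "
        f"4-bit {weight_memory_gb(size, 4):.1f} GB"
    )
```

On this arithmetic a 7B model at 16-bit precision needs about 14 GB and fits a 24 GB card, while a 35B model at 16-bit needs about 70 GB and would not fit in 48 GB. The middle row of the table therefore presumably assumes quantised weights; that reading is an inference, not something the episode states.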

Cost reality check: A cloud server with 8x NVIDIA H100 GPUs costs approximately $98 per hour. Once you have bought the same hardware, running it on-premises costs about $0.87 per hour in electricity. The breakeven point, where the cloud spend you avoided has paid for the purchase, is roughly 12 months of continuous use; after that, on-premises is dramatically cheaper.⁸ For a firm running AI workloads continuously, the economics of self-hosting are compelling. For occasional use, cloud is more practical.
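The breakeven figure can be sanity-checked with a few lines of arithmetic. The hourly rates below are the ones quoted above. The purchase price is not stated in the text, so the sketch derives the price implied by a 12-month continuous breakeven; the 2,500 hours per year used for the business-hours case (10 hours a day, 250 days) is an assumption for illustration.

```python
CLOUD_RATE = 98.00           # USD per hour, from the text
ONPREM_RATE = 0.87           # USD per hour in electricity, from the text
HOURS_PER_YEAR = 24 * 365    # continuous use


def implied_capex(breakeven_hours: float) -> float:
    """Hardware price that would be paid off after breakeven_hours of use."""
    return breakeven_hours * (CLOUD_RATE - ONPREM_RATE)


def breakeven_years(capex: float, hours_used_per_year: float) -> float:
    """Years until avoided cloud spend covers the hardware purchase."""
    return capex / (CLOUD_RATE - ONPREM_RATE) / hours_used_per_year


capex = implied_capex(HOURS_PER_YEAR)   # about 850,859 USD
continuous = breakeven_years(capex, HOURS_PER_YEAR)
business_hours = breakeven_years(capex, 2_500)
print(f"implied purchase price: ${capex:,.0f}")
print(f"breakeven, continuous use: {continuous:.1f} years")
print(f"breakeven, 2,500 h/year: {business_hours:.1f} years")
```

Under these assumptions the 12-month figure holds only at full utilisation; at 2,500 hours a year the same hardware takes about three and a half years to pay for itself, and the calculation still leaves out staffing, cooling, and hardware replacement.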

Regulatory Drivers

Several regulatory frameworks push toward local or on-premises AI for sensitive data:

GDPR Articles 44-49: Cross-border data transfers require adequate protections. Deploying the model locally, inside the jurisdiction where the data already sits, means no transfer takes place.
HIPAA (for health-adjacent legal work): Protected health information processed through an AI system must be covered by a Business Associate Agreement with the provider — or processed locally.
Data localisation laws: Jurisdictions including China, Russia, India, and increasingly the EU mandate that certain data categories remain within national borders.
Legal professional privilege (UK/SG): Disclosure to a third-party processor may not inherently waive privilege, but it complicates privilege claims and adds litigation risk.⁹

For law firms handling international matters, the safest approach is often to keep the AI local and avoid the jurisdictional analysis entirely.

Practical Recommendations

Default to cloud with redaction (EP39) for general-purpose tasks where client data can be effectively stripped
Deploy a local model for privileged, high-sensitivity, or regulated data that should not leave your network under any circumstances
Use the inference gateway (EP38) as the routing decision point — it examines each prompt, checks data classification, and routes to local or cloud accordingly
Secure the local endpoint with network isolation, authentication, rate limiting, and model file integrity verification
Monitor both paths — audit logs should capture every inference request regardless of whether it went to cloud or local, with the routing decision and its rationale logged

What's Next

Episode 41 moves to Module 9 — Audit and Logging, starting with Audit Log Design — the structured records that capture who did what, when, and to which resource. Every control we've discussed across 40 episodes depends on logs. If you can't prove it happened, it didn't happen.

Sources & Further Reading

1. Prediction Guard, Self-Hosted vs. Third-Party Deployment: A Technical Evaluation Guide for Regulated Enterprises.
2. DataNorth AI, Local LLM: Privacy, Security, and Control.
3. Vitalik Buterin, My Self-Sovereign / Local / Private / Secure LLM Setup (April 2026).
4. AIMultiple, Cloud LLM vs Local LLMs: Examples & Benefits.
5. Digital Applied, Local LLM Deployment: Privacy-First AI Complete Guide.
6. EPAM SolutionsHub, Open LLM Security Risks and Best Practices.
7. Unified AI Hub, On-Prem LLMs vs Cloud APIs: When to Run Models Locally.
8. GodOfPrompt, Local LLM Setup for Privacy-Conscious Businesses.
9. Spellbook, Most Private AI for Lawyers: Why Zero Data Retention Wins in 2026.
10. Matillion, Public vs Private LLMs: Secure AI for Enterprises.

Alice: Welcome back to Security for Legal SaaS. I'm Alice.

Dan: And I'm Dan. Episode 40 — local versus cloud AI and where the security boundaries actually are. Alice, we spent the last two episodes on protecting data when it goes to the cloud — API keys, redaction pipelines. This episode is about skipping the cloud entirely?

Alice: It's about knowing when to skip it and when not to. Running a model locally — on your own server, in your own office or data centre — means your data never leaves your network. No prompts cross the internet. No third-party processor handles your privileged communications. For the most sensitive legal work, that's the strongest possible posture. But "local" doesn't mean "automatically secure." It means the threat surface shifts.

Dan: Mm. Shifts how?

Alice: With cloud AI, the risks are about data leaving your control — the provider might be breached, subpoenaed, or use your data in ways you didn't expect. With local AI, those risks disappear. But now you own every other risk. You're responsible for patching the model, the operating system, the GPU drivers. You're responsible for physical security of the hardware. You're responsible for controlling who can access the model endpoint on your network. And here's one people don't think about — model provenance. When you download a model from Hugging Face or another repository, you're trusting that the file hasn't been tampered with. A compromised model could produce subtly incorrect legal analysis or even exfiltrate data through its outputs.

Dan: Right. So it's not a simple choice. What does the practical setup look like for a firm that wants to do both?

Alice: Hybrid architecture. You route sensitive data to a local model and general tasks to the cloud. The inference gateway we built in Episode 38 is the decision point — it examines each prompt, checks the data classification, and routes accordingly. Privileged communications, litigation strategy memos, anything with client PII — those go to the local model. General legal research, document formatting, analysis of publicly available information — those can go to the cloud, especially if you're applying the redaction pipeline from Episode 39.

Dan: Hmm. But here's the thing — local models are smaller, right? They're not as good as GPT-4 or Claude. Does that matter?

Alice: It matters a lot, and it's the honest tradeoff nobody wants to talk about. A 7-billion-parameter model running on a single GPU can do basic summarisation and simple classification. A 14 to 35-billion-parameter model on a high-end GPU can handle competent contract review and privilege screening. But for complex legal reasoning (multi-jurisdictional analysis, nuanced interpretation of case law, drafting sophisticated arguments) you need models at the 70-billion-parameter level or above, and those either require multiple GPUs or cloud infrastructure. Frontier models like GPT-4 or Claude Opus are still cloud-only.

Dan: Yeah. So a firm might use a local model for privilege review — where you're mostly classifying documents as privileged or not — but still need cloud AI for the heavy legal research?

Alice: Exactly. And that's a perfectly reasonable architecture. The privilege review is the highest-sensitivity task — you're handling privileged documents, and an incorrect classification has discovery consequences. The data should stay local. The legal research might involve only public information — case law, statutes, regulatory guidance — where there's no client data in the prompt at all. That's safe for the cloud.

Dan: Mm-hmm. What about the cost side? I've heard local AI is expensive upfront but cheap to run.

Alice: The numbers are striking. A cloud server with eight NVIDIA H100 GPUs — the kind of hardware you need for large models — costs about $98 per hour. Running the same hardware on-premises costs roughly 87 cents per hour in electricity. The breakeven is about 12 months of continuous use. After that, on-premises is dramatically cheaper. For a firm running AI workloads throughout the business day, every day, the economics of self-hosting become very attractive very quickly.

Dan: Right. But you need someone to manage the hardware.

Alice: You do, and that's a real cost. GPU procurement, cooling, power redundancy, monitoring, failover planning, driver updates. A solo practitioner isn't going to run a GPU server in a closet. But a mid-size firm or a legal technology company building products for the legal market? The infrastructure cost is increasingly justifiable, especially when the alternative is sending privileged data to the cloud.

Dan: Mm. Let's talk about securing the local model itself. You mentioned network isolation — what does that actually look like?

Alice: Same principles as Episode 15 on network segmentation. The local model endpoint should not be accessible from the public internet. Period. Deploy it in an isolated network segment. Only your application servers should be able to reach the model API. Use mutual TLS — mTLS, from Episode 15 — so that both the client and the server prove their identity before any data flows. Rate-limit queries, because the inference attacks we discussed in Episode 36 — model inversion, membership inference — work just as well against a local model as a cloud one. And log every single inference request with the authenticated identity of the requester.

Dan: Yeah. The model being local doesn't stop an insider from running thousands of queries to extract training data.

Alice: Not at all. Local deployment protects against external threats — your data doesn't cross the internet, third parties can't access it. But internal threats require the same controls regardless of where the model runs. Authentication, rate limiting, anomaly detection, audit logging. Speaking of which — next episode is about audit log design, and everything we've discussed about logging inference requests feeds directly into that.

Dan: Mm. Quick question on the regulatory side — are there situations where local AI isn't just better security, it's actually required?

Alice: Several. GDPR has strict rules about cross-border data transfers: if you process the personal data of people in the EU through a US cloud provider, you need adequate transfer mechanisms in place. Local deployment in the EU eliminates that concern entirely. Data localisation laws in jurisdictions like China, Russia, and India mandate that certain data stays within national borders. And for legal professional privilege specifically, while sending data to a cloud processor doesn't automatically waive privilege, it complicates privilege claims and adds litigation risk. If you can avoid the argument entirely by keeping the data local, that's a stronger position.

Dan: Hmm. So to summarise — the recommendation isn't "go local" or "go cloud." It's "know your data, know your risks, and route accordingly."

Alice: That's exactly it. Default to cloud with redaction for general tasks where client data can be effectively stripped. Deploy a local model for privileged, high-sensitivity, or regulated data that should never leave your network. Use your inference gateway to make the routing decision automatically based on data classification. Secure the local endpoint as rigorously as you'd secure any other internal service. And monitor both paths — audit logs should capture every inference request regardless of where it went.

Dan: Next episode — we move to Module 9, Audit and Logging. Starting with Audit Log Design — the foundation for proving that all these controls actually work.

Alice: Until then, I'm Alice.

Dan: And I'm Dan.

Alice: Security for Legal SaaS is a series written with AI assistance. Alice and Dan are AI-generated voices — no professional advice here, just education.
