Episode 14 · Module 4 · Transport Security

API Gateway Patterns and Rate Limiting

18 May 2026 · 9:11 · Security for Legal SaaS


As legal SaaS grows from monolith to microservices, security logic scatters. Alice and Dan cover API gateway architecture as a single enforcement point, rate limiting algorithms (fixed window, sliding window, token bucket, leaky bucket), per-user and per-endpoint limits, circuit breakers, and the legal-specific attack scenarios — document download abuse, search endpoint enumeration, and e-filing deadline denial — that make rate limiting a security control, not just a performance optimisation.

Today’s Lesson

Security for Legal SaaS — Episode 14: API Gateway Patterns and Rate Limiting

The Single Enforcement Point

As your legal SaaS grows from a monolith to multiple services, security logic scatters. Authentication checks get duplicated in every microservice (each small, independently deployed backend service). Rate limiting is reimplemented per endpoint. Logging becomes inconsistent across teams. An API gateway consolidates these concerns into a single enforcement layer: every request passes through one point where policy is applied consistently.

Key stat: Akamai's State of the Internet report found that API attacks grew 109% year-over-year, with credential abuse and resource exhaustion as the primary vectors. A gateway with proper rate limiting can throttle or block much of this automated traffic before it reaches your services.

For legal tech, where a single unthrottled API call might export thousands of privileged documents, the gateway isn’t a performance optimisation — it’s a security boundary.

Gateway as Security Layer

What the Gateway Enforces

Concern	Without Gateway	With Gateway
Authentication	Each service validates tokens independently	Gateway validates once; services trust internal traffic
Rate limiting	Per-service, inconsistent, often missing	Centralised, per-user, per-endpoint, global
Request validation	Scattered schema checks	Single schema enforcement point
Logging/audit	Inconsistent formats, gaps	Every request logged uniformly
TLS termination	Each service manages certificates	Gateway terminates TLS; internal traffic on private network
IP allowlisting	Firewall rules per service	Gateway-level enforcement

Kong, AWS API Gateway, Envoy, and Traefik are production-grade options — each with different tradeoffs between managed simplicity and configurability.
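To make the single-enforcement-point idea concrete, here is a minimal Python sketch of a gateway pipeline. It assumes a toy in-process model rather than any of the products above. Every request runs through an ordered list of checks, the first failing check short-circuits with an error status, and every outcome is logged. `VALID_TOKENS`, `require_auth`, and `require_matter_id` are hypothetical stand-ins for real token validation and schema enforcement.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Request:
    path: str
    token: Optional[str]
    body: dict = field(default_factory=dict)


@dataclass
class Response:
    status: int
    reason: str


# A check returns None to let the request continue, or a Response to reject it.
Check = Callable[[Request], Optional[Response]]


def make_gateway(checks: List[Check],
                 forward: Callable[[Request], Response],
                 audit_log: List[str]) -> Callable[[Request], Response]:
    """Build a handler that applies every check in order, logs every outcome,
    and forwards to the backend only if all checks pass."""
    def handle(req: Request) -> Response:
        for check in checks:
            rejection = check(req)
            if rejection is not None:
                audit_log.append(f"{req.path} rejected {rejection.status}")
                return rejection
        audit_log.append(f"{req.path} forwarded")
        return forward(req)
    return handle


VALID_TOKENS = {"token-alice"}  # hypothetical stand-in for real JWT/session validation


def require_auth(req: Request) -> Optional[Response]:
    if req.token not in VALID_TOKENS:
        return Response(401, "Unauthorized")
    return None


def require_matter_id(req: Request) -> Optional[Response]:
    # Hypothetical schema rule: filing requests must name the matter they belong to.
    if req.path.startswith("/api/filings") and "matter_id" not in req.body:
        return Response(400, "Bad Request")
    return None


audit: List[str] = []
gateway = make_gateway([require_auth, require_matter_id],
                       forward=lambda req: Response(200, "OK"),
                       audit_log=audit)
```

Calling `gateway(Request("/api/filings", None))` returns a 401 without the backend ever being invoked, and the rejection still lands in `audit`. In a real deployment each check would be a gateway plugin or filter, and `forward` would proxy to the backend service over the private network.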

Rate Limiting Strategies

Rate limiting prevents resource exhaustion, credential stuffing, data scraping, and denial-of-service. The strategy you choose depends on what you’re protecting.

Algorithm Comparison

Algorithm	Behaviour	Best For
Fixed window	Count requests per time window (e.g., 100/minute). Resets at window boundary.	Simple, predictable limits
Sliding window	Adds the current window’s count to the previous window’s count, weighted by how much of the previous window still overlaps. Smooths boundary spikes.	Most API rate limiting
Token bucket	Tokens refill at steady rate; requests consume tokens. Allows bursts up to bucket size.	Allowing legitimate traffic bursts
Leaky bucket	Requests queue and drain at a fixed rate. Arrivals to a full queue are dropped.	Smoothing traffic to backend
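The leaky bucket row is the least intuitive of the four, so here is a minimal sketch of the queue-based variant, assuming a single gateway process and caller-supplied timestamps in seconds. Instead of holding requests in memory, it records when each accepted request is scheduled to be released to the backend, and it drops a request when the queue of not-yet-released requests is full. The class and method names are illustrative, not from any library.

```python
from collections import deque
from typing import Deque, Optional


class LeakyBucket:
    """Queue-based leaky bucket: accepted requests are released to the backend
    at a fixed rate; a request arriving when the queue is full is dropped."""

    def __init__(self, queue_size: int, drain_per_second: float):
        self.queue_size = queue_size
        self.interval = 1.0 / drain_per_second   # seconds between releases
        self.last_release = float("-inf")        # release time of the newest accepted request
        self.pending: Deque[float] = deque()     # release times not yet reached

    def offer(self, now: float) -> Optional[float]:
        """Return the time this request should be forwarded, or None if dropped."""
        # Requests whose release time has arrived have left the queue.
        while self.pending and self.pending[0] <= now:
            self.pending.popleft()
        if len(self.pending) >= self.queue_size:
            return None
        release_at = max(now, self.last_release + self.interval)
        self.last_release = release_at
        self.pending.append(release_at)
        return release_at
```

`offer` returns the time at which the request should be forwarded, so the backend sees at most `drain_per_second` requests per second however bursty the arrivals are.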

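Fixed window edge case: A user sends 100 requests at 11:59:59 and 100 more at 12:00:01. Both bursts pass a 100/minute fixed window, but 200 requests hit your backend in 2 seconds. A sliding window counter mitigates this: at 12:00:01 the previous window’s 100 requests still count at about 98% weight (59 of its 60 seconds overlap the trailing minute), so the second burst is rejected almost immediately.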
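The boundary problem is easy to reproduce. This sketch assumes a single client and timestamps measured in seconds from 11:59:00 (so 11:59:59 is `59.0` and 12:00:01 is `61.0`), and runs the same 200-request burst through a fixed window and a sliding window counter. Both classes are illustrative; a production limiter would key counts per user and expire old windows.

```python
from collections import defaultdict


class FixedWindow:
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.counts = defaultdict(int)  # window index -> accepted requests (never pruned in this toy)

    def allow(self, now: float) -> bool:
        idx = int(now // self.window)
        if self.counts[idx] >= self.limit:
            return False
        self.counts[idx] += 1
        return True


class SlidingWindowCounter:
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.counts = defaultdict(int)

    def allow(self, now: float) -> bool:
        idx = int(now // self.window)
        # Fraction of the previous window that still overlaps the trailing window.
        overlap = 1 - (now % self.window) / self.window
        estimate = self.counts[idx - 1] * overlap + self.counts[idx]
        if estimate + 1 > self.limit:  # admitting this request would exceed the limit
            return False
        self.counts[idx] += 1
        return True


# 100 requests at 11:59:59 and 100 at 12:00:01, in seconds after 11:59:00.
burst = [59.0] * 100 + [61.0] * 100
```

With a 100/minute limit, the fixed window admits all 200 requests, while the sliding window counter admits 101: the previous window’s 100 requests still weigh about 98.3, leaving room for only one more.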

The Token Bucket for Legal SaaS

The token bucket algorithm is ideal for legal applications because it accommodates legitimate bursts — a lawyer batch-downloading case files at the start of their day — while maintaining a steady-state limit:

- Bucket capacity: Maximum burst size (e.g., 50 document downloads)

- Refill rate: Sustained throughput (e.g., 10 downloads per minute)

- Result: A lawyer can immediately download 50 files, then continue at 10 per minute once the burst is spent
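A minimal token bucket sketch using the example numbers above (capacity 50, refill 10 per minute), assuming one bucket per user held in process memory and timestamps in seconds. Tokens are topped up lazily from the elapsed time on each call rather than by a background timer. The class name and the `cost` parameter are illustrative assumptions.

```python
class TokenBucket:
    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start full, so a fresh bucket permits an immediate burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Credit tokens earned since the last call, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Capacity 50 downloads, refilling at 10 per minute.
bucket = TokenBucket(capacity=50, refill_per_second=10 / 60)
```

Starting full, the bucket admits 50 downloads at once and rejects the 51st; after that, a new token accrues every 6 seconds.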

Per-User, Per-Endpoint, and Global Limits

A single rate limit is insufficient. Layer them:

Scope	Example	Purpose
Per-user	1000 API calls/hour per authenticated user	Prevent compromised accounts from bulk exfiltration
Per-endpoint	10 document downloads/minute on `/api/documents/{id}/download`	Protect expensive operations
Per-IP	100 unauthenticated requests/minute per IP	Block credential stuffing before auth
Global	10,000 requests/second across all users	Protect infrastructure from DDoS
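Layering means a request must pass every applicable scope, and it should be charged against the counters only once it has passed all of them, so a request rejected by one layer doesn’t eat quota in another. The sketch below assumes simple fixed-window counters in process memory, uses the example limits from the table, and assumes the download limit is counted per user; the scope names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional, Tuple

# (max requests, window in seconds), using the example figures from the table.
LIMITS: Dict[str, Tuple[int, float]] = {
    "per_ip": (100, 60),             # unauthenticated requests per IP
    "per_user": (1000, 3600),        # all API calls per authenticated user
    "endpoint:download": (10, 60),   # document downloads, counted per user (assumption)
    "global": (10_000, 1),           # everything, across all users
}


class LayeredLimiter:
    def __init__(self, limits: Dict[str, Tuple[int, float]]):
        self.limits = limits
        self.counts = defaultdict(int)  # (scope, key, window index) -> admitted requests

    def check(self, now: float, scopes: Dict[str, Hashable]) -> Optional[str]:
        """`scopes` maps each applicable scope to its key (IP, user id, ...).
        Returns the name of the first exhausted scope, or None to admit."""
        slots = []
        for scope, key in scopes.items():
            limit, window = self.limits[scope]
            slot = (scope, key, int(now // window))
            if self.counts[slot] >= limit:
                return scope
            slots.append(slot)
        # Charge every layer only after all of them have passed.
        for slot in slots:
            self.counts[slot] += 1
        return None
```

The caller decides which scopes apply: unauthenticated requests carry a `per_ip` key, authenticated downloads carry `per_user` and `endpoint:download`, and everything carries `global`.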

Stripe’s engineering write-up on rate limiters describes a comparable layered setup in production: per-user request limits for fairness, concurrency limits that protect resource-intensive endpoints, and fleet-wide load shedding for stability.

Legal SaaS Attack Scenarios

Document Download Abuse

A compromised user account (or malicious insider) attempts to bulk-download all client files. Without per-endpoint rate limiting on the download API, thousands of privileged documents can be exfiltrated in minutes.

Mitigation: Per-user download limit of 50 documents/hour. Alert on any user exceeding 80% of the limit. Require step-up authentication (MFA re-verification) for bulk exports exceeding 20 documents in a session.
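The mitigation combines three rules: a hard hourly cap, an early-warning alert, and a session threshold that forces MFA re-verification. Here is a minimal sketch, assuming in-memory counters and that the caller has already authenticated the user and session. `DownloadGuard`, `Decision`, and `complete_step_up` are hypothetical names, and a real system would send the alert to your monitoring pipeline rather than return it.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

HOURLY_LIMIT = 50                       # per-user downloads per hour
ALERT_AT = HOURLY_LIMIT * 80 // 100     # alert when a user reaches 80% (40 downloads)
STEP_UP_AFTER = 20                      # downloads per session before MFA re-verification


@dataclass
class Decision:
    allowed: bool
    alert: bool = False
    step_up_required: bool = False


class DownloadGuard:
    def __init__(self) -> None:
        self.hourly: Dict[Tuple[str, int], int] = {}   # (user, hour index) -> downloads
        self.per_session: Dict[str, int] = {}          # session id -> downloads
        self.verified: Set[str] = set()                # sessions that passed step-up MFA

    def request(self, user: str, session: str, now: float) -> Decision:
        hour_key = (user, int(now // 3600))
        used = self.hourly.get(hour_key, 0)
        if used >= HOURLY_LIMIT:
            return Decision(allowed=False)
        in_session = self.per_session.get(session, 0)
        if in_session >= STEP_UP_AFTER and session not in self.verified:
            return Decision(allowed=False, step_up_required=True)
        self.hourly[hour_key] = used + 1
        self.per_session[session] = in_session + 1
        # Alert exactly once, on the download that reaches the 80% mark.
        return Decision(allowed=True, alert=(used + 1 == ALERT_AT))

    def complete_step_up(self, session: str) -> None:
        self.verified.add(session)
```

With these numbers, a scripted bulk download stalls at the 21st file until the session re-verifies MFA, raises one alert at the 40th file, and is blocked after the 50th.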

Search Endpoint Abuse

Full-text search endpoints are computationally expensive and reveal data through query patterns. An attacker systematically queries terms to map your document corpus without downloading files directly.

Mitigation: Rate limit search to 30 queries/minute per user. Log query patterns. Alert on systematic enumeration (sequential date ranges, alphabetical name lists). Consider differential privacy on result counts.
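Detecting “systematic enumeration” is necessarily heuristic. As one illustrative signal (an assumption, not a standard technique), this sketch flags a user whose last few search queries arrive in strictly ascending order. That catches both alphabetical name lists and ISO-formatted date sequences, since `YYYY-MM-DD` strings sort in date order. Real detection would combine several signals and tolerate occasional false positives.

```python
from collections import deque
from typing import Deque, Dict


class EnumerationDetector:
    """Hypothetical heuristic: flag a user whose last `window` queries are in
    strictly ascending order (alphabetical names, ISO dates)."""

    def __init__(self, window: int = 10):
        self.window = window
        self.recent: Dict[str, Deque[str]] = {}

    def observe(self, user: str, query: str) -> bool:
        history = self.recent.setdefault(user, deque(maxlen=self.window))
        history.append(query.strip().lower())
        if len(history) < self.window:
            return False
        items = list(history)
        return all(a < b for a, b in zip(items, items[1:]))
```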

E-Filing Deadline Denial

An attacker targets your e-filing API with legitimate-looking but invalid requests during known deadline periods (discovery deadlines, motion filing windows). The goal isn’t data theft: it’s consuming the slots in a shared rate limit pool so legitimate filings are rejected.

Mitigation: Priority queues for authenticated users with pending deadlines. Separate rate limit pools for filing operations vs. general API calls. Circuit breakers that shed load from non-critical endpoints to protect filing pathways.
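Separate pools are what stop the attack: if filing requests draw from their own shared capacity, a flood of general API calls cannot starve them. A reserved pool for users with pending deadlines, a simple stand-in here for true priority queueing, protects the most time-critical filings even if the rest of the filing pool is flooded. This is a minimal sketch, assuming in-process token pools, a hypothetical `/api/efiling` route prefix, and a `user_has_pending_deadline` flag that the server derives from its own docket data rather than trusting the request; the capacities are placeholders.

```python
from typing import Dict


class SharedPool:
    """Token pool shared by every caller in one traffic class."""

    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.rate = refill_per_second
        self.tokens = float(capacity)
        self.updated = 0.0

    def take(self, now: float) -> bool:
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Placeholder capacities; real figures would come from load testing.
POOLS: Dict[str, SharedPool] = {
    "general": SharedPool(capacity=100, refill_per_second=50),
    "filing": SharedPool(capacity=20, refill_per_second=5),
    "filing_deadline": SharedPool(capacity=20, refill_per_second=5),  # reserved capacity
}


def pool_for(path: str, user_has_pending_deadline: bool) -> str:
    if path.startswith("/api/efiling"):  # hypothetical route prefix
        return "filing_deadline" if user_has_pending_deadline else "filing"
    return "general"


def admit(path: str, user_has_pending_deadline: bool, now: float) -> bool:
    return POOLS[pool_for(path, user_has_pending_deadline)].take(now)
```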

Circuit Breakers

When a downstream service fails, cascading retries can collapse your entire system. The circuit breaker pattern prevents this:

State	Behaviour
Closed (normal)	Requests flow through; failures counted
Open (tripped)	Requests fail immediately without calling the downstream service
Half-open (testing)	Limited requests probe whether the service has recovered

For legal SaaS: if your document storage service becomes slow, a circuit breaker prevents the API gateway from queuing thousands of requests that overwhelm memory. Users see a fast “service temporarily unavailable” instead of a 30-second timeout followed by an error.
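A minimal single-threaded circuit breaker sketch, assuming that consecutive failures trip the breaker and that the clock is injectable for testing; the thresholds and the `CircuitOpenError` name are illustrative. A production breaker would also cap how many concurrent probes it lets through in the half-open state.

```python
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """Raised instead of calling the downstream service while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0          # consecutive failures
        self.opened_at = 0.0

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("service temporarily unavailable")  # fail fast
            self.state = "half_open"  # timeout elapsed: let this call probe the service
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed probe re-opens immediately; otherwise trip after enough failures.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self) -> None:
        self.state = "closed"
        self.failures = 0
```

The gateway would wrap each call to document storage in the breaker’s `call` method and translate `CircuitOpenError` into a fast 503 Service Unavailable response.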

Response Headers for Rate Limit Transparency

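The IETF RateLimit header fields specification (developed in the httpapi working group as draft-ietf-httpapi-ratelimit-headers) standardises how servers communicate rate limit status:

RateLimit-Limit: 100
RateLimit-Remaining: 67
RateLimit-Reset: 33

RateLimit-Reset is expressed in seconds until the quota resets, not as a Unix timestamp, so clients don’t depend on their clocks matching the server’s. Later revisions of the draft consolidate these values into a single structured RateLimit field plus a RateLimit-Policy field, so check which revision your clients expect.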

Include these on every response. Legitimate clients use them to self-throttle. Your own frontend uses them to display “please wait” instead of error messages. GitHub’s API demonstrates this pattern well, using its own X-RateLimit-* variants of these headers: clients can check remaining quota before making expensive calls.
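Generating the fields is straightforward once the limiter knows the quota, what remains, and how long until the window resets. This sketch assumes the three-field form, with `RateLimit-Reset` expressed as seconds until the quota resets, which is how the IETF draft defines it. `Retry-After` (defined in HTTP Semantics, RFC 9110) also accepts a delay in seconds, so a 429 response can reuse the same value. The function names are hypothetical.

```python
import math
from typing import Dict, Tuple


def rate_limit_headers(limit: int, remaining: int, reset_after: float) -> Dict[str, str]:
    """RateLimit-* fields; reset_after is seconds until the quota resets."""
    return {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        # Round up so a client that waits this long never retries too early.
        "RateLimit-Reset": str(math.ceil(reset_after)),
    }


def too_many_requests(limit: int, reset_after: float) -> Tuple[int, Dict[str, str]]:
    """A 429 response carrying the same fields plus Retry-After (delay in seconds)."""
    headers = rate_limit_headers(limit, 0, reset_after)
    headers["Retry-After"] = headers["RateLimit-Reset"]
    return 429, headers
```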

Implementation Architecture

Recommended stack:

- Gateway: Kong, AWS API Gateway, or Envoy (depending on cloud/self-hosted preference)

- Rate limit store: Redis (atomic INCR with TTL; see Redis rate limiting patterns)

- Counters: Sliding window with per-user and per-endpoint dimensions

- Response: 429 Too Many Requests with RateLimit headers and Retry-After

- Monitoring: Alert when any user consistently hits 80% of their limit

- Bypass: Internal service-to-service traffic bypasses user rate limits (it uses separate authentication, covered in Episode 15)
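The Redis counter can be sketched in a few lines. This version assumes a sliding window counter, one key per user, endpoint, and window index (the `rl:` key format is hypothetical), and only the redis-py client methods `incr`, `get`, and `expire`. An in-memory stand-in with the same three methods lets the example run without a server; in production you would pass a `redis.Redis(...)` client instead.

```python
import time
from typing import Dict, Optional


class InMemoryStore:
    """Stand-in exposing the three redis-py calls used below, so the sketch runs
    without a server. It does not actually expire keys."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def get(self, key: str) -> Optional[bytes]:
        value = self.data.get(key)
        return None if value is None else str(value).encode()  # redis-py returns bytes

    def expire(self, key: str, seconds: int) -> bool:
        return key in self.data


def sliding_window_allow(store, user: str, endpoint: str, limit: int,
                         window: int, now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    idx = int(now // window)
    current_key = f"rl:{user}:{endpoint}:{idx}"        # hypothetical key layout
    previous_key = f"rl:{user}:{endpoint}:{idx - 1}"
    # INCR is atomic, so concurrent gateway instances never lose an update.
    current = store.incr(current_key)
    if current == 1:
        # Keep each window's key for two windows: it must survive as "previous".
        store.expire(current_key, 2 * window)
    previous = int(store.get(previous_key) or 0)
    overlap = 1 - (now % window) / window
    return previous * overlap + current <= limit
```

Because INCR runs before the check, rejected attempts are counted too, which penalises clients that keep hammering a blocked endpoint. If the process dies between INCR and EXPIRE the key never expires; a MULTI/EXEC transaction that re-sets the TTL on every hit, or a Lua script, closes that gap.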

Conclusion

An API gateway is where security policy becomes enforceable infrastructure. Rate limiting protects against both external attacks and internal abuse — compromised accounts, malicious insiders, and automated scraping. For legal SaaS, where a single unthrottled endpoint can exfiltrate an entire case file repository, the gateway is not optional architecture — it’s a security control as fundamental as authentication.

Next episode: Service-to-Service Authentication — because your internal services need to prove their identity to each other too.

Alice: Welcome back to Security for Legal SaaS. I’m Alice.

Dan: And I’m Dan. Episode 14 — API Gateway Patterns and Rate Limiting. Alice, why does a legal SaaS application need a gateway in front of its APIs?

Alice: Because without one, security logic scatters. Every microservice — every small, independently deployed backend service — reimplements authentication. Rate limiting is inconsistent — some endpoints have it, some don’t. Logging has gaps. An API gateway is a single enforcement point — every request passes through one layer where policy is applied consistently. Authentication, rate limiting, request validation, audit logging — all in one place.

Dan: What does that look like architecturally?

Alice: Client requests hit the gateway first. The gateway validates the authentication token — the credential that proves a request comes from a logged-in user — checks rate limits, validates the request schema, logs the request, and then — only if everything passes — forwards to the appropriate backend service. The backend services trust internal traffic from the gateway. They don’t each need to re-validate tokens or implement their own rate limiting.

Dan: Let’s focus on rate limiting. Why is it a security concern and not just a performance concern?

Alice: Because without rate limits, a single compromised account can exfiltrate your entire document repository in minutes. A malicious insider can bulk-download every client’s privileged files before anyone notices. An attacker can credential-stuff your login endpoint (trying stolen username-password pairs from other breaches) at thousands of attempts per second. An automated scraper can map your entire matter database through the search API. Rate limiting isn’t just about server performance; it’s about limiting the blast radius of any successful attack.

Dan: What are the main algorithms? I know there’s more than one way to count requests.

Alice: Four common approaches. Fixed window — count requests per time period, like 100 per minute. Simple, but has an edge case where a user sends 100 requests at 11:59:59 and 100 more at 12:00:01 — both pass the limit but your backend sees 200 in two seconds. Sliding window fixes this by weighting the overlap between current and previous windows.

Dan: And the others?

Alice: Token bucket — imagine a bucket that refills at a steady rate. Each request takes a token. If the bucket is empty, the request is rejected. The bucket’s capacity determines burst size, the refill rate determines sustained throughput. This is ideal for legal SaaS because it accommodates legitimate bursts — a lawyer batch-downloading files at the start of their day — while maintaining a steady-state limit. Then there’s leaky bucket, which smooths traffic by draining requests at a fixed rate. Excess requests either queue or drop.

Dan: For legal SaaS specifically — what should the limits look like?

Alice: Layer them. Per-user: say 1000 API calls per hour for a standard account. This limits damage from a compromised account. Per-endpoint: 10 downloads per minute on the document download API, because that’s an expensive operation that directly accesses privileged data. Per-IP for unauthenticated endpoints: 100 requests per minute to block credential stuffing before authentication even happens. And a global limit, 10,000 requests per second total, to protect infrastructure from volumetric DDoS (distributed denial-of-service, where attackers flood your system with traffic from many sources).

Dan: Give me a specific attack scenario. What happens without proper rate limiting?

Alice: Scenario: opposing counsel compromises a junior associate’s credentials through phishing. They log in during off-hours, say 2am Sunday. Without per-endpoint rate limiting on the document download API, they write a script that downloads every document the associate has access to. Even in legal SaaS with proper multi-tenant access controls (multi-tenant meaning one system serves many separate law firms, each seeing only their own data), that associate might have access to dozens of matters, each with hundreds of files. Without rate limits, the entire corpus exfiltrates in under an hour. With a 50-downloads-per-hour limit and alerting at 80%, the script triggers a security alert at its 40th document and is cut off at 50.

Dan: What about search endpoint abuse? That’s more subtle.

Alice: Much more subtle. An attacker doesn’t need to download documents if they can enumerate them through search. They query systematically — every company name, every date range, every case number pattern. Over time, they build a map of your entire document corpus. What matters exist, who the parties are, what the date ranges suggest about deal timelines. Without search rate limiting, this is invisible. With it — 30 queries per minute per user, with alerting on systematic enumeration patterns — you catch it early.

Dan: You mentioned circuit breakers earlier. How do those fit in?

Alice: When a downstream service fails — your document storage becomes slow, your search index is overloaded — naive retry logic makes it worse. Every client retries, overwhelming the already-struggling service. A circuit breaker detects the failure, stops sending requests, and returns fast errors instead. Three states — closed is normal operation, open means the service is down and requests fail immediately, half-open sends limited probes to detect recovery. For legal SaaS — if document storage is slow, you want users to see "temporarily unavailable" instantly rather than waiting 30 seconds for a timeout.

Dan: There’s a specific legal tech scenario here too — deadline-based denial of service.

Alice: This one is nasty. An attacker doesn’t need to take your whole platform down. They just need to overwhelm your e-filing endpoint during the window when their opponent has a filing deadline. Send thousands of malformed but authentication-valid requests that consume rate limit slots. Legitimate filing attempts get a 429 Too Many Requests error — the server’s way of saying "slow down, you’ve exceeded your limit." The deadline passes. The fix is priority queues — authenticated users with pending court deadlines get separate rate limit pools from general API traffic. Filing operations get their own capacity that can’t be starved by other endpoints.

Dan: How should the gateway communicate limits to clients?

Alice: Standard response headers. RateLimit-Limit tells the client their quota. RateLimit-Remaining tells them how many requests they have left. RateLimit-Reset tells them when the window resets. Include these on every response. Your own frontend uses them to show users "you’ve used 80% of your download limit today" instead of just failing with no explanation. Well-behaved API clients self-throttle based on these headers — reducing load on your infrastructure organically.

Dan: Where do you store the rate limit counters?

Alice: Redis, a fast in-memory data store. You use its atomic increment, meaning the count increases in a single indivisible step so two simultaneous requests can’t both read the same number, together with a TTL (time-to-live, an automatic expiry timer). The sliding window implementation stores the counts for the current and previous windows, computes a weighted sum, and rejects the request if it exceeds the limit. Redis handles this at tens of thousands of operations per second with sub-millisecond latency. For multi-region deployments, you need to decide: local rate limiting per region, or global via a central Redis cluster. Local is faster but means a user can multiply their limit by hitting different regions.

Dan: Next episode we’re covering service-to-service authentication — because everything behind the gateway still needs to prove its identity.

Alice: That’s where mTLS (mutual TLS, where both sides verify each other’s identity), service meshes, and JWT service tokens (JSON Web Tokens, which we covered in Episode 6) replace the "trust the network" assumption that gets breached so often. Until then, I’m Alice.

Dan: And I’m Dan.

Alice: Security for Legal SaaS is a series written with AI assistance. Alice and Dan are AI-generated voices — no professional advice here, just education.
