Today’s Lesson
Security for Legal SaaS — Episode 44: Correlation IDs and Distributed Tracing
A User Reports "Something Went Wrong." Now What?
A lawyer using your legal SaaS platform clicks "Generate Summary" on a 200-page disclosure bundle. The spinner runs for thirty seconds. Then an error: "Something went wrong. Please try again." The lawyer contacts support. Your team opens the logs. The request touched the API gateway, the authentication service, the document parser, the AI summarisation engine, the billing service, and the audit logger. Six services, six separate log files, six different timestamps. Which log entry belongs to this request?
Without a correlation ID, you are searching a haystack for a needle that might be in a different haystack entirely. With one, you have a thread connecting every log entry, every service call, and every database query that this single request triggered.
What Is a Correlation ID?
A correlation ID is a unique identifier generated at the entry point of a request and propagated through every service that request touches. It is typically a UUID (Universally Unique Identifier, a 128-bit value that is usually randomly generated, first mentioned in Episode 10). Every log entry, every inter-service call, every database query includes this ID. When something goes wrong, you search for that single ID and instantly see the complete journey of the request across your entire system.
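To make the "search for that single ID" step concrete, here is a minimal sketch in Python. It assumes each service writes one JSON object per log line with a `correlation_id` field and an ISO 8601 UTC `timestamp`; the function name and the sample log lines are hypothetical.

```python
import json

def entries_for_request(log_lines, correlation_id):
    """Return parsed log entries that carry the given correlation ID, oldest first."""
    matches = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if isinstance(entry, dict) and entry.get("correlation_id") == correlation_id:
            matches.append(entry)
    # ISO 8601 UTC timestamps in one consistent format sort correctly as strings
    return sorted(matches, key=lambda e: e.get("timestamp", ""))

# Hypothetical lines gathered from three services' log files
sample_logs = [
    '{"timestamp": "2026-05-18T14:32:07Z", "service": "document-parser", '
    '"correlation_id": "7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b", "message": "PDF extraction complete"}',
    '{"timestamp": "2026-05-18T14:32:03Z", "service": "api-gateway", '
    '"correlation_id": "7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b", "message": "request received"}',
    '{"timestamp": "2026-05-18T14:32:05Z", "service": "billing", '
    '"correlation_id": "11111111-2222-4333-8444-555555555555", "message": "usage recorded"}',
    "plain text line without structure",
]

journey = entries_for_request(sample_logs, "7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b")
```

In practice a log aggregation tool runs this query for you. The point is that one field turns six log files into one ordered story.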
Microsoft's Engineering Fundamentals Playbook [1] defines correlation IDs as identifiers that "track one request end-to-end across all services while staying the same from entry point to final response." The concept is simple. The discipline of propagating it consistently through every layer of your stack is where most teams fail.
Key distinction: A correlation ID stays the same across the entire request. A span ID identifies a single operation within that request — one database query, one API call, one function execution. A trace ID groups all spans into a tree structure showing parent-child relationships. Together, they form a complete picture: the trace shows you the shape of the request, spans show you the individual steps, and the correlation ID ties it all to the original user action.
From Correlation IDs to Distributed Tracing
Correlation IDs tell you which log entries belong together. Distributed tracing goes further — it shows you the structure, timing, and dependencies of every operation within the request. Think of it as the difference between knowing which pages of a case file relate to the same matter (correlation ID) and having a full timeline of events with causal links between them (distributed trace).
OpenTelemetry [2] is the de facto industry standard for distributed tracing. It is an open-source observability framework: a set of tools, APIs, and SDKs that instrument your applications to collect traces, metrics, and logs in a standardised format. OpenTelemetry is vendor-neutral, meaning your traces can be sent to any compatible backend: Jaeger, Zipkin, Datadog, Grafana Tempo, or AWS X-Ray.
How Context Propagation Works
When Service A calls Service B, the trace context (including the trace ID and current span ID) must travel with the request. OpenTelemetry's context propagation [2] handles this automatically. For HTTP calls, the default propagator uses the W3C Trace Context `traceparent` header:
```text
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```
This single header carries four hyphen-separated fields: a format version, the trace ID, the parent span ID, and trace flags (here `01`, meaning the request was sampled). The receiving service extracts the context, creates a new span as a child of the incoming span, and continues propagation to the next service.
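The extract-then-continue step can be sketched by hand in Python. This is for illustration only: in a real service the OpenTelemetry SDK does this for you, and the names `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical helpers, not part of any library.

```python
import re
import secrets
from typing import NamedTuple, Tuple

# version - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")

class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    flags: str

def parse_traceparent(header: str) -> TraceContext:
    """Split a traceparent header value into its four fields, rejecting malformed input."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        raise ValueError("malformed traceparent header")
    ctx = TraceContext(*match.groups())
    # All-zero trace IDs and span IDs are defined as invalid
    if ctx.trace_id == "0" * 32 or ctx.parent_span_id == "0" * 16:
        raise ValueError("all-zero trace ID or span ID")
    return ctx

def child_traceparent(incoming: TraceContext) -> Tuple[str, str]:
    """Create a new span ID and the header value to send to the next service."""
    span_id = secrets.token_hex(8)  # 8 random bytes, rendered as 16 hex characters
    # Same trace ID and flags; our new span becomes the parent of the next hop
    return span_id, f"00-{incoming.trace_id}-{span_id}-{incoming.flags}"

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
new_span_id, outgoing = child_traceparent(ctx)
```

The trace ID never changes along the way. Only the span ID field is replaced at each hop, which is what lets the backend rebuild the parent-child tree.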
For a legal SaaS platform, a single "summarise this document" request might generate a trace like:
| Span | Service | Duration | Notes |
|---|---|---|---|
| `api-gateway` | Gateway | 15,230ms | Root span (end-to-end total); auth check, rate limit |
| `auth-verify` | Auth Service | 12ms | JWT validation |
| `doc-parse` | Document Parser | 3,400ms | PDF extraction |
| `ai-summarise` | AI Engine | 11,200ms | LLM inference |
| `audit-log` | Audit Logger | 45ms | Provenance record (EP43) |
| `billing-record` | Billing | 18ms | Usage metering |
The trace waterfall — a visual representation showing each span as a horizontal bar on a timeline — immediately reveals that the AI inference took 11.2 seconds of the total 15.2 seconds. That is actionable information for both performance tuning and incident investigation.
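The arithmetic behind that reading is easy to check. The sketch below copies the durations from the table and assumes the child spans run one after another without overlapping, which the table itself does not state.

```python
# Span durations in milliseconds, copied from the table above
total_ms = 15_230  # the api-gateway span, covering the whole request
child_spans_ms = {
    "auth-verify": 12,
    "doc-parse": 3_400,
    "ai-summarise": 11_200,
    "audit-log": 45,
    "billing-record": 18,
}

slowest = max(child_spans_ms, key=child_spans_ms.get)
share = child_spans_ms[slowest] / total_ms

# Time spent in the gateway itself or in transit between services,
# under the no-overlap assumption stated above
unaccounted_ms = total_ms - sum(child_spans_ms.values())

print(f"{slowest}: {share:.1%} of the request; {unaccounted_ms} ms outside child spans")
```

Roughly three quarters of the request is LLM inference, and a little over half a second is not attributed to any child span, which may be worth its own investigation.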
Security Applications of Distributed Tracing
Distributed tracing is not just a performance tool. It is a security tool. Traces provide incident reconstruction capability [3] that logs alone cannot match.
Incident Reconstruction
When a security incident occurs (unauthorised data access, a privilege escalation, an injection attempt), traces let you reconstruct exactly what happened. Which service received the malicious input? Where did it propagate? Which downstream services were affected? Traditional log analysis requires correlating timestamps across services and hoping the clocks are synchronised. Traces give you the causal chain directly.
Anomaly Detection
Traces establish baseline patterns for normal behaviour. A document review request that normally touches four services but suddenly touches eight, including the admin API, is an anomaly worth investigating. Trace-based anomaly detection [4] can flag these patterns automatically.
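The simplest form of this check is a set comparison. The sketch below assumes each span is available as a dictionary with a `service` key and that you keep a baseline set per request type; the service names and the function are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical baseline: services a normal document review request touches
DOCUMENT_REVIEW_BASELINE: Set[str] = {
    "api-gateway",
    "auth-service",
    "document-parser",
    "audit-logger",
}

def unexpected_services(baseline: Set[str], spans: List[Dict]) -> Set[str]:
    """Return services that appear in the trace but not in the baseline."""
    observed = {span["service"] for span in spans}
    return observed - baseline

# A hypothetical trace in which the request also reached the admin API
suspicious_trace = [
    {"service": "api-gateway"},
    {"service": "auth-service"},
    {"service": "document-parser"},
    {"service": "admin-api"},
    {"service": "audit-logger"},
]

flagged = unexpected_services(DOCUMENT_REVIEW_BASELINE, suspicious_trace)
```

Real anomaly detection would also look at call order, timing, and frequency, but even this membership check catches the "why did a review request reach the admin API" case.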
Access Pattern Auditing
For multi-tenant legal SaaS (where multiple law firms share the same infrastructure, isolated by tenant boundaries as we discussed in Episode 8), traces can verify that a request from Firm A never touched Firm B's data partition. The trace shows every service hop and every database query, making cross-tenant access violations visible.
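A minimal sketch of that verification, assuming spans that touch tenant data record an opaque `tenant_id` attribute (an identifier, never the firm's name). The attribute name, span names, and tenant values are hypothetical.

```python
from typing import Dict, List

def cross_tenant_violations(spans: List[Dict], request_tenant: str) -> List[Dict]:
    """Return spans that touched a tenant other than the one that made the request."""
    violations = []
    for span in spans:
        touched = span.get("tenant_id")  # None for spans that touch no tenant data
        if touched is not None and touched != request_tenant:
            violations.append(span)
    return violations

# Hypothetical trace for a request made by tenant "t_firm_a"
trace_spans = [
    {"name": "api-gateway"},
    {"name": "doc-parse", "tenant_id": "t_firm_a"},
    {"name": "db-query documents", "tenant_id": "t_firm_b"},
    {"name": "audit-log", "tenant_id": "t_firm_a"},
]

violations = cross_tenant_violations(trace_spans, "t_firm_a")
```

This only works if every data-touching span actually records the tenant it accessed, so the attribute has to be set in the data access layer rather than left to individual handlers.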
Security warning: Traces themselves can leak sensitive information. OpenTelemetry's documentation warns [2] that internal trace IDs, span IDs, or baggage items might reveal information about internal architecture. Avoid putting sensitive data (user credentials, API keys, client names, case details) in span attributes or baggage. Trace the structure of the request, not the content.
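One way to enforce this is an allowlist applied before attributes are attached to a span. The sketch below is an assumption about how you might structure it; the key names are hypothetical and the filter is plain Python, not an OpenTelemetry API.

```python
from typing import Any, Dict

# Hypothetical allowlist: structural facts only, no content and no secrets
SAFE_ATTRIBUTE_KEYS = {"document_id", "pages", "duration_ms", "tenant_id"}

def safe_span_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only attributes that are explicitly allowed; drop everything else."""
    return {key: value for key, value in attributes.items() if key in SAFE_ATTRIBUTE_KEYS}

raw = {
    "document_id": "doc_8821",
    "pages": 200,
    "client_name": "Example Client Ltd",  # would identify the client
    "api_key": "sk-not-a-real-key",       # a credential
}

cleaned = safe_span_attributes(raw)
```

An allowlist fails safe: a new sensitive field added by a developer next month is dropped by default, whereas a blocklist would let it through until someone remembers to update it.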
Implementing Correlation IDs in Your Legal SaaS
Step 1: Generate at the Edge
The correlation ID should be generated at the first point of entry: your API gateway or load balancer. If the incoming request already carries a correlation ID (from a client SDK or a frontend), validate its format before accepting it, and replace it with a fresh one if validation fails. If the request carries none, generate a new UUID.
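A minimal sketch of the accept-or-generate rule, using only Python's standard library. It assumes the ID arrives in an `X-Correlation-ID` header and that headers are available as a plain dictionary with that exact casing; a real framework would give you case-insensitive header access.

```python
import uuid
from typing import Mapping

def resolve_correlation_id(headers: Mapping[str, str]) -> str:
    """Reuse a well-formed incoming correlation ID, otherwise mint a new UUID."""
    candidate = headers.get("X-Correlation-ID", "")
    try:
        # uuid.UUID raises ValueError for anything that is not a valid UUID string
        return str(uuid.UUID(candidate))
    except ValueError:
        # Missing or malformed: never echo untrusted text into the logs
        return str(uuid.uuid4())

kept = resolve_correlation_id({"X-Correlation-ID": "7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b"})
replaced = resolve_correlation_id({"X-Correlation-ID": "abc\ninjected log line"})
generated = resolve_correlation_id({})
```

Validation matters because the value ends up in every log line. An unvalidated header is an easy route for log injection.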
Step 2: Propagate Through Every Layer
Every inter-service call must forward the correlation ID. For HTTP calls, use a standard header like `X-Correlation-ID` or the W3C `traceparent` header. For message queues (Kafka, RabbitMQ, SQS), include it in message metadata. For database queries, include it in query comments or context fields.
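Forwarding is easiest to get right when it lives in one helper that every outbound call goes through. The sketch below builds (but does not send) an HTTP request with the standard library and wraps a queue message with metadata. The URL, function names, and message envelope shape are hypothetical.

```python
import urllib.request
from typing import Dict, Optional

def outbound_headers(correlation_id: str, traceparent: Optional[str] = None) -> Dict[str, str]:
    """Headers that every inter-service HTTP call should carry."""
    headers = {"X-Correlation-ID": correlation_id}
    if traceparent:
        headers["traceparent"] = traceparent
    return headers

def build_downstream_request(url: str, correlation_id: str,
                             traceparent: Optional[str] = None,
                             body: bytes = b"") -> urllib.request.Request:
    """Build a POST request with the IDs attached; the caller decides when to send it."""
    return urllib.request.Request(
        url, data=body, headers=outbound_headers(correlation_id, traceparent), method="POST"
    )

def queue_message(payload: Dict, correlation_id: str) -> Dict:
    """Wrap a payload so the correlation ID travels in message metadata, not in the body."""
    return {"metadata": {"correlation_id": correlation_id}, "payload": payload}

cid = "7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b"
request = build_downstream_request(
    "http://document-parser.internal/parse",  # hypothetical internal URL
    cid,
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
)
message = queue_message({"document_id": "doc_8821"}, cid)
```

Brokers such as Kafka, RabbitMQ, and SQS each have their own mechanism for per-message metadata, so the envelope here stands in for whichever one your broker provides.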
Step 3: Include in Every Log Entry
Every log line in every service must include the correlation ID. This is typically implemented through middleware that extracts the ID from the incoming request and injects it into the logging context. Structured logging [5] (logging in JSON format with consistent field names) makes this searchable. A typical entry looks like this:
```json
{
  "timestamp": "2026-05-18T14:32:07Z",
  "level": "info",
  "correlation_id": "7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service": "document-parser",
  "message": "PDF extraction complete",
  "document_id": "doc_8821",
  "pages": 200,
  "duration_ms": 3400
}
```
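The "inject it into the logging context" step can be sketched with Python's standard `logging` and `contextvars` modules. This is one possible arrangement, not the only one: the class names and `build_logger` are hypothetical, and a real service would set the context variable in request middleware rather than at module level.

```python
import contextvars
import io
import json
import logging
import time

# Holds the current request's correlation ID; "-" means no request is in flight
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent field names."""
    converter = time.gmtime  # emit UTC timestamps

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "correlation_id": record.correlation_id,
            "service": record.name,
            "message": record.getMessage(),
        })

def build_logger(service: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(service)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger

buffer = io.StringIO()  # stands in for stdout or a log file
log = build_logger("document-parser", buffer)
correlation_id_var.set("7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b")
log.info("PDF extraction complete")
```

Because the ID is attached by a filter, application code calls `log.info(...)` as usual and cannot forget to include it.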
Step 4: Connect to Your Alerting
Correlating traces with logs and metrics [6] through shared identifiers creates a unified observability stack. When an alert fires (high error rate, unusual latency, a spike in failed authentications), you can drill directly from the alert to the affected traces and then to the individual log entries. No more guessing which log file to grep.
Sampling Strategies — You Cannot Trace Everything
In a high-throughput legal SaaS platform processing thousands of document reviews per hour, tracing every single request generates enormous volumes of data. Sampling strategies [7] balance observability with cost:
| Strategy | How It Works | Best For |
|---|---|---|
| Head-based | Decide at request start whether to trace (e.g., 10% of requests) | Steady-state monitoring |
| Tail-based | Trace everything, but only keep traces that meet criteria (errors, high latency) | Incident investigation |
| Priority-based | Always trace requests from specific users, endpoints, or tenants | VIP clients, admin operations |
For legal SaaS, always trace: authentication failures, admin operations, cross-tenant queries, and AI generation requests (these connect to the provenance chains from Episode 43). Sample everything else based on volume.
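The three strategies and the always-trace rule can be combined in a few lines. The sketch below is a simplified illustration, not OpenTelemetry's sampler API: the request-kind labels, the 10% default, and the 5,000 ms slow threshold are all assumptions you would tune.

```python
# Hypothetical labels for the request kinds that must always be traced
ALWAYS_TRACE = {"auth_failure", "admin_operation", "cross_tenant_query", "ai_generation"}

def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide up front, deterministically, from the trace ID.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is never half recorded.
    """
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < int(rate * 2**64)

def should_trace(trace_id: str, request_kind: str, rate: float = 0.10) -> bool:
    """Priority-based rule first, then fall back to head-based sampling."""
    return request_kind in ALWAYS_TRACE or head_sample(trace_id, rate)

def keep_after_completion(had_error: bool, duration_ms: int,
                          slow_threshold_ms: int = 5_000) -> bool:
    """Tail-based: decide once the request has finished and its outcome is known."""
    return had_error or duration_ms >= slow_threshold_ms

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
routine = should_trace(trace_id, "document_list")  # subject to the 10% rate
admin = should_trace(trace_id, "admin_operation")  # always traced
slow = keep_after_completion(had_error=False, duration_ms=15_230)
```

Tail-based sampling needs somewhere to buffer complete traces until the decision is made, which is why it is usually done in a collector rather than inside each service.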
Connecting Tracing to Your Security Posture
Distributed tracing integrates with your existing security infrastructure. Feed trace data into your SIEM (Security Information and Event Management, the system that aggregates security logs and alerts, introduced in Episode 7) for correlation with other security signals. Cross-signal correlation [8], which links traces with metrics (like CPU spikes) and logs (like error messages), gives you the most complete picture of what happened and why.
The provenance chains from Episode 43 and the correlation IDs from this episode are complementary. Provenance records what an AI system produced and from what inputs. Correlation IDs trace how the request moved through your system. Together, they give you complete accountability — from the user's click to the AI's output and everything in between.
Key takeaway: Correlation IDs and distributed tracing are not DevOps luxuries. They are security infrastructure. Without them, incident investigation is guesswork, anomaly detection has no request-level baseline to work from, and proving what happened to a regulator requires stitching together fragments from a dozen separate log files. Build them in from the start.
What's Next
Next episode, we move to Module 10 — Infrastructure and Deployment. We'll start with Docker security and container hardening: what containers actually are, why "a container is not a security boundary," and how to configure them so they're at least a useful speed bump.
Sources & references
- Correlation IDs — Microsoft Engineering Fundamentals Playbook
- Context Propagation — OpenTelemetry
- Trace Correlation for Root Cause Analysis — OneUptime
- Improving Platform Observability with Distributed Tracing — JAVAPRO
- Distributed Tracing Logs: How They Work — Groundcover
- Correlate OpenTelemetry Traces and Logs — Datadog
- Trace ID vs Correlation ID — Last9
- OpenTelemetry Signal Correlation — OneUptime
- Observability Beyond Monitoring: OpenTelemetry — Java Code Geeks
- AI Dial Core — Correlation IDs Delivery (GitHub Issue)