Episode 44 · Module 9 · Audit & Logging

Correlation IDs and Distributed Tracing

19 May 2026 · 9:10 · Security for Legal SaaS


A lawyer using your legal SaaS platform clicks "Generate Summary" on a 200-page disclosure bundle. The spinner runs for thirty seconds. Then an error: "Something went wrong. Please try again." The lawyer contacts support. Your team opens the logs. The request touched the API gateway, the authentication service, the document parser, the AI summarisation engine, the billing service, and the audit logger. Six services, six separate log files, six different timestamps. Which log entry belongs to this request?


A User Reports "Something Went Wrong." Now What?

Without a correlation ID, you are searching a haystack for a needle that might be in a different haystack entirely. With one, you have a thread connecting every log entry, every service call, and every database query that this single request triggered.

What Is a Correlation ID?

A correlation ID is a unique identifier, typically a UUID (Universally Unique Identifier, a 128-bit value that is usually randomly generated, first mentioned in Episode 10), created at the entry point of a request and propagated through every service that request touches. Every log entry, every inter-service call, and every database query includes this ID. When something goes wrong, you search for that single ID and see the complete journey of the request across your entire system.

Microsoft's Engineering Fundamentals Playbook ¹ defines correlation IDs as identifiers that "track one request end-to-end across all services while staying the same from entry point to final response." The concept is simple. The discipline of propagating it consistently through every layer of your stack is where most teams fail.

Key distinction: A correlation ID stays the same across the entire request. A span ID identifies a single operation within that request — one database query, one API call, one function execution. A trace ID groups all spans into a tree structure showing parent-child relationships. Together, they form a complete picture: the trace shows you the shape of the request, spans show you the individual steps, and the correlation ID ties it all to the original user action.

From Correlation IDs to Distributed Tracing

Correlation IDs tell you which log entries belong together. Distributed tracing goes further — it shows you the structure, timing, and dependencies of every operation within the request. Think of it as the difference between knowing which pages of a case file relate to the same matter (correlation ID) and having a full timeline of events with causal links between them (distributed trace).

OpenTelemetry ² is the industry standard for distributed tracing. It is an open-source observability framework — a set of tools, APIs, and SDKs that instrument your applications to collect traces, metrics, and logs in a standardised format. OpenTelemetry is vendor-neutral, meaning your traces can be sent to any compatible backend: Jaeger, Zipkin, Datadog, Grafana Tempo, or AWS X-Ray.

How Context Propagation Works

When Service A calls Service B, the trace context — including the trace ID and current span ID — must travel with the request. OpenTelemetry's context propagation ² handles this automatically through HTTP headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

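This single header, whose format is defined by the W3C Trace Context specification, carries four hyphen-separated fields: a version (`00`), the trace ID, the parent span ID, and trace flags (`01` means the trace is sampled). The receiving service extracts the context, creates a new span as a child of the incoming span, and continues propagation to the next service.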
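To make the header layout concrete, here is a minimal Python sketch that splits the `traceparent` value above into its fields. The names `TraceParent` and `parse_traceparent` are hypothetical, and the parser only accepts the version `00` layout; in practice an OpenTelemetry SDK does this for you.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_span_id, flags = match.groups()
    # The specification treats all-zero IDs and version "ff" as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_span_id, sampled)


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

Returning `None` for anything malformed lets the receiving service start a fresh trace rather than trust a header it cannot parse.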

For a legal SaaS platform, a single "summarise this document" request might generate a trace like:

| Span | Service | Duration | Notes |
| --- | --- | --- | --- |
| `api-gateway` | Gateway | 15,230ms | Auth check, rate limit |
| `auth-verify` | Auth Service | 12ms | JWT validation |
| `doc-parse` | Document Parser | 3,400ms | PDF extraction |
| `ai-summarise` | AI Engine | 11,200ms | LLM inference |
| `audit-log` | Audit Logger | 45ms | Provenance record (EP43) |
| `billing-record` | Billing | 18ms | Usage metering |

The trace waterfall — a visual representation showing each span as a horizontal bar on a timeline — immediately reveals that the AI inference took 11.2 seconds of the total 15.2 seconds. That is actionable information for both performance tuning and incident investigation.
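The arithmetic behind that reading can be checked with a short Python sketch using the durations from the table. It assumes `api-gateway` is the root span and that the other five spans run one after another without overlap, which the table does not actually state.

```python
# Durations in milliseconds, copied from the span table.
ROOT_SPAN_MS = 15_230  # api-gateway, assumed to be the root span
CHILD_SPANS_MS = {
    "auth-verify": 12,
    "doc-parse": 3_400,
    "ai-summarise": 11_200,
    "audit-log": 45,
    "billing-record": 18,
}


def slowest_span(children: dict[str, int]) -> tuple[str, int]:
    """Return the name and duration of the longest child span."""
    name = max(children, key=children.get)
    return name, children[name]


name, duration = slowest_span(CHILD_SPANS_MS)
share = duration / ROOT_SPAN_MS
# Time spent in the gateway itself, outside any child span.
gateway_overhead_ms = ROOT_SPAN_MS - sum(CHILD_SPANS_MS.values())
print(f"{name}: {share:.1%} of the request, gateway overhead {gateway_overhead_ms}ms")
```

Under those assumptions, `ai-summarise` accounts for about 73.5% of the request, and roughly 555ms is spent in the gateway outside any child span.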

Security Applications of Distributed Tracing

Distributed tracing is not just a performance tool. It is a security tool. Traces provide incident reconstruction capability ³ that logs alone cannot match.

Incident Reconstruction

When a security incident occurs (unauthorised data access, a privilege escalation, an injection attempt), traces let you reconstruct exactly what happened. Which service received the malicious input? Where did it propagate? Which downstream services were affected? Traditional log analysis requires correlating timestamps across services and hoping the clocks are synchronised. Traces give you the causal chain directly.

Anomaly Detection

Traces establish baseline patterns for normal behaviour. A document review request that normally touches four services suddenly touching eight — including the admin API — is an anomaly worth investigating. Trace-based anomaly detection ⁴ can flag these patterns automatically.
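As a rough illustration, the simplest version of this check compares the services seen in a trace against an expected set for that request type. The service names and the `BASELINE_SERVICES` mapping below are hypothetical; a real system would likely learn baselines from historical traces rather than hard-code them.

```python
# Hypothetical baseline: services a normal request of each type touches.
BASELINE_SERVICES = {
    "document-review": {
        "api-gateway",
        "auth-service",
        "document-parser",
        "audit-logger",
    },
}


def unexpected_services(request_type: str, services_seen: set[str]) -> set[str]:
    """Return services in the trace that are not in the baseline.

    An unknown request type has an empty baseline, so every service
    it touched is reported.
    """
    expected = BASELINE_SERVICES.get(request_type, set())
    return services_seen - expected


alerts = unexpected_services(
    "document-review",
    {"api-gateway", "auth-service", "document-parser", "audit-logger", "admin-api"},
)
```

Here `alerts` contains only `admin-api`, the hop that a document review request has no reason to make.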

Access Pattern Auditing

For multi-tenant legal SaaS (where multiple law firms share the same infrastructure, isolated by tenant boundaries as we discussed in Episode 8), traces can verify that a request from Firm A never touched Firm B's data partition. The trace shows every service hop and every database query, making cross-tenant access violations visible.
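A sketch of that check in Python, assuming each span records an opaque tenant identifier for the data it touched. The `Span` class and its field names are hypothetical and are not part of any tracing SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    service: str
    operation: str
    tenant_id: str  # opaque tenant identifier, not the firm's name


def cross_tenant_spans(request_tenant: str, spans: list[Span]) -> list[Span]:
    """Return every span that touched data outside the requesting tenant."""
    return [span for span in spans if span.tenant_id != request_tenant]


trace = [
    Span("document-parser", "load_document", "tenant_a"),
    Span("ai-engine", "summarise", "tenant_a"),
    Span("billing", "record_usage", "tenant_b"),  # should never happen
]
violations = cross_tenant_spans("tenant_a", trace)
```

Any non-empty result is worth an alert: in this example the `billing` span is flagged because it ran against `tenant_b` during a `tenant_a` request.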

Security warning: Traces themselves can leak sensitive information. OpenTelemetry's documentation warns ² that internal trace IDs, span IDs, or baggage items might reveal information about internal architecture. Avoid putting sensitive data — user credentials, API keys, client names, case details — in span attributes or baggage. Trace the structure of the request, not the content.
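One way to enforce this is an allowlist applied before attributes are attached to a span. The sketch below is illustrative only: the key names in `ALLOWED_ATTRIBUTE_KEYS` are assumptions, and it operates on a plain dictionary rather than a real span object.

```python
# Hypothetical allowlist: structural facts only, never request content.
ALLOWED_ATTRIBUTE_KEYS = {
    "service",
    "operation",
    "duration_ms",
    "status",
    "tenant_id",
}


def scrub_attributes(attributes: dict) -> dict:
    """Keep only allowlisted keys; everything else is dropped."""
    return {
        key: value
        for key, value in attributes.items()
        if key in ALLOWED_ATTRIBUTE_KEYS
    }


safe = scrub_attributes(
    {
        "operation": "db.query",
        "duration_ms": 45,
        "sql": "SELECT * FROM matters WHERE client = 'Example LLP'",
        "api_key": "sk-not-a-real-key",
    }
)
```

An allowlist fails closed: a new attribute that nobody reviewed is dropped by default, whereas a denylist would let it through until someone remembered to block it.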

Implementing Correlation IDs in Your Legal SaaS

Step 1: Generate at the Edge

The correlation ID should be generated at the first point of entry: your API gateway or load balancer. If the incoming request already carries a correlation ID (from a client SDK or a frontend), validate its format and accept it only if it is well formed. If it is missing or malformed, generate a new UUID.
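A minimal sketch of that rule in Python, using the standard library `uuid` module. The function name `resolve_correlation_id` is hypothetical; where it is called from depends on your gateway.

```python
import uuid
from typing import Optional


def resolve_correlation_id(incoming: Optional[str]) -> str:
    """Accept a well-formed client UUID, otherwise mint a new one."""
    if incoming:
        try:
            # Parsing and re-serialising normalises the value and rejects
            # anything that is not a UUID (including log-injection attempts).
            return str(uuid.UUID(incoming.strip()))
        except ValueError:
            pass
    return str(uuid.uuid4())


accepted = resolve_correlation_id("7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b")
replaced = resolve_correlation_id("not-a-uuid\nlevel=error forged entry")
generated = resolve_correlation_id(None)
```

Re-serialising the parsed value, rather than echoing the client's string, means nothing attacker-controlled beyond 32 hex digits ever reaches your logs.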

Step 2: Propagate Through Every Layer

Every inter-service call must forward the correlation ID. For HTTP calls, use a conventional header such as `X-Correlation-ID` or the W3C standard `traceparent` header. For message queues (Kafka, RabbitMQ, SQS), include it in message metadata. For database queries, include it in query comments or context fields.
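The sketch below shows one way to carry the ID through a Python service using the standard library `contextvars` module, so helper functions can attach it to outgoing HTTP headers and queue messages without passing it as an argument everywhere. The helper names and the message envelope shape are assumptions, not a specific client library's API.

```python
import contextvars
from typing import Optional

# Set once per request by the entry-point middleware.
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)

HEADER_NAME = "X-Correlation-ID"


def outgoing_headers(base: Optional[dict] = None) -> dict:
    """Copy the caller's headers and add the current correlation ID."""
    headers = dict(base or {})
    correlation_id = correlation_id_var.get()
    if correlation_id is not None:
        headers[HEADER_NAME] = correlation_id
    return headers


def queue_message(payload: dict) -> dict:
    """Wrap a payload in an envelope whose metadata carries the ID."""
    return {
        "metadata": {"correlation_id": correlation_id_var.get()},
        "payload": payload,
    }


correlation_id_var.set("7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b")
headers = outgoing_headers({"Accept": "application/json"})
message = queue_message({"document_id": "doc_8821"})
```

A context variable keeps the ID scoped to the current request even when a service handles many requests concurrently with threads or `asyncio`.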

Step 3: Include in Every Log Entry

Every log line in every service must include the correlation ID. This is typically implemented through middleware that extracts the ID from the incoming request and injects it into the logging context. Structured logging ⁵ — logging in JSON format with consistent field names — makes this searchable.

```json
{
  "timestamp": "2026-05-18T14:32:07Z",
  "level": "info",
  "correlation_id": "7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service": "document-parser",
  "message": "PDF extraction complete",
  "document_id": "doc_8821",
  "pages": 200,
  "duration_ms": 3400
}
```
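A log entry of this shape can be produced with Python's standard `logging` module. The following is a sketch under a few assumptions: the correlation ID has already been stored in a context variable by request middleware, and the hypothetical `JsonFormatter` emits only a subset of the fields shown above.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Assumed to be set by request middleware before any logging happens.
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent field names."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(),
            "level": record.levelname.lower(),
            "correlation_id": correlation_id_var.get(),
            "service": self.service,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger("document-parser", buffer)
correlation_id_var.set("7f3a2b91-4e5c-4d2a-9f1b-3c8d7e6f5a4b")
log.info("PDF extraction complete")
entry = json.loads(buffer.getvalue())
```

Because the formatter reads the ID itself, application code calls `log.info(...)` as usual and cannot forget to include it.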

Step 4: Connect to Your Alerting

Correlating traces with logs and metrics ⁶ through shared identifiers creates a unified observability stack. When an alert fires — high error rate, unusual latency, failed authentication spike — you can drill directly from the alert to the affected traces to the individual log entries. No more guessing which log file to grep.

Sampling Strategies — You Cannot Trace Everything

In a high-throughput legal SaaS platform processing thousands of document reviews per hour, tracing every single request generates enormous volumes of data. Sampling strategies ⁷ balance observability with cost:

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Head-based | Decide at request start whether to trace (e.g., 10% of requests) | Steady-state monitoring |
| Tail-based | Trace everything, but only keep traces that meet criteria (errors, high latency) | Incident investigation |
| Priority-based | Always trace requests from specific users, endpoints, or tenants | VIP clients, admin operations |

For legal SaaS, always trace: authentication failures, admin operations, cross-tenant queries, and AI generation requests (these connect to the provenance chains from Episode 43). Sample everything else based on volume.
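The three strategies can be combined in a few lines. This Python sketch is illustrative rather than any particular SDK's sampler: the category names, the 10% rate, and the 10-second latency threshold are all assumptions.

```python
# Priority-based: categories that are always traced (names are hypothetical).
ALWAYS_TRACE = {
    "auth_failure",
    "admin_operation",
    "cross_tenant_query",
    "ai_generation",
}


def sample_at_start(trace_id: str, category: str, rate: float = 0.10) -> bool:
    """Head-based decision, made when the request enters the system."""
    if category in ALWAYS_TRACE:
        return True
    # Map the low 64 bits of the trace ID to a value in [0, 1). Every
    # service that sees the same trace ID reaches the same decision.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < rate


def keep_after_completion(
    had_error: bool, duration_ms: int, latency_threshold_ms: int = 10_000
) -> bool:
    """Tail-based decision, made once the whole trace is available."""
    return had_error or duration_ms > latency_threshold_ms


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
routine = sample_at_start(trace_id, "document_view")
ai_request = sample_at_start(trace_id, "ai_generation")
slow_request = keep_after_completion(had_error=False, duration_ms=15_230)
```

Deriving the head-based decision from the trace ID, instead of calling a random number generator in each service, avoids traces where some services recorded their spans and others did not.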

Connecting Tracing to Your Security Posture

Distributed tracing integrates with your existing security infrastructure. Feed trace data into your SIEM (Security Information and Event Management — the system that aggregates security logs and alerts, introduced in Episode 7) for correlation with other security signals. Cross-signal correlation ⁸ — linking traces with metrics (like CPU spikes) and logs (like error messages) — gives you the most complete picture of what happened and why.

The provenance chains from Episode 43 and the correlation IDs from this episode are complementary. Provenance records what an AI system produced and from what inputs. Correlation IDs trace how the request moved through your system. Together, they give you complete accountability — from the user's click to the AI's output and everything in between.

Key takeaway: Correlation IDs and distributed tracing are not DevOps luxuries. They are security infrastructure. Without them, incident investigation is guesswork, anomaly detection is impossible, and proving what happened to a regulator requires stitching together fragments from a dozen separate log files. Build them in from the start.

What's Next

Next episode, we move to Module 10 — Infrastructure and Deployment. We'll start with Docker security and container hardening: what containers actually are, why "a container is not a security boundary," and how to configure them so they're at least a useful speed bump.

Sources & references

Alice: Welcome back to Security for Legal SaaS. I'm Alice.

Dan: And I'm Dan. Episode 44 — correlation IDs and distributed tracing. This sounds like it might get deep into the weeds, Alice. Break it down for me.

Alice: Let me set the scene. A lawyer using your legal SaaS platform clicks "Generate Summary" on a 200-page disclosure bundle. The spinner runs. Then an error — "Something went wrong. Please try again." The lawyer contacts support. Your team opens the logs. The problem is, that single click triggered activity across six different services — the API gateway, the authentication service, the document parser, the AI engine, the billing service, and the audit logger. Six services, six separate log files. Which log entries belong to this particular request?

Dan: Mm. That's like looking at six different filing cabinets for documents related to one matter, except none of them have the same case number.

Alice: Exactly. And that's the problem a correlation ID solves. It's a unique identifier — think of it as a case reference number — that gets created when a request first enters your system and then travels with that request through every single service it touches. Every log entry, every service call, every database query includes this ID. When something goes wrong, you search for that one ID and you instantly see the complete journey of the request across your entire platform.

Dan: Right. So it's generated once and just... passed along?

Alice: That's it. Your API gateway — the front door of your system — either receives one from the client or generates a fresh one. Then every time one service calls another, it includes the correlation ID in the request header. Every service writes the ID into its logs. One thread connecting everything.

Dan: Okay, so that's correlation IDs. What's distributed tracing? Is that different?

Alice: It builds on top of it. A correlation ID tells you which log entries belong together. Distributed tracing goes further — it shows you the structure, the timing, and the parent-child relationships of every operation within the request. Think of the difference between knowing which pages of a case file relate to the same matter versus having a full timeline of events with causal links between them. A trace shows you: this service called that service, which took 3 seconds, then called this other service, which took 11 seconds. You can see the shape of the request — where it went, in what order, and where the time was spent.

Dan: Hmm. So if I'm looking at that failed summary request, the trace would show me that the document parser took 3 seconds, the AI engine took 11 seconds, and then timed out?

Alice: Exactly. You'd see each step as a horizontal bar on a timeline — that's called a trace waterfall. And you'd immediately see that the AI inference was the bottleneck. No guessing, no correlating timestamps across different log files, no hoping the clocks are in sync.

Dan: Mm-hmm. And the industry standard for this is OpenTelemetry, right? I've heard the name.

Alice: OpenTelemetry is the dominant open-source framework. It gives you a standardised way to instrument your applications — meaning you add hooks that automatically collect trace data — and it works with any backend. Jaeger, Datadog, Grafana, AWS X-Ray. The clever part is context propagation. When Service A calls Service B over HTTP, OpenTelemetry automatically includes the trace context in a header. The receiving service picks it up, creates a child span — that's the name for one individual operation within the trace — and continues the chain. You don't have to manually wire this up for every call.

Dan: Yeah, that sounds manageable. But this is a security podcast, not a DevOps podcast. What's the security angle here?

Alice: This is where most teams miss the point. They think tracing is only for performance debugging. But traces are one of the most powerful security tools you can deploy. First, incident reconstruction. When you have a security incident (an injection attempt, unauthorised data access, a privilege escalation), the trace shows you exactly what happened. Which service received the malicious input, where it propagated, which downstream services were affected. Without traces, you're stitching together fragments from a dozen log files.

Dan: Mm. What about detecting attacks in the first place?

Alice: That's the second use. Traces establish baseline patterns. A document review request normally touches four services. If one suddenly touches eight — including the admin API — that's an anomaly. You can build automated detection rules on trace patterns, not just on individual log entries. Third, for multi-tenant platforms — where multiple law firms share the same infrastructure — traces can verify that a request from Firm A never touched Firm B's data partition. The trace shows every service hop and every database query.

Dan: Right. But here's a concern — doesn't all this tracing data itself become a security risk? You're recording the path of every request.

Alice: Good instinct. Yes. OpenTelemetry's own documentation warns about this. Internal trace data can reveal your architecture — service names, dependencies, database structure. And if developers accidentally put sensitive data in span attributes — client names, case details, authentication tokens — then the tracing backend becomes a high-value target. The rule is: trace the structure of the request, not the content. Record that a database query ran and took 45 milliseconds. Don't record the query itself or the data it returned.

Dan: Mm. Makes sense. One practical question — can you trace everything? Doesn't this generate massive amounts of data?

Alice: It does. And you don't want to trace every single request in a high-volume system — that's expensive and mostly redundant. Sampling strategies help. Head-based sampling decides at the start — trace ten percent of requests. Tail-based sampling traces everything but only keeps traces that meet certain criteria, like errors or unusually high latency. And priority-based sampling always traces certain categories — admin operations, authentication failures, AI generation requests. For legal SaaS, my recommendation is: always trace security-relevant events and AI generation. Sample the rest.

Dan: Yeah, and that connects to the provenance chains from last episode, right? The AI generation traces feed into the provenance record?

Alice: They're complementary. Provenance records what the AI produced and from which inputs — model version, source documents, retrieval results. The correlation ID and trace show how the request moved through your system to get to that generation step. Together, you have complete accountability. From the lawyer's click, through authentication, through document parsing, through AI inference, to the final output. Every step documented, every step traceable.

Dan: Mm-hmm. And when a regulator asks "what happened to this client's document" — you can answer definitively.

Alice: Down to the millisecond. That's the promise. The practical advice is: implement correlation IDs first — they're simple and immediately valuable. Generate a UUID at your API gateway, propagate it through headers, include it in every log line. Then add OpenTelemetry for full distributed tracing. The instrumentation is mostly automatic for common frameworks. The security value comes from connecting traces to your alerting and your SIEM — the security information system we covered in Episode 7 — so anomalous patterns trigger investigation, not just performance dashboards.

Dan: Good breakdown. Next episode — we start Module 10, Infrastructure and Deployment. Docker security and container hardening. What containers are, why they're not the security boundary people think they are, and how to lock them down.

Alice: Until then, I'm Alice.

Dan: And I'm Dan.

Alice: Security for Legal SaaS is a series written with AI assistance. Alice and Dan are AI-generated voices — no professional advice here, just education.
