Distributed Tracing: The Why, What, and How?
In this post, I will describe one key challenge in complex distributed systems (observability), explain why it is a challenge (hint: it is related to one logical operation being handled by many participants), build up with you, in a bottom-up manner, the building blocks needed to address it, and then extend that solution to be open and interoperable.
Ready to go on this journey? Let’s first start with a story about what happened to one of my restaurant orders...
A story of a delayed restaurant order
Once, a friend and I visited a new restaurant on the first day of its operations. We arrived early enough and placed our order. Slowly, more people started arriving and it got busier. After about 20 minutes, our food hadn’t arrived.
More time went by. We started seeing food being delivered to other tables. We saw this as a positive sign of the system being operational (you can see how optimistic we were!). A while later, we called the staff and asked about our order. They promised to check but they didn’t seem to know how to find out what was happening with our order. We waited some more. And some more. Though it felt like an eternity, after about 1.5 hours of waiting, we finally got our food.
Understanding what is happening in a complex distributed system can be a challenge.
You can think of this example as an operation in a simple distributed system with multiple participants: the customer, the waiter, the kitchen manager, the chef, the dispatcher, and finally the waiter who delivers the food. In a more complex distributed system, a single logical operation can involve work done by hundreds or thousands of participants. When an operation is delayed or encounters an error in such a system, you want to arrive at one logical explanation of what is happening to that operation. But the distributed nature of the system makes this a challenge. Sure, each participant will have its own view of how it handled its request in the context of the overall operation. But how do you tie them all together to understand what is happening? More generally, how do we gain the ability to answer questions about a system so we can understand it better: Which participants were involved in handling an operation? What is the critical path? Where do failures happen? What characteristics did orders that took more than 20 minutes have in common? What are the performance trends over time? What are the reliability trends over time? You should be able to ask any such question and get answers.
Let’s look more into this Observability challenge in complex distributed systems and how we can address it.
One Key Reason Why Observability Is A Challenge In Complex Systems
Ben Sigelman, a co-creator of the Dapper Distributed Tracing system, in one of his talks, describes how we have evolved from a model of “No concurrency” to “Distributed Concurrency” over the years. Let’s see what this means:
No Concurrency -> Basic Concurrency -> Asynchronous Concurrency -> Distributed Concurrency
No concurrency
A long time ago, a service such as a web server or SMTP server handled requests by forking a child process for each new request. When you wanted to diagnose what went wrong with a request, you just had to look at the logs of that process.
Basic Concurrency
Systems then evolved to be multi-threaded, with a single thread handling each request. While this still has scalability challenges, to diagnose issues you could use the thread name to understand what happened in the context of a request. My first ever project in my career (with an e-mail service provider) was building such an SMTP daemon, replacing a previous implementation that used the “child process” approach.
Asynchronous Concurrency
Systems then evolved so that each thread contributes to handling multiple requests: thread A could start handling a request, but execution of that request may continue on a different thread. To diagnose such systems, the process or thread information alone is not sufficient: you need to capture a unique identifier for each request.
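To make this concrete, here is a minimal sketch (my own illustration, in Python) of carrying a per-request identifier along with the work so that every log line can be tied back to the request, even when execution hops between threads or async tasks:

    import contextvars
    import logging
    import uuid

    # Holds the current request's ID; context variables follow the logical flow
    # of execution, unlike plain thread-local storage.
    request_id = contextvars.ContextVar("request_id", default="-")

    class RequestIdFilter(logging.Filter):
        def filter(self, record):
            record.request_id = request_id.get()
            return True

    logging.basicConfig(format="%(asctime)s [req=%(request_id)s] %(message)s", level=logging.INFO)
    logger = logging.getLogger("order-service")
    logger.addFilter(RequestIdFilter())

    def handle_request():
        request_id.set(uuid.uuid4().hex)  # assign a unique ID as the request arrives
        logger.info("started handling request")
        logger.info("finished handling request")

    handle_request()

Every log line now carries the request’s identifier, which is exactly the idea that distributed tracing generalizes across process boundaries.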
Distributed Concurrency
With microservices, we have now moved to the distributed concurrency model. An operation may start in Service A, which does some work in the context of that operation, then makes a network call to Service B, which does some work for it, and then calls Service C, and so on. In a system with dozens or hundreds of such participants, it becomes much more challenging to get a sense of what happened in the context of a single operation, or to ask questions about the overall system based on a set of operations happening in it.
The Building Blocks Needed To Tie Things Together In A Complex Distributed System (Take 1)
Let’s dive more into a few key building blocks we need to tie things together:
First, we need a way to uniquely identify a logical operation flowing through a system.
We also need a way for each participant to know this unique identifier, so that each participant can associate the work it does with its part of that operation. But how can each participant know it? One participant has to create the identifier first, and it must then somehow be “propagated” to the others. This mechanism must be protocol specific: e.g., for HTTP, it must be through an HTTP header.
We then also need a way for each participant to emit what it did. This should cover information like: What time did a participant’s work start? When did it end? What attributes or events are associated with that unit of work? We should be able to uniquely identify (within the scope of a single logical operation) the unit of work done by each participant. We also need a way to aggregate these log streams (of what each participant did) to reconstruct the full picture of what happened.
But the above is not enough: we also need to establish the causality of the work done by different participants. For this, we need a way to capture the parent/child relationships: when a participant calls another, the callee/child needs to know the unique identifier of the caller/parent’s unit of work, so that the child can emit it. How does the child get this? It must be propagated.
If we have all the above, we will gain the ability to connect the dots together.
The set of related events emitted by various participants, triggered by the same logical operation, is called a Distributed Trace. It can span process, network, and security boundaries. The context that is propagated between the participants is referred to as the “trace context”; it includes the identifiers described above.
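To make these building blocks concrete, here is a minimal, hand-rolled sketch (my own illustration in Python, not a real tracing library): a trace ID for the whole operation, an ID per unit of work, propagation over hypothetical HTTP headers, and emitted records that a collector could group by trace ID to reconstruct the trace.

    import json, time, uuid

    def new_id():
        return uuid.uuid4().hex[:16]

    def start_unit_of_work(name, incoming_headers=None):
        """Create (or continue) a trace and return this participant's context."""
        incoming_headers = incoming_headers or {}
        trace_id = incoming_headers.get("x-trace-id", uuid.uuid4().hex)
        parent_id = incoming_headers.get("x-parent-id")  # caller's unit of work, if any
        return {
            "trace_id": trace_id,
            "span_id": new_id(),
            "parent_id": parent_id,
            "name": name,
            "start": time.time(),
        }

    def outgoing_headers(ctx):
        """Headers to attach to downstream calls so the callee can join the trace."""
        return {"x-trace-id": ctx["trace_id"], "x-parent-id": ctx["span_id"]}

    def finish_unit_of_work(ctx):
        """Emit a record; a collector can group records by trace_id to rebuild the trace."""
        ctx["end"] = time.time()
        print(json.dumps(ctx))

    # Service A handles an operation and calls Service B:
    a = start_unit_of_work("service-a: handle order")
    b = start_unit_of_work("service-b: cook order", incoming_headers=outgoing_headers(a))
    finish_unit_of_work(b)
    finish_unit_of_work(a)

Real tracing systems refine these ideas (sampling, batching, clock handling), but the essence is the same: create an identifier, propagate it, record each unit of work, and correlate the records.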
OK, great, so you come up with your own scheme for defining and propagating this context. You update all your applications to support it. Things are getting better. But one day, when you are trying to diagnose an issue, you get stuck: you realize that some of your dependencies from other vendors don’t understand your scheme and are dropping your identifiers. They happen to have their own formats. You are back where you started and cannot reason about the whole system.
How can we solve this?
Take 2: One More Building Block (Interoperability) To Tie Things Together In A Complex System
Why did the context propagation stop? Our operation was crossing boundaries to services provided by other vendors. But we hadn’t agreed upon a common mechanism for the trace context. Hence, even if they had support for distributed tracing using a different format, we still wouldn’t have been able to correlate across all the services.
The one additional thing we need is interoperability. We need a mechanism that everybody can agree to use. In other words, we need a standard. Thanks to the efforts of many contributors across the industry, we have just that: the W3C TraceContext specification. You can find the official version at https://www.w3.org/TR/trace-context/.
What does the W3C TraceContext define?
This specification defines a universally agreed upon format to propagate trace context data. It specifies standard HTTP headers and standard value formats. It also defines a mechanism to forward vendor-specific trace data. I will provide a non-normative summary below, but for full details I would encourage you to look at the official specification. Here are the two headers that make up the “trace context”:
Header #1: traceparent HTTP header
This includes information on the unique identifier for the overall operation (trace-id), the identifier for a distinct unit of work (parent-id), a version identifier, and flags (e.g., used to communicate whether the caller decided to “sample in” a trace).
traceparent: version-traceid-parentid-flags
e.g.: traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
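As an illustration (a non-normative sketch of my own, in Python), generating and parsing a traceparent value looks like this:

    import secrets

    def make_traceparent(sampled=True):
        trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex characters
        parent_id = secrets.token_hex(8)   # 8 random bytes  -> 16 hex characters
        flags = "01" if sampled else "00"  # the lowest bit signals the "sampled" decision
        return f"00-{trace_id}-{parent_id}-{flags}"

    def parse_traceparent(value):
        version, trace_id, parent_id, flags = value.split("-")
        return {"version": version, "trace-id": trace_id, "parent-id": parent_id, "flags": flags}

    print(make_traceparent())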
Header #2: tracestate HTTP header
This is for use by tracing systems to propagate any vendor-specific state. It is a set of key-value pairs. Note that it is not meant for propagating user-defined key-value pairs: there is another draft specification for that (W3C Baggage), which I will cover later.
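For illustration, a tracestate value carrying entries for two hypothetical tracing vendors could look like this (each vendor owns its own key and its value format):

    tracestate: rojo=00f067aa0ba902b7,congo=t61rcWkgMzE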
With this standard, we get an interoperable way to do context propagation. But how can we have applications and services instrument and collect tracing data in an open and industry standard way?
OpenTelemetry: An Open And Vendor-Neutral Way To Work With Telemetry Data
Let’s say you have several applications and services in your system. To be able to better understand your system, you want to start instrumenting them and do it in an open and vendor agnostic way. This is where you can benefit from OpenTelemetry.
What is OpenTelemetry?
OpenTelemetry is a set of APIs, SDKs, tooling, and integrations for signals such as traces, logs, metrics, and baggage. Each signal is a way for a system to describe something: e.g., for traces, a participant can describe what it is doing for an operation, and it can do it using the OpenTelemetry API. Signals operate independently from each other. OpenTelemetry support is available in multiple languages.
Its architecture carefully separates the cross-cutting concerns (e.g., API or semantic conventions) from the portions that can be managed independently (e.g., SDKs, which are implementations of the APIs). It is extensible in each layer: e.g., you can choose an exporter to send data to a backend system of your choice.
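As a minimal sketch (assuming the Python opentelemetry-api and opentelemetry-sdk packages; exact names may vary by version), wiring up the SDK and choosing an exporter looks roughly like this. ConsoleSpanExporter simply prints spans; you could swap in an exporter for the backend of your choice.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # The SDK's TracerProvider implements the vendor-neutral API.
    provider = TracerProvider()
    # The exporter is the pluggable piece: swap it to send spans to another backend.
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)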
Tracing in OpenTelemetry
A trace is defined implicitly by its spans. The edges between spans capture the parent-child relationships. A span represents a distinct unit of work by a participant in the overall trace. Each span has the following state: a name, start and end timestamps, attributes (a set of key-value pairs), zero or more events, the parent span’s identifier, links to causally related spans, and the context information that needs to be propagated to child spans.
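Continuing the setup sketched above, creating spans with the OpenTelemetry Python API looks roughly like this (the span names and attributes are illustrative):

    from opentelemetry import trace

    tracer = trace.get_tracer("restaurant.example")

    with tracer.start_as_current_span("place-order") as span:
        span.set_attribute("order.table", 7)      # key-value attribute
        span.add_event("order received")          # timestamped event on the span
        with tracer.start_as_current_span("cook-order"):
            pass                                  # child span; parenting is handled for you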
An easy way to get started with OpenTelemetry is to use its instrumentation libraries. These are libraries that enable OpenTelemetry support for another library (e.g., an HTTP client). You will find instrumentation libraries for common scenarios: collecting telemetry about outgoing HTTP requests, gRPC calls, etc.
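For example, assuming the opentelemetry-instrumentation-requests package is installed, the following sketch enables automatic spans (and traceparent header injection) for outgoing calls made with the popular requests library:

    import requests
    from opentelemetry.instrumentation.requests import RequestsInstrumentor

    RequestsInstrumentor().instrument()  # patch the requests library once at startup

    # This call now produces a client span and propagates the trace context downstream.
    requests.get("https://example.com/api/orders")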
Wrapping up
Observability is a key challenge in complex distributed systems composed of thousands of microservices. With distributed tracing, we gain the ability to answer questions about such a system and understand it better.