OpenTelemetry Introduction
July 10, 2022 | DevOps, Tracing, Monitoring | ...
What is OpenTelemetry
OpenTelemetry was formed by merging the OpenTracing and OpenCensus projects, with the goal of staying compatible with existing technologies in the field.
OpenTracing created a set of vendor-agnostic APIs for collecting tracing data, primarily in cloud native applications. A developer who wants to collect tracing data in their application does not have to code against a vendor’s API, but instead uses an open API that the respective vendors implement. This is similar to how Slf4J serves as a logging API that lets you plug in Log4J, Logback, etc. as the logging backend. OpenCensus is a project by Google to collect observability data through standard formats that are also vendor agnostic. OpenTelemetry combines both ideas: it provides vendor-agnostic APIs for instrumenting applications, collects the resulting data from multiple sources, and only then passes it on to a vendor’s tool.
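To make the API versus implementation split a bit more concrete, here is a minimal sketch in Python. It is not the OpenTelemetry API: `Tracer`, `NoopTracer`, `RecordingTracer` and `handle_request` are names I made up purely for illustration.

```python
from typing import List, Protocol, Tuple


class Tracer(Protocol):
    """Hypothetical vendor-neutral API that application code depends on."""

    def record_span(self, name: str, duration_ms: float) -> None: ...


class NoopTracer:
    """Default implementation: the instrumentation stays in place but does nothing."""

    def record_span(self, name: str, duration_ms: float) -> None:
        pass


class RecordingTracer:
    """Stand-in for a vendor backend; it simply keeps spans in memory."""

    def __init__(self) -> None:
        self.spans: List[Tuple[str, float]] = []

    def record_span(self, name: str, duration_ms: float) -> None:
        self.spans.append((name, duration_ms))


def handle_request(tracer: Tracer) -> str:
    # Application code only knows the Tracer interface, never the vendor class.
    tracer.record_span("handle_request", 12.5)
    return "ok"
```

Swapping `NoopTracer` for `RecordingTracer` (or any other implementation) requires no change to `handle_request`, which is the same idea as exchanging the logging backend behind Slf4J.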
So, if you are using Prometheus and Jaeger today, you might want to switch to the OpenTelemetry APIs and the OpenTelemetry Collector as an intermediary, and only then send the tracing and metrics data to your observability tools. With this abstraction, you can run multiple observability tools side by side, route data to different places, sample data, etc. All of this is independent of the specifics of any vendor. However, you are then dependent on OpenTelemetry.
The observability signals for OpenTelemetry are:
- Logs: Traditional logs printed to console, file, or in case of a cloud native application, to a log-collector which indexes them and provides a search interface. If logs are written in the context of a trace, the trace and logs can be linked and made discoverable together.
- Traces: A trace consists of spans, each representing a unit of work such as a function call or a request to another service. The trace follows the path of invocations no matter in which service they take place.
- Metrics: Time series data, for example traffic count, error count, response times, etc.
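Additionally, OpenTelemetry specifies Baggage, which is a set of key-value pairs that travels along with a request across service boundaries and can be used to annotate the signals above.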
As of July 2022, only Traces and Metrics have been fully specified. Both have well-tested instrumentation libraries for the most common languages and frameworks. The OpenTelemetry Collector can also export these to common observability tools such as Prometheus, OpenCensus, Zipkin and Jaeger, as well as to Kafka. Logs are neither fully specified nor implemented yet, but work is in progress. Check out the OpenTelemetry Status Page to see an up-to-date status.
The need for standards
In 2010, Google published a paper about Dapper, a large-scale distributed systems tracing infrastructure. Some called it Google’s secret weapon for understanding their architectures and debugging complex request traces.
Dapper later inspired the open source project Zipkin, which originated at Twitter. AWS jumped on the distributed tracing bandwagon and introduced AWS X-Ray. Uber liked distributed tracing as well, but didn’t like some decisions taken in Zipkin, so they created Jaeger.
Each of these tools introduced its own headers for propagating span context, so that a trace can be followed as it makes its way through an architecture consisting of many services:
x-b3-sampled: 1
x-b3-spanid: 9090044efa29991f
x-b3-traceid: de5a14e8f69781b31299211599430ed2
x-amzn-trace-id: Root=1-de5a14e8-f69781b31299211599430ed2;Parent=9090044efa29991f;Sampled=1
uber-trace-id: de5a14e8f69781b31299211599430ed2:9090044efa29991f:0:01
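The three header styles above all carry the same information: a trace ID, a span ID and a sampling decision. The following sketch shows how they relate by building each of them from the same values. The function names are my own and the layouts simply mirror the sample headers above, so treat it as an illustration rather than a complete implementation of each format.

```python
from typing import Dict


def b3_headers(trace_id: str, span_id: str, sampled: bool) -> Dict[str, str]:
    """Zipkin B3 multi-header style: one header per field."""
    return {
        "x-b3-traceid": trace_id,
        "x-b3-spanid": span_id,
        "x-b3-sampled": "1" if sampled else "0",
    }


def xray_header(trace_id: str, span_id: str, sampled: bool) -> Dict[str, str]:
    """AWS X-Ray style: the 32 hex digit trace ID is split into 8 + 24 digits.

    Note: in real X-Ray trace IDs the first 8 hex digits encode a timestamp;
    here they are just the leading part of the given trace ID.
    """
    root = f"1-{trace_id[:8]}-{trace_id[8:]}"
    flag = "1" if sampled else "0"
    return {"x-amzn-trace-id": f"Root={root};Parent={span_id};Sampled={flag}"}


def jaeger_header(trace_id: str, span_id: str, sampled: bool) -> Dict[str, str]:
    """Jaeger style: trace-id:span-id:parent-span-id:flags (parent 0 = none)."""
    flags = "01" if sampled else "00"
    return {"uber-trace-id": f"{trace_id}:{span_id}:0:{flags}"}
```

Calling all three with the trace ID `de5a14e8f69781b31299211599430ed2` and span ID `9090044efa29991f` reproduces the sample headers shown above.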
Zipkin later introduced a single header standard as well:
b3: …
Finally, some people were annoyed by the variety of propagation formats, so they created a W3C recommendation called Trace Context:
traceparent: 00-de5a14e8f69781b31299211599430ed2-9090044efa29991f-01
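The `traceparent` value packs four dash-separated fields into one header: version, trace ID, parent (span) ID and flags. Below is a small parsing sketch, assuming only the version `00` layout shown above. `TraceParent` and `parse_traceparent` are names chosen for this example, and a production parser would need to follow the full W3C rules (for instance around future versions).

```python
import re
from typing import NamedTuple, Optional

# version 00 - 32 hex trace ID - 16 hex parent ID - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Parse a version 00 traceparent header value, or return None if invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are not valid identifiers.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of the flags field is the "sampled" flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(trace_id, parent_id, sampled)
```

Returning `None` for anything malformed keeps the caller simple: if the header cannot be parsed, start a new trace instead of continuing a broken one.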
However, the software APIs to instrument your code and to collect those signals were still the Wild West. So, OpenTelemetry was formed to come to the rescue.
Instrumentation
Before OpenTelemetry, you could choose to use a library to instrument your code to gather tracing and metrics data and collect it with your vendor’s tool. For example, for tracing you could use the Zipkin libraries, Jaeger libraries, AWS X-Ray libraries or possibly others.
Now, these often provided compatibility with other tools; however, you still depended on a single vendor. And to be fair, libraries such as Spring Cloud Sleuth abstracted the vendor library for you, but you still needed to include it in your application.
Nowadays, Zipkin and Jaeger have deprecated their instrumentation libraries and recommend the usage of OpenTelemetry instrumentation.
Collection
Before OpenTelemetry, you had to send your metrics, traces and logs to separate vendor-specific tools.
For collecting tracing data, vendors introduced these formats:
- Zipkin Format
- Protobuf
- Thrift
And transported the data over these:
- HTTP(S)
- gRPC
- UDP
While the transports probably all have a justified use case, the formats should be unified. So, OpenTelemetry introduced the OpenTelemetry Protocol (OTLP). It carries not only tracing data, but logs and metrics as well.
Once the signals were collected by a vendor’s tool, it was hard to get them out again and use them for other interesting purposes. In a large organization, it might be useful to send traces to a team-specific tool, but also to collect all traces produced by all teams in order to generate so-called RED metrics.
- R - Requests: Traffic, Throughput, Rate
- E - Errors: Error Counts, Error Rate, Failed Calls
- D - Durations: Latency, Elapsed Time
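To illustrate how RED metrics can be derived from trace data, here is a rough sketch that aggregates a batch of spans. The `SpanRecord` fields and the `red_metrics` function are assumptions made for this example; real spans carry far more attributes, and you would typically look at percentiles rather than only the median.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class SpanRecord:
    """Hypothetical, heavily simplified view of a finished span."""
    service: str
    duration_ms: float
    is_error: bool


def red_metrics(spans: List[SpanRecord], window_seconds: float) -> Dict[str, float]:
    """Compute request rate, error rate and median duration for one time window."""
    if not spans:
        return {"request_rate": 0.0, "error_rate": 0.0, "median_duration_ms": 0.0}
    errors = sum(1 for span in spans if span.is_error)
    return {
        "request_rate": len(spans) / window_seconds,  # R: requests per second
        "error_rate": errors / len(spans),            # E: share of failed calls
        "median_duration_ms": median(span.duration_ms for span in spans),  # D
    }
```

Because the calculation only needs the spans themselves, it can run anywhere the traces of all teams come together, independent of the tool each team uses to look at its own traces.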
With the OpenTelemetry Collector in between, this becomes a breeze, as you can define pipelines in the collector which send your data to different destinations downstream. If you ever want to evaluate a new monitoring tool, you simply forward your data there, but continue using your existing monitoring tools until you switch over. Or maybe the tools are complementary and you decide to use multiple tools that generate insights based on the same source data.
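The fan-out idea behind such a pipeline can be sketched in plain Python. This is only a conceptual model with names I invented (`run_pipeline`, `drop_health_checks`, `team_tool`, `org_wide`); the real Collector is configured declaratively rather than written as code like this.

```python
from typing import Callable, Dict, List

Batch = List[Dict[str, str]]
Processor = Callable[[Batch], Batch]
Exporter = Callable[[Batch], None]


def run_pipeline(batch: Batch, processors: List[Processor], exporters: List[Exporter]) -> None:
    """Pass a batch through every processor, then hand it to every exporter."""
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(list(batch))  # each destination receives its own copy of the list


def drop_health_checks(batch: Batch) -> Batch:
    """Example processor: filter out noisy health check spans."""
    return [span for span in batch if span["name"] != "GET /health"]


# Two in-memory "destinations" standing in for a team tool and an org-wide store.
team_tool: Batch = []
org_wide: Batch = []

run_pipeline(
    [{"name": "GET /orders"}, {"name": "GET /health"}],
    processors=[drop_health_checks],
    exporters=[team_tool.extend, org_wide.extend],
)
```

Adding a new monitoring tool for evaluation then amounts to appending one more exporter to the list, while the existing destinations keep receiving the same data.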
Upcoming
This post was a first introduction to OpenTelemetry. I am planning to release a series of posts around this topic.
- Introduction to OpenTelemetry (this article)
- Introduction to Distributed Tracing
- OpenTelemetry Collector Components (upcoming)
- OpenTelemetry Instrumentation (upcoming)
- Instrumenting React.js applications (upcoming)
- Instrumenting Angular applications (upcoming)
- Full Stack OpenTelemetry Example (upcoming)