Why we use OpenTelemetry

At MadeCurious we are starting to roll OpenTelemetry (OTEL) out across our applications. The initial driver was a review of observability platforms, prompted by a desire for a central location for logs, metrics, and monitoring. As part of that review we discovered that most, if not all, of the vendors were moving to support OTEL as a means of gathering and collecting signals.

Given that we could replace the multiple systems we were using previously with a single one, and do so with a vendor-neutral tool, we took this as a signal (pun intended) that we should adopt OTEL as our preferred choice.

But wait, what is OpenTelemetry?

OpenTelemetry is an observability framework and toolkit that allows application developers to generate signals that report on the health and status of distributed applications. These signals are generated by instrumenting the application with the API and SDK for the language it is written in. Signals can also be collected from other sources, including but not limited to OS metrics and logs.

The signals that OTEL generates are:

  • Metrics
  • Logs
  • Traces

These are broadly known as the three pillars of observability.

Metrics

These represent the state of the application or system at the time they are generated. By collecting and aggregating them over time we can build up a picture of the behaviour of whatever the metric represents.

Because metrics can represent different kinds of state, each needs a type that best reflects its use. For example, in a car the speedometer is a gauge that can go up and down, while the odometer is a counter of the km the vehicle has travelled. OTEL also offers another type of metric, the histogram, which counts how values are distributed across a number of buckets.
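To make the three metric types concrete, here is a minimal sketch in plain Python. It is a toy model for illustration only and is not the OTEL SDK: the class names, the bucket boundaries, and the sample values are all assumptions made up for this example.

```python
from bisect import bisect_left


class Counter:
    """Only ever goes up, like an odometer."""

    def __init__(self) -> None:
        self.value = 0.0

    def add(self, amount: float) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Holds the latest reading, like a speedometer."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, reading: float) -> None:
        self.value = reading


class Histogram:
    """Counts how many recorded values fall into each bucket."""

    def __init__(self, boundaries: list[float]) -> None:
        self.boundaries = sorted(boundaries)
        # One bucket per upper boundary, plus a final overflow bucket.
        self.counts = [0] * (len(self.boundaries) + 1)

    def record(self, value: float) -> None:
        # bisect_left puts a value equal to a boundary into that boundary's bucket.
        self.counts[bisect_left(self.boundaries, value)] += 1


odometer = Counter()
speedometer = Gauge()
latency_ms = Histogram([100, 250, 500])  # hypothetical bucket boundaries

odometer.add(12.5)
odometer.add(7.5)
speedometer.set(50)
speedometer.set(30)
for sample in (40, 100, 180, 320, 900):
    latency_ms.record(sample)

print(odometer.value)     # 20.0
print(speedometer.value)  # 30
print(latency_ms.counts)  # [2, 1, 1, 1]
```

The counter accumulates, the gauge keeps only the most recent reading, and the histogram keeps a count per bucket rather than every individual value, which is what makes it cheap to aggregate over time.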

Logs

A log is a timestamped record of information about a system. With OTEL these logs can be enriched with additional information that allows events to be cross-linked or correlated. This lets us look at, for example, all the events for a single request from an end-user.
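The idea of correlation can be sketched with nothing more than Python's standard logging module. This is not how OTEL itself is wired up. It only illustrates the principle of stamping every log record with a shared identifier. The logger name, the handle_request function, and the use of a random UUID as a stand-in trace ID are all assumptions for the example.

```python
import contextvars
import io
import logging
import uuid

# Stand-in for the identifier that a tracing library would supply per request.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


buffer = io.StringIO()  # capture output in memory so it can be searched below
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(order_id: str) -> str:
    """Pretend to serve one request, logging under a fresh trace ID."""
    trace_id = uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logger.info("received order %s", order_id)
        logger.info("charged card for order %s", order_id)
    finally:
        current_trace_id.reset(token)
    return trace_id


first = handle_request("A1")
second = handle_request("B2")

lines = buffer.getvalue().splitlines()
# All events for a single request can now be pulled out by its ID.
first_request_lines = [line for line in lines if line.startswith(first)]
print("\n".join(first_request_lines))
```

Filtering on the shared ID returns only the two log lines that belong to the first request, even though both requests wrote to the same log.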

Traces

Traces are similar to logs, but they carry more contextual information and also capture state and latency/duration. A trace is constructed from many smaller units called spans, which are correlated by a parent-child relationship. Traces are expensive because they are big-picture records of how a request moved through a system. However, with this cost comes a more complete view of the state and behaviour of that system.

So why OpenTelemetry?

The why depends on the lens through which you look at the applications that we write and maintain.

From a developer's view, it means being able to look at a trace for a failing request and see that the source of the failure was deep down in the application stack, along with the line or function involved and what triggered it. Because they no longer need to spend significant amounts of time and headspace on investigation, they are able to spend more time doing the fun stuff of actually writing code.

From an operational view, being able to collect and aggregate metrics over an extended period of time makes it possible to identify trends in behaviour. These trends can then be acted upon to improve many aspects of a system, for example by identifying bottlenecks that can be addressed to reduce latency or cost.

In the early days of adopting OTEL at MadeCurious there was scepticism from some quarters as to the benefits. It hasn't taken long for these opinions to change.

What reduced the scepticism was developers being able to use OTEL to address some long-standing bugs and performance issues that had been lurking for quite some time (read: years). Also, by bringing the metrics, logs, and traces into one back-end, we are able to reduce the amount of time and effort needed to identify problems that we were unaware of previously.