Modifying Metrics in OpenTelemetry

Sometimes a metric is really useful but not in the right format to be fully usable. A good example is the set of cert-manager metrics that report when a certificate is going to expire and when it is going to be renewed. When these metrics are scraped via Prometheus the value is a Unix epoch timestamp (the number of seconds since 1970-01-01T00:00:00Z), which isn't normally a problem. But there are times when you need to perform comparisons, and the tool that you are using needs an ISO formatted date.

The usual pattern in Elastic is to create an ingest pipeline that parses the field. This is a nice, centralised way to do it, but it means that every incoming metric has to be inspected to see whether it needs to be updated. While this would seem to be the simplest approach, it adds processing overhead when millions of metrics are being sent.

However, we have a distributed tool that can do the processing for us: the OpenTelemetry Collector. By using the collector we can move the smarts (and the overhead) to the source of the metrics, and avoid adding complexity and overhead at the central location.

The collector has a transform processor which can be added to a pipeline to modify signals (in this case metrics) after they are received and before they are exported. The transform processor is built on top of the OpenTelemetry Transformation Language (OTTL), a domain-specific language for processing signals using OpenTelemetry-native concepts and constructs.
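As a rough sketch of where the transform processor sits, the fragment below wires it between a receiver and an exporter. The Prometheus receiver, the job name, the scrape target and the debug exporter are all assumptions made for illustration, not details from this post. Only the transform statement itself is taken from the examples further down.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: cert-manager            # hypothetical job name
        static_configs:
        - targets: ["cert-manager.cert-manager.svc:9402"]  # assumed address and port

processors:
  transform:
    metric_statements:
    - context: metric
      conditions:
      - metric.name == "certmanager.certificate.expiration.timestamp.seconds"
      statements:
      - set(metric.unit, "s")

exporters:
  debug:                                  # assumed exporter, handy for inspecting the result
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [transform]             # runs after receiving, before exporting
      exporters: [debug]
```

Swapping the debug exporter for the real destination should leave the processor section unchanged.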

A previous post showed how to transform the name of a metric, but with OTTL more complex transformations can be applied. These transformations can also be restricted to groups of metrics, or even to specific metrics, using conditions. Furthermore, we can transform specific parts of a metric by limiting the context to the section of the metric that needs to be modified.

transform:
  metric_statements:
  - conditions:
    - metric.name == "certmanager.certificate.renewal.timestamp.seconds"
    - metric.name == "certmanager.certificate.expiration.timestamp.seconds"
    context: metric
    statements:
    - set(metric.unit, "s")
    - replace_pattern(name, "(.+).seconds", "$$1")

In the above block we are restricting the scope of the statements using conditions: and context:. The entries in the conditions block are ORed together (thing1 || thing2), not ANDed, so the statements run if either name matches. The metric context works on the metric as a whole (its name, unit and so on); another context for metrics is datapoint, which works on the individual datapoints inside a metric and is covered further down.
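To make the OR behaviour explicit, the same restriction can be sketched with a where clause on the statement itself instead of a conditions block. This is an alternative spelling I'm assuming to be equivalent for this case, not the configuration used in this post:

```yaml
transform:
  metric_statements:
  - context: metric
    statements:
    # Applies when either name matches, mirroring the two ORed conditions.
    - set(metric.unit, "s") where metric.name == "certmanager.certificate.renewal.timestamp.seconds" or metric.name == "certmanager.certificate.expiration.timestamp.seconds"
```

The conditions block is easier to read once several statements share the same restriction, since a where clause has to be repeated on every statement.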

The first statement, set(metric.unit, "s"), sets the unit for the metric. I'm doing this here because in OpenTelemetry metrics have a unit field, whereas in Prometheus the units are either implied or encoded in the metric name. Once the unit is held in a standard format on the metric itself, we can remove it from the name of the metric, which leads on to the final statement.

In replace_pattern(name, "(.+).seconds", "$$1"), a regex is applied to the metric's name. The group (.+) captures everything up to the final .seconds, and the name is then replaced with just that captured group, dropping the suffix. The replacement is written as $$1 because a literal $ has to be escaped as $$ in the collector config. Note that the . before seconds is unescaped, so it matches any character rather than only a literal dot; that is harmless for these two names.

transform:
  metric_statements:
  - conditions:
    - metric.name == "certmanager.certificate.renewal.timestamp"
    - metric.name == "certmanager.certificate.expiration.timestamp"
    context: datapoint
    statements:
    - set(attributes["datetime"], FormatTime(Unix(Int(value_double)), "%Y-%m-%d %H:%M:%S"))

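Now that the metric unit and name have been transformed we can move on to the timestamp. The conditions now match the new names, without the .seconds suffix, because the earlier statements have already renamed the metrics. This time the context is datapoint, which restricts the scope of our change to the metric's datapoints. A metric holds a list of datapoints, each of which can carry its own attributes; in our case we only have one datapoint. I've not seen any metrics with multiple datapoints as yet, so I can't comment on how this would behave.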

Decomposing the statement: we're setting a datapoint attribute called datetime to the value returned by FormatTime(), which formats the time produced by Unix() using the given layout. Unix() converts a number of seconds since the epoch into a time, but it needs an integer, so the datapoint's value (value_double, a float) is first cast with Int().
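Putting the two blocks together, a minimal sketch of a single transform processor might look like the following. It assumes both groups live under one processor and that they run in the order listed, which is what lets the second group match the already renamed metrics. The statements are the ones from this post and only the comments are new.

```yaml
processors:
  transform:
    metric_statements:
    # 1. Metric level: set the unit, then strip the ".seconds" suffix from the name.
    - context: metric
      conditions:
      - metric.name == "certmanager.certificate.renewal.timestamp.seconds"
      - metric.name == "certmanager.certificate.expiration.timestamp.seconds"
      statements:
      - set(metric.unit, "s")
      - replace_pattern(name, "(.+).seconds", "$$1")
    # 2. Datapoint level: add a formatted copy of the epoch value as an attribute.
    #    The conditions use the names produced by step 1.
    - context: datapoint
      conditions:
      - metric.name == "certmanager.certificate.renewal.timestamp"
      - metric.name == "certmanager.certificate.expiration.timestamp"
      statements:
      - set(attributes["datetime"], FormatTime(Unix(Int(value_double)), "%Y-%m-%d %H:%M:%S"))
```

If the consuming tool insists on a strict ISO 8601 layout, a format string with a T separator such as "%Y-%m-%dT%H:%M:%S" may be worth trying, though I have not verified that variant here.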