OpenTelemetry adoption is growing rapidly, and more and more companies rely on OpenTelemetry to collect observability data. While OpenTelemetry offers clear specifications and semantic conventions to guide telemetry data collection, it also introduces significant flexibility. With high flexibility comes high responsibility: many things can go wrong with OTel-based data collection, easily resulting in mediocre or low-quality telemetry. Poor data quality can hinder backend analysis, confuse users, and degrade system performance. To unlock actionable insights from OpenTelemetry data, maintaining high data quality is essential. The Instrumentation Score initiative addresses this challenge by providing a standardized way to measure OpenTelemetry data quality. Although the specification and tooling are still evolving, the underlying concepts are already compelling. In this blog post, I’ll share my experience experimenting with the Instrumentation Score concept and demonstrate how to use the Elastic Stack — utilizing ES|QL, Kibana Task Manager, and Dashboards — to build a POC for data quality analysis based on this approach within Elastic Observability.
Instrumentation Score - The Power of Rule-based Data Quality Analysis
When you first hear the term "Instrumentation Score", your initial reaction might be: "OK, there's a single, percentage-like metric that tells me my instrumentation (i.e. OTel data) has a score of 60 out of 100. So what? How does it help me?"
However, the Instrumentation Score is much more than just a single number.
Its power lies in the individual rules from which the score is calculated.
The rule definitions' rationale, impact level, and criteria provide an evaluation framework that enables you to drill down into data quality issues and identify specific areas for improvement.
Also, the Instrumentation Score specification does not mandate specific tools and implementation details for calculating the score and rule evaluations.
As I explored the Instrumentation Score concepts, I developed the following mental model for deriving actionable insights.
The Score
The score itself is an indicator of the quality of your telemetry data. The lower the number, the more room there is for improvement in your data quality. As a rule of thumb, if a score falls below 75, you should consider fixing your instrumentation and data collection.
Breakdown by Instrumentation Score Rules
Exploring the evaluation results of individual Instrumentation Score rules will give you insights into what is wrong with your data quality. In addition, the rules' rationales explain why the violation of a rule is problematic.
As an example, let's take the SPA-002 rule:
Description:
Traces do not contain orphan spans.
Rationale:
Orphaned spans indicate potential issues in tracing instrumentation or data integrity. This can lead to incomplete or misleading trace data, hindering effective troubleshooting and performance analysis.
If your data violates the SPA-002 rule, you know what is wrong (i.e. you have broken traces), and the rationale explains why that is an issue (i.e. degraded analysis capabilities).
Breakdown by Services
When you have a large system with hundreds or maybe even thousands of entities (such as services, Kubernetes pods, etc.), a binary signal on all of the data — such as "has a certain rule been passed or not" — is not really actionable. Is the data from all services violating a certain rule, or just a small subset of services?
Breaking down rule evaluation by services (and potentially other entity types) may help you to identify where there are issues with data quality.
For example, let's assume that only one of your fifty services, the cart-service, is affected by a violation of rule SPA-002.
With that information, you can focus on fixing the instrumentation for the cart-service instead of having to check all fifty services.
Once you know which services (or other entities) violate which Instrumentation Score rules, you're very close to actionable insights. However, there are two more things that I found to be extremely useful for data quality analysis when I was experimenting with the Instrumentation Score evaluation: (1) a quantitative indication of the extent, and (2) concrete examples of rule violation occurrences in your data.
Quantifying the Rule Violation Extent
The Instrumentation Score spec already defines an impact level (e.g. NORMAL, IMPORTANT, CRITICAL) per rule.
However, this only covers the "importance" of the rule itself, not the extent of a rule violation.
For example, if a single trace (out of a million traces) on your service has an orphan span, technically speaking the rule SPA-002 is violated.
But is it really a relevant issue if only one out of a million traces is affected? Probably not. It definitely would be if half of your traces were broken.
Hence, having a quantitative indication of the extent of a rule violation per service — e.g. "40% of your traces violate SPA-002" — would provide additional information on how severe a rule violation actually is.
Tangible Examples
Finally, nothing is as meaningful and self-explanatory as tangible, concrete examples from your own data.
If the telemetry data of your cart-service violates SPA-002 (i.e., has traces with orphan spans), wouldn't you want to see a concrete trace from that service that demonstrates the rule violation?
Analyzing concrete examples may give you hints about the root cause of broken traces — or, more generally, why your data violates Instrumentation Score rules.
Instrumentation Score with Elastic
The Instrumentation Score spec does not prescribe tool usage or implementation details for the calculation of the score and evaluation of the rules. This allows for integrating the Instrumentation Score concept with whatever backend your OpenTelemetry data is being sent to.
With the goal of building a POC for an end-to-end integration of the Instrumentation Score with Elastic Observability, I combined the powerful capabilities of ES|QL with Kibana's task manager and dashboarding features.
Each Instrumentation Score rule can be formulated as an ES|QL query that covers the steps described above:
- rule passed or not
- breakdown by services
- calculation of the extent
- sampling of an example occurrence
Here is an example query for the LOG-002 rule that checks the validity of the severity_number field:
FROM logs-*.otel-* METADATA _id
| WHERE data_stream.type == "logs"
AND @timestamp > NOW() - 1h
| EVAL no_sev = severity_number IS NULL OR severity_number == 0
| STATS
logs_wo_severity = COUNT(*) WHERE no_sev,
example = SAMPLE(_id, 1) WHERE no_sev,
total = COUNT(*)
BY service.name
| EVAL rule_passed = (logs_wo_severity == 0),
  extent = CASE(total != 0, TO_DOUBLE(logs_wo_severity) / total, 0.0) // cast to avoid integer division
| KEEP rule_passed, service.name, example, extent
These rule evaluation queries are wrapped in a Kibana instrumentation-score plugin that utilizes the task manager for regular execution.
The instrumentation-score plugin then takes the results from all the evaluation queries for the different rules and calculates the final instrumentation score value (overall and broken down by service) following the Instrumentation Score spec's calculation formula.
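To make the calculation concrete, here is a minimal Python sketch of a weighted score in the spirit of the spec's formula. The impact weights and the result format are assumptions for illustration, not the spec's official values:

```python
# Illustrative sketch of a weighted Instrumentation Score calculation.
# NOTE: the impact weights below are assumptions for this example,
# not the official values from the Instrumentation Score spec.
IMPACT_WEIGHTS = {"NORMAL": 1, "IMPORTANT": 2, "CRITICAL": 4}

def instrumentation_score(rule_results):
    """rule_results: list of dicts like
    {"rule": "SPA-002", "impact": "CRITICAL", "passed": False, "applicable": True}
    Returns a 0-100 score over the applicable rules, or None if none apply."""
    applicable = [r for r in rule_results if r["applicable"]]
    if not applicable:
        return None  # no applicable rules -> no score for this entity
    total = sum(IMPACT_WEIGHTS[r["impact"]] for r in applicable)
    achieved = sum(IMPACT_WEIGHTS[r["impact"]] for r in applicable if r["passed"])
    return round(100 * achieved / total)

results = [
    {"rule": "SPA-002", "impact": "CRITICAL", "passed": False, "applicable": True},
    {"rule": "LOG-002", "impact": "IMPORTANT", "passed": True, "applicable": True},
    {"rule": "MET-001", "impact": "NORMAL", "passed": True, "applicable": True},
]
print(instrumentation_score(results))  # 3 of 7 weight points achieved -> 43
```

Running the same function per service (over that service's rule results) and once over all rule results yields the per-service and overall scores, respectively.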
The resulting instrumentation score values, as well as the rule evaluation results (with the examples and extent) are then stored in separate Elasticsearch indices for consumption.
With the results stored in dedicated Elasticsearch indices, we can build Dashboards to visualize the Instrumentation Score insights and allow users to troubleshoot their data quality issues.
In this POC, I implemented a subset of the Instrumentation Score rules to prove out the approach.
The Instrumentation Score concept accommodates extension with your own custom rules. I did that in my POC as well to test some quality rules that are not yet formalized as rules in the Instrumentation Score spec, but are important for Elastic Observability to provide the maximum value from the OTel data.
Applying the Instrumentation Score on the OpenTelemetry Demo
The OpenTelemetry Demo is the most-used environment to play around with and showcase OpenTelemetry capabilities. Initially, I thought the demo would be the worst environment to test my Instrumentation Score implementation. After all, it's the showcase environment for OpenTelemetry, and I expected it to have an Instrumentation Score close to 100. Surprisingly, that wasn't the case.
Let's start with the overview.
The Overview
This dashboard shows an overview of the Instrumentation Score results for the OpenTelemetry Demo environment.
The first thing you might notice is the very low overall score of 35 (top-left corner).
The table in the bottom-left corner shows a breakdown of the score by services.
Somewhat surprisingly, all the service scores are higher than the overall score.
How is that possible?
The main reason is that Instrumentation Score rules have, by definition, a binary result: passed or failed. So it can happen that each service fails only a single rule, but a distinct one per service. Each individual service score is then not perfect, but also not too bad. From the overall perspective, however, many rules have failed (each by a different service), leading to a very low overall score.
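A toy example (assuming equal rule weights for simplicity) makes this effect easy to see:

```python
# Toy illustration (assumed equal rule weights) of how every per-service
# score can exceed the overall score: each service fails one distinct rule,
# so each service passes 10 of 11 rules, while overall 7 of 11 rules fail.
rules = [f"R{i}" for i in range(11)]
# hypothetical: services s0..s6 each fail a different rule
failures = {f"s{i}": {f"R{i}"} for i in range(7)}

def score(failed, total=len(rules)):
    return round(100 * (total - len(failed)) / total)

per_service = {svc: score(f) for svc, f in failures.items()}
overall_failed = set().union(*failures.values())  # rules failed by any service
print(per_service["s0"])      # each service scores 91
print(score(overall_failed))  # the overall score is only 36
```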
In the table on the right, we see the results for the individual rules with their description, impact level, and example occurrences.
We see that 7 out of 11 implemented rules have failed. Let's pick our favorite example from earlier — SPA-002 (in row 5), the orphan spans rule.
With the dashboard indicating that the rule SPA-002 has failed, we know that there are orphan spans somewhere in our OTel traces. But where exactly?
For further analysis, we have two ways to drill down: (1) into a specific rule to see which services violate a specific rule, or (2) into a specific service to see which rules are violated by that service.
Rule Drilldown
The following dashboard shows a detailed view into the rule evaluation results for individual rules.
In this case we selected rule SPA-002 at the top.
In addition to the rule's meta information, such as its description, rationale, and criteria, we see some statistics on the right.
For example, we see that 2 services have failed that rule, 16 passed, and for 19 services this rule is not applicable (e.g., because those don't have tracing data).
In the table below, we see which two services are impacted by this rule violation: the frontend and frontend-proxy services.
For each service, we also see the extent. In the case of the frontend service, around 20% of traces have orphan spans.
This information is crucial as it gives an indication of how severe the rule violation actually is.
If it had been under 1%, this problem might have been negligible, but with one trace out of five being broken, it definitely needs to be fixed.
Also, for each of the services, we get an example span.id that is referenced as parent.id by other spans but for which no span could be found.
This allows us to perform further analyses (e.g., by investigating the referring spans in Kibana's Discover) on concrete example cases.
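For illustration, the orphan-span check behind SPA-002 boils down to a parent-reference lookup. Here it is sketched in Python on simplified span records (the field names span_id / parent_id / trace_id are assumptions for this sketch):

```python
from collections import defaultdict

def find_orphans(spans):
    """Return (trace_id, span_id, missing parent_id) tuples for spans whose
    parent_id references no span within the same trace."""
    ids_by_trace = defaultdict(set)
    for s in spans:
        ids_by_trace[s["trace_id"]].add(s["span_id"])
    return [
        (s["trace_id"], s["span_id"], s["parent_id"])
        for s in spans
        if s.get("parent_id") and s["parent_id"] not in ids_by_trace[s["trace_id"]]
    ]

spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None},  # root span
    {"trace_id": "t1", "span_id": "b", "parent_id": "a"},   # intact child
    {"trace_id": "t1", "span_id": "c", "parent_id": "zz"},  # orphan: no span "zz"
]
print(find_orphans(spans))  # [('t1', 'c', 'zz')]
```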
With that view, we now know that the frontend service has a significant share of broken traces.
But is that service also violating other rules? And, if yes, which?
Service Drilldown
To answer the above question we can switch to the Per Service Dashboard.
In this dashboard, we see similar information as on the overview dashboard, however, filtered on a single selected service (e.g., frontend service in this example).
In the table, we see that the frontend service violates three rules. We already know about SPA-002 from the previous section.
In addition, the violation of the custom rule SPA-C-001 shows that around 99% of transaction span names have high cardinality.
In Elastic Observability, transactions refer to service-local root spans (i.e., entry points into services).
In the example value, we see directly why the span.names (here referred to as transaction.names) have high cardinality.
The span name contains unique identifiers (here, the session ID) as part of the URL from which the span name is constructed in the instrumentation.
As the EDOT Collector derives metrics from transaction-type spans, we can also observe a violation of the MET-001 rule, which requires bounded cardinality on metric dimensions.
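A common remediation for such cardinality issues is to normalize span names by replacing ID-like URL segments with a placeholder. Here is a hypothetical Python sketch of that idea; the regex and the function name are illustrative assumptions, not part of any OTel or EDOT component:

```python
import re

# Hypothetical sketch of span-name normalization: replace ID-like path
# segments (UUIDs, long hex/session tokens) with a placeholder to bound
# cardinality. The pattern below is an assumption for illustration.
ID_SEGMENT = re.compile(r"^[0-9a-fA-F-]{8,}$")

def normalize_span_name(name):
    parts = name.split("/")
    return "/".join("{id}" if ID_SEGMENT.match(p) else p for p in parts)

print(normalize_span_name("GET /session/4f2a9c1e-77b3-4d2a-9f1e-aaaa0000bbbb/cart"))
# GET /session/{id}/cart
```

Applying such normalization in the instrumentation (or in a collector processor) keeps both the span names and the derived metric dimensions bounded.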
As you can see, with the Instrumentation Score concept and a few different breakdown views, we were able to pinpoint data quality issues and identify which services and instrumentations need improvement to fix the issues.
Learnings and Observations
My experimentation with the Instrumentation Score was very insightful and showed me the power of this concept — though it's still in its early phase. It is particularly insightful if the implementation and calculation include breakdowns by meaningful entities, such as services, K8s pods, hosts, etc. With such a breakdown, you can narrow down data quality issues to a manageable scope, instead of having to sift through huge amounts of data and entities.
Furthermore, I realized that having some notion of problem extent (per rule and service), as well as concrete examples, helps make the problem more tangible.
Thinking further about the idea of rule violation extent, there might even be a way to incorporate that into the score formula itself.
In my humble opinion, this would make the score significantly more comparable and indicative of the actual impact.
I proposed this idea in an issue on the Instrumentation Score project.
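To make the idea concrete, here is one possible, purely illustrative shape for an extent-aware formula. It is not part of the spec, and the weights are assumptions: instead of a binary pass/fail, each rule contributes its weight scaled by the fraction of data that passes.

```python
# Purely illustrative sketch (NOT the spec) of an extent-aware score:
# each rule contributes weight * (1 - extent), where extent is the
# fraction of data violating the rule. Weights are assumed values.
IMPACT_WEIGHTS = {"NORMAL": 1, "IMPORTANT": 2, "CRITICAL": 4}

def extent_aware_score(rule_results):
    """rule_results: [{"impact": ..., "extent": fraction violating the rule}]"""
    total = sum(IMPACT_WEIGHTS[r["impact"]] for r in rule_results)
    achieved = sum(IMPACT_WEIGHTS[r["impact"]] * (1 - r["extent"])
                   for r in rule_results)
    return round(100 * achieved / total)

# A rule violated by a negligible fraction of traces barely moves the
# score, while one violated by half the traces does:
print(extent_aware_score([{"impact": "CRITICAL", "extent": 0.000001},
                          {"impact": "NORMAL", "extent": 0.0}]))  # 100
print(extent_aware_score([{"impact": "CRITICAL", "extent": 0.5},
                          {"impact": "NORMAL", "extent": 0.0}]))  # 60
```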
Conclusion
The Instrumentation Score is a powerful approach to ensuring a high level of data quality with OpenTelemetry.
Thank you to the maintainers — Antoine Toulme, Daniel Gomez Blanco, Juraci Paixão Kröhling, and Michele Mancioppi — for bringing this great project to life, and to all the contributors for their participation!
With proper implementation of the rules and score calculation, users can easily get actionable insights into what they need to fix in their instrumentation and data collection. The Instrumentation Score rules are in an early stage and are steadily improved and extended. I'm looking forward to what the community will build in the scope of this project in the future, and I hope to intensify my contributions as well.