-
Notifications
You must be signed in to change notification settings - Fork 164
Instrumentation layers and suppressing duplicates #172
Changes from all commits
5641953
0f7e292
273b955
8b81638
d655c9d
9adb92b
27d99cd
ee85a3c
f365dbe
3794ad0
75d5126
bd3eab0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,222 @@ | ||
# Instrumentation Layers and Suppression | ||
|
||
This document describes approach for tracing instrumentation layers and suppressing duplicate layers. | ||
|
||
## Motivation | ||
|
||
- Provide clarity for instrumentation layers: e.g. DB calls on top of REST API | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Related to the questions about a definition of layers: there are discussing about actually modelling HTTP request with multiple spans (for e. g. retries or forwards). Should we aim to cover those cases with a suppression proposal? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nope, for now, I assume we have one layer of HTTP spans (if we'll have more for logical + physical, I'd treat them as two different layers) - they can't carry the same information when both are enabled. HTTP instrumentation would need configuration to emit 1) logical or 2) physical or 3) both to avoid redundancy and doubling of costs. I don't see a way to achieve it with suppression |
||
- Give mechanism to suppress instrumentation layers for the same convention: e.g. multiple instrumented HTTP clients using each other. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why are there cases of multiple instrumentation HTTP clients using each other? That sounds like an odd configuration of clients. I wonder if when combining these clients, one can disable instrumentation of the other? The DB call on top fo a REST call -- you propose to suppress the REST instrumentation, I take it? (Could the DB be configured with an un-instrumented REST client, instead?) I wonder if it would be possible to replace There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I suggest that
I'd love that, however There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Experimented with this idea: instead of SpanBuilder.setType, I added 'type' parameter to library instrumentation info when tracer is being built:
I updated the PR lmolkova/opentelemetry-java#1. Will bring it up at the Tue Instrumentation SIG meeting. I also entertained the idea of having a single tracer per type and returning noop tracers if one with the same type is registered - but it depends on the tracer creation time and is not stable. |
||
|
||
## Explanation | ||
|
||
### Spec changes proposal | ||
|
||
- Tracing Semantic Conventions: Each span MUST follow a single (besides general and potentially composite), convention, specific to the call it describes. It MUST specify it when creating a span. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As part of this, we'd also need to define which semantic conventions are composable and which are not. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Some semantic conventions define it (e.g. Couch DB wants to have HTTP attributes on DB spans). Unless it's already explicitly defined, we say conventions are not composable. |
||
- Tracing Semantic Conventions: Client libraries instrumentation MUST make context current to enable correlation with underlying layers of instrumentation | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I assume this doesn't work for manual parenting (without using context). However, I also assume that's not a big deal, as context will be used anyway to parent spans across library boundaries? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It should work just fine with context propagated manually. The suppression should be done through key on the context and how it's propagated is done is irrelevant. However, I expect the duplication problem to heavily affect magical instrumentation. Apps that do most of the things manually have enough control to prevent it without |
||
- Trace API: Add `InstrumentationType` expandable enum with predefined values for known trace conventions (HTTP, RPC, DB, Messaging (see open questions)). *Type* is just a convention name. | ||
- Trace SDK: Add span creation option to set `InstrumentationType` | ||
- During span creation, checks if span should be suppressed (if there is another one for kind + type on the parent `Context`) and if there is, returns a *suppressed* span, which is | ||
- non-recording | ||
- propagating (carries parent context) | ||
- does not become current (i.e. `makeCurrent` call with it is no-op) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there ever a case that a user may want this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Suppressed spans are never reported and are basically duplicates. If we'd make them current, following may happen:
|
||
- OTel SDK SHOULD allow suppression strategy configuration | ||
- suppress nested by kind (e.g. only one CLIENT allowed) | ||
- suppress nested by kind + convention it follows (only one HTTP CLIENT allowed, but outer DB -> nested HTTP is ok) | ||
- suppress none | ||
|
||
Note: some conventions may explicitly include others (e.g. [FaaS](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/faas.md) may include [HTTP](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/http.md)). For the purpose of this document, we assume span follows single convention and instrumentation MUST include explicitly mentioned sub-conventions (except general) only. | ||
[General](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/span-general.md) convention attributes are allowed on all spans when applicable, pure generic instrumentation are not suppressed. | ||
|
||
#### Suppression Example | ||
|
||
- Instrumentation - HTTP Client 1: | ||
- starts span | ||
- SDK checks if there is a span of the same kind + type on the parent context | ||
- there is no similar span | ||
- SDK returns new span | ||
- stores it in the context (presumably current) | ||
- SDK stores span in context behind `ContextKey` which is combination of kind and type (depending on configuration) | ||
- Instrumentation - HTTP Client 2 (in scope of span1): | ||
- [Optional suggestion]: add API to check if span should be started (optimization for power-users) | ||
- starts span | ||
- SDK checks if there is a span of the same kind + type on the parent context | ||
- there is one already | ||
- SDK returns suppressed (non-recording, propagating span, it's also never becomes current). | ||
- HTTP client continues instrumentation with suppressed span | ||
- adds attributes: noop | ||
- injects context: duplicate work, since parent context is already injected and suppressed span carries the same context | ||
- makes suppressed span current: current is not modified, scope is noop | ||
- Instrumentation - TLS (imaginary, in scope of span2) | ||
- works as if it's in scope of span1 | ||
- current is span1 | ||
- if explicit parent was used (span2), it carries span1 context | ||
|
||
Suppression logic is configurable and encapsulated in span builder SDK implementation - specific library instrumentation should not depend on configuration, e.g.: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What should be the implementing libraries / sdks default? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. libraries never think about the configuration. I believe suppression by kind + type is a good default since it shows all layers suppressing duplicate instrumentations only. |
||
|
||
- suppressing by kind only - context key does not distinguish types within kind | ||
- suppressing by kind + type - context key is per type and kind | ||
- suppressing none | ||
|
||
## Internal details | ||
|
||
Client libraries frequently use common protocols (HTTP, gRPC, DB drivers) to perform RPC calls, which are usually instrumented by OpenTelemetry. | ||
At the same time, client library is rarely a thin client and may need its own instrumentation to | ||
|
||
- connect traces to application code | ||
- provide extra context: | ||
- duration of composite operations | ||
- overall result of composite operation | ||
- any extra library-specific information not available on transport call span | ||
|
||
Both, client library 'logical' and transport 'physical' spans are useful. They also rarely can be combined together because they have 1:many relationship. | ||
|
||
So instrumentations form *layers*, where each layer follows specific convention or no convention at all. Spans that are not convention-specific (generic manual spans, INTERNAL spans) are never suppressed. | ||
|
||
*Example*: | ||
|
||
- HTTP SERVER span | ||
- DB CLIENT call - 1 | ||
- HTTP CLIENT call - 1 | ||
- DNS CLIENT | ||
- TLS CLIENT | ||
- HTTP CLIENT call - 2 | ||
|
||
There are two HTTP client spans under DB call, they are children of DB client spans. DB spans follow DB semantics only, HTTP spans similarly only follow HTTP semantics. If there are other layers of instrumentation (TLS) - it happens under HTTP client spans. | ||
|
||
### Duplication problem | ||
|
||
Duplication is a common issue in auto-instrumentation: | ||
|
||
- HTTP clients frequently are built on top of other HTTP clients, making multiple layers of HTTP spans. | ||
- Web frameworks have multiple layers, and auto-instrumentation is applied to many of them, causing duplicate instrumentation, depending on the app configuration. | ||
- Libraries may decide to add native instrumentation for common protocols like HTTP or gRPC: | ||
- to support legacy correlation protocols | ||
- to make better decisions on failures (e.g. 404, 409) | ||
- give better library-specific context | ||
- support users that can't use auto-instrumentation | ||
|
||
So what happens in reality without attempts to suppress duplicates: | ||
|
||
- HTTP SERVER span (middleware) | ||
- HTTP SERVER span (servlet) | ||
- Controller INTERNAL span | ||
- HTTP CLIENT call - 1 (Google HTTP client) | ||
- HTTP CLIENT call - 1 (Apache HTTP client) | ||
|
||
#### Proposed solution | ||
|
||
Suppress inner layers of the same instrumentation, i.e. above picture translates into: | ||
|
||
- HTTP SERVER span (middleware) | ||
- Controller INTERNAL span | ||
- HTTP CLIENT call - 1 (Google HTTP client) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can I ask what data is lost in this process of suppression? Is the intent that all the data that would be part of the spans that are now being surpressed, is that being discard, aggregated into the parent span? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We assume that two different HTTP clients send almost identical spans (if they follow OTel HTTP spec). The duration might be slightly different, but essentially they report the same thing. So we'll keep the outer span and fully disregard the inner span. |
||
|
||
To do so, instrumentation declares convention (`InstrumentationType`) when starting a span and SDK: | ||
|
||
- checks if span with same kind + type is present on context already | ||
- yes: backs off, never starting a span | ||
- no: starts a span and sets it on the context, e.g. by writing a span on the context under the key (where key is function of kind and type). | ||
|
||
For this to work between different instrumentations (native and auto), the API to set type must be in Trace API. | ||
|
||
### Configuration | ||
|
||
Suppression strategy should be configurable on SDK side - instrumentation does not need to know about it. | ||
|
||
Configuration is needed since: | ||
|
||
- backends don't always support nested CLIENT spans (extra hints needed for Application Map to show outbound connection) | ||
- users may prefer to reduce verbosity and costs by suppressing spans of same kind | ||
|
||
So following strategies should be supported: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What I'd really like to know is what information is going to be lost because of this? For example, during an incident the suppressed data could help provide useful clues into the problematic calls to either downstreams or internal calls that would have made up part of the span. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, e.g. you have DB running over HTTP.
I understand your concern about suppression in general, what I hear from users some of them may want to sacrifice a bit of verbosity to save money. And kind + type suppression only suppresses duplicate data for those who can afford it. |
||
|
||
- suppress all nested of same kind | ||
- suppress all nested of same kind + type (default?) | ||
- suppress none (mostly for debugging instrumentation code and internal observability) | ||
|
||
#### Suppression examples: kind and type | ||
|
||
- HTTP SERVER | ||
- HTTP CLIENT - ok | ||
|
||
- HTTP SERVER | ||
- HTTP SERVER - suppressed | ||
|
||
- HTTP SERVER | ||
- MESSAGING CONSUMER - ok // open questions around receive/process | ||
|
||
- MESSAGING PRODUCER // open questions around receive/process | ||
- HTTP CLIENT - ok | ||
|
||
- MESSAGING CLIENT | ||
- HTTP CLIENT - ok | ||
|
||
#### Suppression examples: kind | ||
|
||
- HTTP SERVER | ||
- HTTP CLIENT - ok | ||
|
||
- HTTP SERVER | ||
- HTTP SERVER - suppressed | ||
|
||
- HTTP SERVER | ||
- MESSAGING CONSUMER - ok | ||
|
||
- MESSAGING PRODUCER // open questions around client/producer uncertainty | ||
- HTTP CLIENT - ok | ||
|
||
- MESSAGING CLIENT | ||
- HTTP CLIENT - suppressed | ||
|
||
### Implementation | ||
|
||
#### Trace API | ||
|
||
[Proof of concept in Java](https://github.com/lmolkova/opentelemetry-java/pull/1) | ||
|
||
- introduces a new `SuppressedSpan` implementation. It's almost the same as `PropagatingSpan` (i.e. sampled-out), with the difference that `makeCurrent` is noop | ||
- `PropagatingSpan` can't be reused since sampling out usually assumes to sample out this and all following downstream spans. Sampled out span becomes current and causes side-effects (span.current().setAttribute()) not compatible with suppression. | ||
- adds extendable `InstrumentationType` and `SpanBuilder.setType(InstrumentationType type)` | ||
- adds optional optimization `Tracer.shouldStartSpan(name, kind, type)` | ||
|
||
#### Instrumentation API | ||
|
||
[Instrumentation API in Java implementation](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/instrumenter/SpanKey.java) with suppression by type. | ||
|
||
## Trade-offs and mitigation | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would like to more included in this section, I find it hard to suggest this to internal users without understanding the implications to the data. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what do you want to see in this section? How to choose a suppression strategy setting for a service assuming it will make it to the spec or what are the trade-offs of introducing these API changes? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wooh, sorry. I forgot I had made a comment here. Something that tells me the risk / potential loss of data that I am expecting. |
||
|
||
Trace API change is needed to support native library instrumentations - taking dependency on unstable experimental instrumentation API (or common contrib code) is not a good option. | ||
|
||
## Prior art and alternatives | ||
|
||
- Terminal context: suppressing anything below | ||
- Exposing spans stack and allowing to walk it accessing span properties | ||
- Suppress all nested spans of same kind | ||
- Make logical calls INTERNAL | ||
|
||
Discussions: | ||
|
||
- [Client library + auto instrumentation](https://github.com/open-telemetry/opentelemetry-specification/issues/1767) | ||
- [Prevent duplicate telemetry when using both library and auto instrumentation](https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/903) | ||
- [Generic mechanism for preventing multiple Server/Client spans](https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/465) | ||
- [Proposal for modeling nested CLIENT instrumentation](https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/1822) | ||
- [SpanKind with layers](https://github.com/open-telemetry/opentelemetry-specification/issues/526) | ||
- [Should instrumentations be able to interact with or know about other instrumentations](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/369) | ||
- [Server instrumentations should look for parent spans in current context before extracting context from carriers](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/445) | ||
- [CLIENT spans should update their parent span's kind to INTERNAL](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/456) | ||
|
||
## Open questions | ||
|
||
- Should we suppress by direction (inbound/outbound) instead of kind? Suggestion: let's fix current messaging quirks that cause concern here | ||
- Messaging CONSUMER spans (CONSUMER *receive* is a parent of CONSUMER *process* and seem to violate [kind definition](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind)) | ||
- I'm going to drive changing it in messaging spec (*receive* is normal CLIENT span based on the ) | ||
- Messaging PRODUCER/CLIENT uncertainty: it's changing, see [Creation and publishing](https://github.com/open-telemetry/oteps/pull/173#discussion_r704737276) | ||
- Proposal: They are two different things | ||
- Creation is PRODUCER | ||
- Publish is CLIENT | ||
|
||
- Should we have `Tracer.shouldStart(spanName, kind, type, ?)` or `SpanBuilder.shouldStart()` methods to optimize instrumentation. If it's not called, everything works, just not too efficient | ||
|
||
- Backends need hint to separate logical CLIENT spans from physical ones | ||
- Good default (suppress by kind or kind + type). Up to distro + user. | ||
- Should we have configuration option to never suppress anything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you define the term "instrumentation layer" in this document? I'm not sure if it refers to a single span, or to the whole span tree that's created by a single library.