RFC - Pipeline Component Telemetry #11406
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@           Coverage Diff           @@
##             main   #11406   +/-  ##
=======================================
  Coverage   91.61%   91.61%
=======================================
  Files         443      443
  Lines       23770    23770
=======================================
  Hits        21776    21776
  Misses       1620     1620
  Partials      374      374
☔ View full report in Codecov by Sentry.
Force-pushed from 99e3086 to 5df52e1 (Compare)
Thanks for opening this as an RFC @djaglowski!
Based on some offline feedback, I've broadened the scope of the RFC, while also clarifying that it is intended to evolve as we identify additional standards.
A few questions; I really like this proposal overall :)
Some of my comments might have been discussed before, in which case feel free to ignore me and just mark the items as resolved.
Approved with comments.
Force-pushed from 7d7a75b to a7a15e5 (Compare)
LGTM
This has enough approvals and has entered the 'final comment period'. I will merge this on 2024-11-27 if nobody blocks before then. cc @open-telemetry/collector-approvers
There are a couple of things to iron out, but I'm giving my approval already, as those are details that could be part of a follow-up PR. I don't want to block progress on dependent tasks because of those two rather small points.
This sets the level of all metrics that were not previously stabilized as alpha. Since many of these metrics will change as a result of #11406, it made sense to me to set their stability as alpha.

---------

Signed-off-by: Alex Boten <223565+codeboten@users.noreply.github.com>
I believe all feedback has been addressed. #11743 represents two follow-up items raised by @jpkrohling, but I believe the RFC is clear that some changes are anticipated.
Thanks @djaglowski
I approved this before, but I'll approve again, to make it explicit that I'm OK with the latest state of this PR.
Per #11406 (comment) I am merging this 🎉
## Description

This PR defines observability requirements for components at the "Stable" stability levels. The goal is to ensure that Collector pipelines are properly observable, to help in debugging configuration issues.

#### Approach

- The requirements are deliberately not too specific, in order to be adaptable to each specific component, and so as to not over-burden component authors.
- After discussing it with @mx-psi, this list of requirements explicitly includes things that may end up being emitted automatically as part of the Pipeline Instrumentation RFC (#11406), with only a note at the beginning explaining that not everything may need to be implemented manually. Feel free to share if you don't think this is the right approach for these requirements.

#### Link to tracking issue

Resolves #11581

## Important note regarding the Pipeline Instrumentation RFC

I included this paragraph in the part about error count metrics:

> The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so this should either:
> - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an external service, or propagated from downstream Collector components.

The [Pipeline Instrumentation RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md) (hereafter abbreviated "PI"), once implemented, should allow monitoring component errors via the `outcome` attribute, which is either `success` or `failure`, depending on whether the `Consumer` API call returned an error. Note that this does not work for receivers, or allow differentiating between different types of errors; for that reason, I believe additional component-specific error metrics will often still be required, but it would be nice to cover as many cases as possible automatically.

However, at the moment, errors are (usually) propagated upstream through the chain of `Consume` calls, so in case of error the `failure` state will end up applied to all components upstream of the actual source of the error. This means the PI metrics do not fit the first bullet point.

Moreover, I would argue that even post-processing the PI metrics does not reliably allow distinguishing the ultimate source of errors (the second bullet point). One simple idea is to compute `consumed.items{outcome:failure} - produced.items{outcome:failure}` to get the number of errors originating in a component. But this only works if output items map one-to-one to input items: if a processor or connector outputs fewer items than it consumes (because it aggregates them, or translates to a different signal type), this formula will return false positives. If these false positives are mixed with real errors from the component and/or from downstream, the situation becomes impossible to analyze by just looking at the metrics.

For these reasons, I believe we should do one of four things:
1. Change the way we use the `Consumer` API to no longer propagate errors, making the PI metric outcomes more precise. We could catch errors in whatever wrapper we already use to emit the PI metrics, log them for posterity, and simply not propagate them. Note that some components already more or less do this, such as the `batchprocessor`, but this option may in principle break components which rely on downstream errors (for retry purposes for example).
2. Keep propagating errors, but modify or extend the RFC to require distinguishing between internal and propagated errors (maybe add a third `outcome` value, or add another attribute). This could be implemented by somehow propagating additional state from one `Consume` call to another, allowing us to establish the first appearance of a given error value in the pipeline.
3. Loosen this requirement so that the PI metrics suffice in their current state.
4. Leave everything as-is and make component authors implement their own somewhat redundant error count metrics.

---------

Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
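To make the propagation problem concrete, here is a minimal sketch (my own illustration, not code from the RFC or the Collector) of how a pipeline-instrumentation wrapper around a `consumer.Traces` could record the `outcome` attribute; the wrapper type and the counter it records to are hypothetical:

```go
package pipelinesketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// instrumentedConsumer is a hypothetical stand-in for the pipeline
// instrumentation layer described in the PI RFC.
type instrumentedConsumer struct {
	next     consumer.Traces
	consumed metric.Int64Counter // hypothetical "consumed items" counter
}

func (ic *instrumentedConsumer) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
	err := ic.next.ConsumeTraces(ctx, td)
	outcome := "success"
	if err != nil {
		// Recorded as "failure" even when err was merely propagated up from a
		// component further downstream; this is the ambiguity described above.
		outcome = "failure"
	}
	ic.consumed.Add(ctx, int64(td.SpanCount()),
		metric.WithAttributes(attribute.String("outcome", outcome)))
	// Returning err means every wrapper upstream of this one records
	// "failure" for the same underlying error.
	return err
}
```

With instrumentation along these lines, the `consumed.items{outcome:failure} - produced.items{outcome:failure}` heuristic breaks down exactly as described: a hypothetical aggregating processor that consumes 100 spans in failed calls but forwards them as 10 batched spans that also fail downstream would report 100 failed consumed items and 10 failed produced items, so the difference of 90 looks like internally generated errors even though the component produced none.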
This PR adds a RFC for normalized telemetry across all pipeline components. See open-telemetry#11343

edit by @mx-psi:
- Announced on #otel-collector-dev on 2024-10-23: https://cloud-native.slack.com/archives/C07CCCMRXBK/p1729705290741179
- Announced on the Collector SIG meeting from 2024-10-30

---------

Co-authored-by: Alex Boten <223565+codeboten@users.noreply.github.com>
Co-authored-by: Damien Mathieu <42@dmathieu.com>
Co-authored-by: William Dumont <william.dumont@grafana.com>
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
To make sure everyone involved is aware: I filed a PR (#11956) to amend this RFC. I am proposing adding a third possible value for the `outcome` attribute, `rejected`, for errors propagated from downstream components.
### Context

The [Pipeline Component Telemetry RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md) was recently accepted (#11406). The document states the following regarding error monitoring:

> For both [consumed and produced] metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with `outcome` as `failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, produced measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and `success` otherwise.

[Observability requirements for stable pipeline components](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#observability-requirements) were also recently merged (#11772). The document states the following regarding error monitoring:

> The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so this should either:
> - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an external service, or propagated from downstream Collector components.

Because errors are typically propagated across `ConsumeX` calls in a pipeline (except for components with an internal queue like `processor/batch`), the error observability mechanism proposed by the RFC implies that Pipeline Telemetry will record failures for every component interface upstream of the component that actually emitted the error. This does not match the goals set out in the observability requirements, and makes it much harder to tell from the emitted telemetry which component errors are coming from.

### Description

This PR amends the Pipeline Component Telemetry RFC with the following:
- restrict the `outcome=failure` value to cases where the error comes from the very next component (the component on which `ConsumeX` was called);
- add a third possible value for the `outcome` attribute: `rejected`, for cases where an error observed at an interface comes from further downstream (the component did not "fail", but its output was "rejected");
- propose a mechanism to determine which of the two values should be used.

The current proposal for the mechanism is for the pipeline instrumentation layer to wrap errors in an unexported `downstream` struct, which upstream layers could check for with `errors.As` to determine whether the error has already been "attributed" to a component. This is the same mechanism currently used for tracking permanent vs. retryable errors. Please check the diff for details.

### Possible alternatives

There are a few alternatives to this amendment, which were discussed as part of the observability requirements PR:
- loosen the observability requirements for stable components to not require distinguishing internal errors from downstream ones → makes it harder to identify the source of an error;
- modify the way we use the `Consumer` API to no longer propagate errors upstream → prevents proper propagation of backpressure through the pipeline (although this is likely already a problem with the `batch` processor);
- let component authors make their own custom telemetry to solve the problem → higher barrier to entry, especially for people wanting to open-source existing components.

---------

Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
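To illustrate the proposed mechanism, here is a minimal sketch of an unexported `downstream` wrapper checked with `errors.As`; the type and helper names are my own assumptions, not the final implementation:

```go
package pipelinesketch

import "errors"

// downstream marks an error that has already been attributed to a
// downstream component by the pipeline instrumentation layer.
type downstream struct {
	err error
}

func (d *downstream) Error() string { return d.err.Error() }
func (d *downstream) Unwrap() error { return d.err }

// outcomeFor picks the outcome value an instrumentation layer would record
// for an error returned by the next consumer's ConsumeX call.
func outcomeFor(err error) string {
	if err == nil {
		return "success"
	}
	var d *downstream
	if errors.As(err, &d) {
		// Already attributed further downstream: the next component did not
		// fail itself, it only rejected (propagated) the failure.
		return "rejected"
	}
	return "failure"
}

// markDownstream wraps err before it is propagated upstream, so that
// upstream instrumentation layers record "rejected" rather than "failure".
func markDownstream(err error) error {
	if err == nil {
		return nil
	}
	var d *downstream
	if errors.As(err, &d) {
		return err // already wrapped once; no need to wrap again
	}
	return &downstream{err: err}
}
```

In this sketch, an instrumentation layer would call `outcomeFor` on the error returned by the next component, record the corresponding measurement, and then return `markDownstream(err)` to its caller, so the first interface that sees a bare error records `failure` and everything above it records `rejected`.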