Skip to content

Commit

Permalink
Amend Pipeline Component Telemetry RFC to add a "rejected" outcome (#…
Browse files Browse the repository at this point in the history
…11956)

### Context

The [Pipeline Component Telemetry
RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md)
was recently accepted (#11406). The document states the following
regarding error monitoring:
> For both [consumed and produced] metrics, an `outcome` attribute with
possible values `success` and `failure` should be automatically
recorded, corresponding to whether or not the corresponding function
call returned an error. Specifically, consumed measurements will be
recorded with `outcome` as `failure` when a call from the previous
component the `ConsumeX` function returns an error, and `success`
otherwise. Likewise, produced measurements will be recorded with
`outcome` as `failure` when a call to the next consumer's `ConsumeX`
function returns an error, and `success` otherwise.


[Observability requirements for stable pipeline
components](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#observability-requirements)
were also recently merged (#11772). The document states the following
regarding error monitoring:
> The goal is to be able to easily pinpoint the source of data loss in
the Collector pipeline, so this should either:
> - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an
external service, or propagated from downstream Collector components.

Because errors are typically propagated across `ConsumeX` calls in a
pipeline (except for components with an internal queue like
`processor/batch`), the error observability mechanism proposed by the
RFC implies that Pipeline Telemetry will record failures for every
component interface upstream of the component that actually emitted the
error, which does not match the goals set out in the observability
requirements, and makes it much harder to tell which component errors
are coming from from the emitted telemetry.

### Description

This PR amends the Pipeline Component Telemetry RFC with the following:
- restrict the `outcome=failure` value to cases where the error comes
from the very next component (the component on which `ConsumeX` was
called);
- add a third possible value for the `outcome` attribute: `rejected`,
for cases where an error observed at an interface comes from further
downstream (the component did not "fail", but its output was
"rejected");
- propose a mechanism to determine which of the two values should be
used.

The current proposal for the mechanism is for the pipeline
instrumentation layer to wrap errors in an unexported `downstream`
struct, which upstream layers could check for with `errors.As` to check
whether the error has already been "attributed" to a component. This is
the same mechanism currently used for tracking permanent vs. retryable
errors. Please check the diff for details.

### Possible alternatives

There are a few alternatives to this amendment, which were discussed as
part of the observability requirements PR:
- loosen the observability requirements for stable components to not
require distinguishing internal errors from downstream ones → makes it
harder to identify the source of an error;
- modify the way we use the `Consumer` API to no longer propagate errors
upstream → prevents proper propagation of backpressure through the
pipeline (although this is likely already a problem with the `batch`
prcessor);
- let component authors make their own custom telemetry to solve the
problem → higher barrier to entry, especially for people wanting to
opensource existing components.

---------

Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
  • Loading branch information
jade-guiton-dd and mx-psi authored Jan 28, 2025
1 parent b4af6f1 commit 1c4726a
Showing 1 changed file with 14 additions and 5 deletions.
19 changes: 14 additions & 5 deletions docs/rfcs/component-universal-telemetry.md
Original file line number Diff line number Diff line change
Expand Up @@ -90,11 +90,20 @@ The location of these measurements can be described in terms of whether the data
component to which the telemetry is attributed. Metrics which contain the term "produced" describe data which is emitted from the component,
while metrics which contain the term "consumed" describe data which is received by the component.

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to
whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with `outcome` as
`failure` when a call from the previous component the `ConsumeX` function returns an error, and `success` otherwise. Likewise, produced
measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and
`success` otherwise.
For both metrics, an `outcome` attribute with possible values `success`, `failure`, and `refused` should be automatically recorded,
based on whether the corresponding function call returned successfully, returned an internal error, or propagated an error from a
component further downstream.

Specifically, a call to `ConsumeX` is recorded with:
- `outcome = success` if the call returns `nil`;
- `outcome = failure` if the call returns a regular error;
- `outcome = refused` if the call returns an error tagged as coming from downstream.
After inspecting the error, the instrumentation layer should tag it as coming from downstream before returning it to the parent component.

The upstream component which called `ConsumeX` will have this `outcome` attribute applied to its produced measurements, and the downstream
component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.

Errors should be "tagged as coming from downstream" the same way permanent errors are currently handled: they can be wrapped in a `type downstreamError struct { err error }` wrapper error type, then checked with `errors.As`. Note that care may need to be taken when dealing with the `multiError`s returned by the `fanoutconsumer`. If PR #11085 introducing a single generic `Error` type is merged, an additional `downstream bool` field can be added to it to serve the same purpose instead.

```yaml
otelcol.receiver.produced.items:
Expand Down

0 comments on commit 1c4726a

Please sign in to comment.