Amend Pipeline Component Telemetry RFC to add a "rejected" outcome (#11956)

### Context

The [Pipeline Component Telemetry RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md) was recently accepted (#11406). The document states the following regarding error monitoring:

> For both [consumed and produced] metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with `outcome` as `failure` when a call from the previous component the `ConsumeX` function returns an error, and `success` otherwise. Likewise, produced measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and `success` otherwise.

[Observability requirements for stable pipeline components](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-stability.md#observability-requirements) were also recently merged (#11772). The document states the following regarding error monitoring:

> The goal is to be able to easily pinpoint the source of data loss in the Collector pipeline, so this should either:
> - only include errors internal to the component, or;
> - allow distinguishing said errors from ones originating in an external service, or propagated from downstream Collector components.

Errors are typically propagated across `ConsumeX` calls in a pipeline (except for components with an internal queue like `processor/batch`). The error observability mechanism proposed by the RFC therefore implies that Pipeline Telemetry will record failures for every component interface upstream of the component that actually emitted the error. This does not match the goals set out in the observability requirements, and makes it much harder to tell from the emitted telemetry which component an error is coming from.
### Description

This PR amends the Pipeline Component Telemetry RFC with the following:
- restrict the `outcome=failure` value to cases where the error comes from the very next component (the component on which `ConsumeX` was called);
- add a third possible value for the `outcome` attribute: `rejected`, for cases where an error observed at an interface comes from further downstream (the component did not "fail", but its output was "rejected");
- propose a mechanism to determine which of the two values should be used.

The current proposal for the mechanism is for the pipeline instrumentation layer to wrap errors in an unexported `downstream` struct, which upstream layers could look for with `errors.As` to determine whether the error has already been "attributed" to a component. This is the same mechanism currently used for tracking permanent vs. retryable errors.

Please check the diff for details.

### Possible alternatives

There are a few alternatives to this amendment, which were discussed as part of the observability requirements PR:
- loosen the observability requirements for stable components to not require distinguishing internal errors from downstream ones → makes it harder to identify the source of an error;
- modify the way we use the `Consumer` API to no longer propagate errors upstream → prevents proper propagation of backpressure through the pipeline (although this is likely already a problem with the `batch` processor);
- let component authors make their own custom telemetry to solve the problem → higher barrier to entry, especially for people wanting to open-source existing components.

---------

Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
open-telemetry · Jan 28, 2025 · 1c4726a
1 parent b4af6f1
Showing 1 changed file with 14 additions and 5 deletions.
diff --git a/docs/rfcs/component-universal-telemetry.md b/docs/rfcs/component-universal-telemetry.md
@@ -90,11 +90,20 @@ The location of these measurements can be described in terms of whether the data
 component to which the telemetry is attributed. Metrics which contain the term "produced" describe data which is emitted from the component,
 while metrics which contain the term "consumed" describe data which is received by the component.
 
-For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to
-whether or not the corresponding function call returned an error. Specifically, consumed measurements will be recorded with `outcome` as
-`failure` when a call from the previous component the `ConsumeX` function returns an error, and `success` otherwise. Likewise, produced
-measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and
-`success` otherwise.
+For both metrics, an `outcome` attribute with possible values `success`, `failure`, and `refused` should be automatically recorded,
+based on whether the corresponding function call returned successfully, returned an internal error, or propagated an error from a
+component further downstream.
+
+Specifically, a call to `ConsumeX` is recorded with:
+- `outcome = success` if the call returns `nil`;
+- `outcome = failure` if the call returns a regular error;
+- `outcome = refused` if the call returns an error tagged as coming from downstream.
+After inspecting the error, the instrumentation layer should tag it as coming from downstream before returning it to the parent component.
+
+The upstream component which called `ConsumeX` will have this `outcome` attribute applied to its produced measurements, and the downstream
+component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.
+
+Errors should be "tagged as coming from downstream" the same way permanent errors are currently handled: they can be wrapped in a `type downstreamError struct { err error }` wrapper error type, then checked with `errors.As`. Note that care may need to be taken when dealing with the `multiError`s returned by the `fanoutconsumer`. If PR #11085 introducing a single generic `Error` type is merged, an additional `downstream bool` field can be added to it to serve the same purpose instead.
 
 ```yaml
     otelcol.receiver.produced.items: