[consumer] Allow annotating consumer errors with metadata #9041
Changes from 35 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# Use this changelog template to create an entry for release notes. | ||
|
||
# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix' | ||
change_type: enhancement | ||
|
||
# The name of the component, or a single word describing the area of concern, (e.g. otlpreceiver) | ||
component: consumererror | ||
|
||
# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`). | ||
note: Introduce an `Error` type that allows recording contextual information | ||
|
||
# One or more tracking issues or pull requests related to the change | ||
issues: [7047] | ||
|
||
# (Optional) One or more lines of additional information to render under the primary note. | ||
# These lines will be padded with 2 spaces and then inserted directly into the document. | ||
# Use pipe (|) for multiline entries. | ||
subtext: | | ||
Currently allows recording status codes on consumer errors, | ||
but will be expanded in the future to record additional data. | ||
|
||
# Optional: The change log or logs in which this entry should be included. | ||
# e.g. '[user]' or '[user, api]' | ||
# Include 'user' if the change is relevant to end users. | ||
# Include 'api' if there is a change to a library API. | ||
# Default: '[user]' | ||
change_logs: [api] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,306 @@ | ||
# Consumer errors | ||
|
||
This package contains error types that should be returned by a consumer when an | ||
error occurs while processing telemetry. The error types included in this | ||
package provide functionality for communicating details about the error for use | ||
upstream in the pipeline. Ideally the error returned by a component in its | ||
`consume` function should be from this package. | ||
|
||
## Error classes | ||
|
||
**Retryable**: Errors are retryable if re-submitting data to a sink may result | ||
in a successful submission. | ||
|
||
**Permanent**: Errors are permanent if submission will always fail for the | ||
current data. Errors are considered permanent unless they are explicitly marked | ||
as retryable. | ||
Comment on lines +14 to +16:
For permanent errors, we should probably always return the "number of not accepted" entries? Is that reasonable?
I don't think we need to do that since the caller will already know how many items failed.
||
|
||
## Use cases | ||
|
||
**Retry logic**: Errors should be allowed to include information necessary to | ||
perform retries. | ||
|
||
**Indicating partial success**: Errors can indicate that not all items were | ||
accepted, for example as in an OTLP partial success message. OTLP backends will | ||
return failed item counts if a partial success occurs, and this information can | ||
be propagated up to a receiver and returned to the caller. | ||
|
||
**Communicating network error codes**: Errors should allow embedding information | ||
necessary for the Collector to act as a proxy for a backend, i.e. relay a status | ||
code returned from a backend in a response to a system upstream from the | ||
Collector. | ||
Comment on lines +28 to +31:
I think @jpkrohling had a more detailed use-case here, but would be interesting to hear what problem we are solving. If I remember correctly was something that the collector should not retry and all retries to be done by the caller, but cannot remember exactly, so better to document in details here.
I can provide more examples if you like, but the simplest one is the one I described here: the Collector can act as a proxy and relay a code from a backend back to the caller. This works for retries too, if the code is e.g. an HTTP
||
|
||
## Current targets for using errors | ||
|
||
|
||
**Receivers**: Receivers should be able to consume multiple errors from | ||
downstream and determine the best course of action based on the user's | ||
configuration. This may entail either keeping the retry queue inside of the | ||
Collector by having the receiver keep track of retries, or having the caller | ||
manage the retry queue by returning a retryable network code with relevant | ||
information. | ||
|
||
**scraperhelper**: The scraper helper can use information about errors from | ||
downstream to affect scraping. For example, if the backend is struggling with | ||
the volume of data, scraping could be slowed, or the amount of data collected | ||
could be reduced. | ||
|
||
**exporterhelper**: When an exporter returns a retryable error, the | ||
exporterhelper can use this information to retry. Permanent errors will be | ||
forwarded back up the pipeline. | ||
|
||
**obsreport**: Recording partial success information can ensure we correctly | ||
track the number of failed telemetry records in the pipeline. Right now, all | ||
records will be considered to be failed, which isn't accurate when partial | ||
successes occur. | ||
|
||
## Creating Errors | ||
|
||
Errors can be created by calling `consumererror.New(err, opts...)` where `err` | ||
is the underlying error, and `opts` is one or more of the provided options for | ||
supplying additional metadata: | ||
|
||
- `consumererror.WithGRPCStatus` | ||
- `consumererror.WithHTTPStatus` | ||
Comment on lines +61 to +62:
Same question, can these 2 be merged?
If a component uses both of these, the assumption is that it has its own HTTP<->gRPC translation, and wants to specify both statuses. The error will produce the given status if it has it, otherwise will convert from the status from the other transport.
||
|
||
The following options are not currently available, but may be made available in | ||
the future: | ||
|
||
- `consumererror.WithRetry` | ||
- `consumererror.WithPartial` | ||
- `consumererror.WithMetadata` | ||
I am not seeing any use-case for this, why do we need it?
This is for extensibility. I don't think we would add this until it was requested, but basically a custom exporter and receiver could communicate using a custom struct type by putting the struct inside the error with
Maybe remove it for now? We should probably ensure the model is extendable but not over-prescriptive for something we don't have yet
If extensibility is an explicit goal, I'd like to keep this in the design doc. I've added warnings to things that don't actually exist right now.
||
|
||
All options can be combined; we assume that the component knows what it is | ||
doing when it supplies seemingly conflicting options. | ||
|
||
Two examples: | ||
|
||
- `WithRetry` and `WithPartial` are included together: Partial successes are | ||
considered permanent errors in OTLP, which conflicts with making an error | ||
retryable by including `WithRetry`. However, per our definition of what makes | ||
a permanent error, this error has been marked as retryable, and therefore we | ||
assume the component producing this error supports retryable partial success | ||
errors. | ||
Comment on lines +84 to +89:
See my previous question about this and what it means if I know that 5 out of 10 spans failed to send.
If I know that 5/10 spans failed, then I can record that in
When in doubt leave it out. I want us to not add any API if we don't have a good use-case.
Agreed. This doesn't add any API, the point of this paragraph is just to explain how seemingly conflicting error combinations could be reconciled.
||
- `WithGRPCStatus` and `WithHTTPStatus` are included together: While the | ||
component likely only sent data over one of these transports, our errors will | ||
produce the status for the requested transport if it was recorded on the | ||
error; otherwise, they will translate the status recorded for the other | ||
transport. If both are given, we assume the component wanted to perform its | ||
own status conversion, and we will simply give the status for the requested | ||
transport without performing any conversion. | ||
|
||
**Example**: | ||
|
||
```go | ||
consumererror.New(err, | ||
	consumererror.WithRetry( | ||
		consumererror.WithRetryDelay(10 * time.Second), | ||
	), | ||
	consumererror.WithGRPCStatus(codes.InvalidArgument), | ||
) | ||
``` | ||
This example must be invalid. In general, it still seems like all these options provide too much flexibility, making it confusing to use and easy to misuse. For example, why GRPC and HTTP have to be options? Why can't we just have different constructors like
This is an example of an intentionally unusual option combination since we describe above how combinations that are invalid in OTLP are handled. I'll move this and change it to a more normal combination so we're not highlighting anything we wouldn't recommend. It is worth noting that this is only non-retryable per the OTLP spec and other protocols could consider this a retryable code. I still think options are the best path forward even if they allow states that are not valid in OTLP, but I'm open to exploring different approaches. A few questions on how we would transition from HTTP/gRPC options to constructors:
Wouldn't this still be susceptible to the same issue, where an exporter creates a gRPC status with both
Could you give an example of how we would do this?
|
||
### Retrying data submission | ||
|
||
> [!WARNING] This function is currently in the design phase. It is not available | ||
> and may not be added. The below is a design describing how this may work. | ||
|
||
If an error is transient, the `WithRetry` option corresponding to the relevant | ||
signal should be used to indicate that the error is retryable and to pass on any | ||
retry settings. These settings can come from the data sink or be determined by | ||
the component, such as through runtime conditions or from user settings. | ||
|
||
The data for the retry will be provided by the component performing the retry. | ||
This will require all processing to be completely redone; in the future, an | ||
option may be added to include data from the failed component so that this | ||
processing does not have to be repeated. | ||
|
||
To ensure only the failed pipeline branch is retried, the sequence of components | ||
that created the error will be recorded by a pipeline utility as the error goes | ||
back up the pipeline. | ||
|
||
**Note**: If retry information is not included in an error, the error will be | ||
considered permanent and will not be retried. | ||
|
||
**Usage:** | ||
|
||
```go | ||
consumererror.WithRetry( | ||
	consumererror.WithRetryDelay(10 * time.Second), | ||
) | ||
``` | ||
|
||
The delay is an optional setting that can be provided if it is available. | ||
|
||
### Indicating partial success | ||
|
||
> [!WARNING] This function is currently in the design phase. It is not available | ||
> and may not be added. The below is a design describing how this may work. | ||
|
||
If the component receives an OTLP partial success message (or other indication | ||
of partial success), it should include this information with a count of the | ||
failed records. | ||
|
||
**Usage:** | ||
|
||
```go | ||
consumererror.WithPartial(failedRecords) | ||
``` | ||
|
||
### Indicating error codes from network transports | ||
|
||
If the failure occurred due to a network transaction, the exporter should record | ||
the status code of the message received from the backend. This information can | ||
then be relayed to the receiver's caller if necessary. Note that when the | ||
upstream component reads a code, it reads a code for its desired transport, and | ||
the code may be translated depending on whether the input and output transports | ||
are different. For example, a gRPC exporter may record a gRPC status. If a gRPC | ||
receiver reads this status, it will be exactly the provided status. If an HTTP | ||
receiver reads the status, it wants an HTTP status, and the gRPC status will be | ||
converted to an equivalent HTTP code. | ||
|
||
**Usage:** | ||
|
||
```go | ||
consumererror.WithGRPCStatus(codes.InvalidArgument) | ||
consumererror.WithHTTPStatus(http.StatusTooManyRequests) | ||
``` | ||
|
||
### Including custom data | ||
|
||
> [!WARNING] This function is currently in the design phase. It is not available | ||
> and may not be added. The below is a design describing how this may work. | ||
|
||
Custom data can be included as well for any additional information that needs to | ||
be propagated back up the pipeline. It is up to the consuming component if or | ||
how this data will be used. | ||
|
||
**Usage:** | ||
|
||
```go | ||
consumererror.WithMetadata(MyMetadataStruct{}) | ||
``` | ||
|
||
To keep error analysis simple when looking at an error upstream in a pipeline, | ||
the component closest to the source of an error or set of errors should make a | ||
decision about the nature of the error. The following are a few places where | ||
special considerations may need to be made. | ||
|
||
## Using errors | ||
|
||
### Fanouts | ||
|
||
When a fanout receives multiple errors, it will combine them with | ||
`(consumererror.Error).Combine(errs...)` and pass them upstream. The upstream | ||
component can then later pull all errors out for analysis. | ||
|
||
### Retrieving errors | ||
|
||
> [!WARNING] This functionality is currently experimental, and the description | ||
> here is for design purposes. The code snippet may not work as-written. | ||
|
||
When a receiver gets a response that includes an error, it can get the data out | ||
by doing something similar to the following. Note that this uses the `ErrorData` | ||
type, which is for reading error data, as opposed to the `Error` type, which is | ||
for recording errors. | ||
|
||
```go | ||
cerr := consumererror.Error{} | ||
var errData []consumererror.ErrorData | ||
|
||
if errors.As(err, &cerr) { | ||
errData = cerr.Data() | ||
|
||
for _, data := range errData { | ||
data.HTTPStatus() | ||
data.Retryable() | ||
data.Partial() | ||
} | ||
} | ||
``` | ||
|
||
### Error data | ||
|
||
> [!WARNING] The description below is a design proposal for how this | ||
> functionality may work. See `error.go` within this package for the current | ||
> functionality. | ||
|
||
Obtaining data from an error can be done using an interface that looks something | ||
like this: | ||
|
||
```go | ||
type ErrorData interface { | ||
// Returns the underlying error | ||
Error() error | ||
|
||
// Second return value is `false` if no code is available. | ||
HTTPStatus() (int, bool) | ||
|
||
// Second return value is `false` if no code is available. | ||
GRPCStatus() (*status.Status, bool) | ||
|
||
// Second return value is `false` if no retry information is available. | ||
Retryable() (Retryable, bool) | ||
|
||
// Second return value is `false` if no partial counts were recorded. | ||
Partial() (Partial, bool) | ||
} | ||
|
||
type Retryable struct {} | ||
|
||
// Returns nil if no delay was set, indicating to use the default. | ||
// This makes it so a delay of `0` indicates to resend immediately. | ||
func (r *Retryable) Delay() *time.Duration {} | ||
|
||
type Partial struct {} | ||
``` | ||
|
||
## Other considerations | ||
|
||
### Mixed error classes | ||
|
||
When a receiver sees a mixture of permanent and retryable errors from downstream | ||
in the pipeline, it must first consider whether retries are enabled within the | ||
Collector. | ||
|
||
**Retries are enabled**: Ignore the permanent errors and retry data submission | ||
only for components that indicated retryable errors. | ||
|
||
**Retries are disabled**: In an asynchronous pipeline, simply do not retry any | ||
data. In a synchronous pipeline, the receiver should return a permanent error | ||
code indicating to the caller that it should not retry the data submission. This | ||
is intended to not induce extra failures where we know the data submission will | ||
fail, but this behavior could be made configurable by the user. | ||
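The decision described above could be sketched as a small function. The `branchResult` and `decision` types and the `decide` function are hypothetical names used only to illustrate the rules; they are not part of this package.

```go
package main

import "fmt"

// branchResult is a hypothetical summary of the error from one downstream
// component.
type branchResult struct {
	component string
	retryable bool
}

// decision describes what a receiver does with a mixture of errors.
type decision struct {
	retry           []string // components to resubmit data to
	returnPermanent bool     // tell the caller not to retry the submission
}

// decide applies the rules for mixed error classes.
func decide(results []branchResult, retriesEnabled, synchronous bool) decision {
	if retriesEnabled {
		// Ignore permanent errors; retry only the retryable branches.
		var d decision
		for _, r := range results {
			if r.retryable {
				d.retry = append(d.retry, r.component)
			}
		}
		return d
	}
	// Retries are disabled: nothing is retried inside the Collector. In a
	// synchronous pipeline the caller is told not to retry either.
	return decision{returnPermanent: synchronous}
}

func main() {
	results := []branchResult{
		{component: "exporter/a", retryable: false},
		{component: "exporter/b", retryable: true},
	}

	fmt.Printf("%+v\n", decide(results, true, true))
	fmt.Printf("%+v\n", decide(results, false, true))
	fmt.Printf("%+v\n", decide(results, false, false))
}
```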
|
||
### Signal conversion | ||
|
||
When converting between signals in a pipeline, the connector performing the | ||
conversion is expected to translate any signal item counts in the error. If the | ||
converted count cannot be determined, the full count of pre-converted signals | ||
should be returned. | ||
|
||
### Asynchronous processing | ||
|
||
|
||
The use of any components that do asynchronous processing in a pipeline will cut | ||
off error backpropagation at the asynchronous component. The asynchronous | ||
component may communicate error information using the Collector's own signals. | ||
|
||
## Transitioning | ||
|
||
> [!WARNING] This functionality is currently in the design phase. It is not | ||
> available and may not be added. The below is a design describing how this may | ||
> work. | ||
|
||
The following describes how to transition to these error types: | ||
|
||
- `NewPermanent`: To transition to new permanent errors, call | ||
`consumererror.New` with the relevant metadata included in the invocation. | ||
Errors will be permanent by default going forward. | ||
- `New[Traces|Metrics|Logs]`: These functions will be deprecated in favor of | ||
having the caller provide the data to retry. Current uses can invoke | ||
`consumererror.New` with the `WithRetry` option to retry a request. | ||
- `exporterhelper.NewThrottleRetry`: This will be replaced with `WithRetry`, and | ||
can follow a similar approach as above. | ||
|
||
`consumererror.IsPermanent` will be deprecated in favor of checking whether | ||
retry information is available, and only retrying if it has been provided. This | ||
will be possible by calling `ErrorData.Retryable()` and checking for retry | ||
information. |
This needs to carry some information:
This is just a classification of error, I cover the details of what retryable errors contain here: https://github.com/open-telemetry/opentelemetry-collector/pull/9041/files#diff-e0fa8222784a2c0c2683b70e2b6a7ccf2c54b6477acd7e2518839e38060bdcf5R90.
We discussed this last week and determined that the caller can provide the data to retry, but we're going to focus on exactly what goes into a retryable error after this PR. Right now we're mainly focusing on the HTTP/gRPC errors.