Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[consumer] Add new otlp-centric error type #11085

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

TylerHelmuth
Copy link
Member

Description

Adds a new otlp-centric error type. This type improves the ability for otlp exporters to report errors to otlp consumers, while also giving any component the ability to properly interpret http/grpc/retriable errors.

With this type, the http/grpc/retriable datasets are always considered according to the OTLP specification. If a non-otlp component receives this error type, that can properly map the http/grpc/retryable error to their form based on OTLP's interpretation of the underlying data.

For example, if the error type stored HTTP status 404 a component checking for the http status code within the error would know that the 404 means Not Found according to OTLP. If the component needed to translate 404 Not Found to their own response code, they would be able to trust the meaning.

Link to tracking issue

Continuation of #9041
Closes #7047

Testing

Added tests

Documentation

Added godoc comments

@TylerHelmuth TylerHelmuth force-pushed the consumer-otlp-error-type branch from 41e8af2 to b5c90ce Compare September 9, 2024 17:02
@@ -13,6 +13,8 @@ type permanent struct {

// NewPermanent wraps an error to indicate that it is a permanent error, i.e. an
// error that will be always returned if its source receives the same inputs.
//
// Deprecated: [v0.110.0] use Error instead.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe NewPermanent is no longer needed if we use Error. I marked this deprecated in this PR, but we could do it in a followup PR.

If we agree to do it in this PR I'll update all the usages to fix the linter.

Copy link

codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 72.97297% with 20 lines in your changes missing coverage. Please review.

Project coverage is 91.78%. Comparing base (59c083f) to head (5acc22d).
Report is 480 commits behind head on main.

Files with missing lines Patch % Lines
consumer/consumererror/error.go 67.56% 10 Missing and 2 partials ⚠️
...sumererror/internal/statusconversion/conversion.go 76.47% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #11085      +/-   ##
==========================================
- Coverage   91.85%   91.78%   -0.08%     
==========================================
  Files         416      418       +2     
  Lines       19925    19996      +71     
==========================================
+ Hits        18302    18353      +51     
- Misses       1245     1263      +18     
- Partials      378      380       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +24 to +25
// Experimental: This API is at the early stage of development and may change
// without backward compatibility
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to mark consumer as 1.0 if we have this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the plan from the other PR was to quickly remove it. We could also choose to not start with it.

Copy link
Member Author

@TylerHelmuth TylerHelmuth Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#9041 includes a Combine function to aggregate multiple Error together. I can add that feature to this PR if it is still necessary.

Copy link
Member

@mwear mwear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this approach is becoming too specific to be useful. HTTP and gRPC status codes (and conversion between them) should be protocol agnostic. Is there any way we can design this so that we have a general consumer error that contains HTTP and gRPC status codes that can be extended to add protocol specific functionality as needed?

if e.retryable {
return status.New(codes.Unavailable, e.Error())
}
return status.New(codes.Internal, e.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we considered codes.Unknown?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am an http fanboy, if there is any improperly used grpc codes please let me know. Is codes.Unknown the best thing to use here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think codes.Unknown is probably the best default. It's the default in our HTTP->gRPC code conversion logic, and is the fallback code according to the list that we refer to for conversions.

@TylerHelmuth
Copy link
Member Author

I wonder if this approach is becoming too specific to be useful. HTTP and gRPC status codes (and conversion between them) should be protocol agnostic. Is there any way we can design this so that we have a general consumer error that contains HTTP and gRPC status codes that can be extended to add protocol specific functionality as needed?

@mwear the discussion @evan-bradley @bogdandrutu @dmitryax and I had led us to the conclusion that a isolated http status or grpc status, with no knowledge of its source, led to ambiguity. Essentially a protocol-agnostic approach meant we could not be sure what the code meant.

For example, a non-OTLP exporter, call it exporter A, may use http protocol to make connections with some backend, but, similar to OTLP, it may be speaking a protocol on top of HTTP. That protocol could return a 500 http status code when it is trying to communicate to its clients that it was their fault. This would be in violation of http's definition of 500, but as long as the sever and client have agreed upon their protocol they'll be able to communicate effectively.

In the situation where exporter A wants to propagate this error back up the pipeline, recording a generic HTTP 500 status code will cause problems. In this scenario the 500 status code is representing a client error, but if that code made it back to a receiver than did not speak the same protocol as exporter A, such as the OTLP receiver, it will interpret the code incorrectly. In this scenario the OTLP receiver should return an error that indicates the client had a problem, likely a 400, but it would see the 500 and interpret it incorrectly as a server error.

Our solution to this was to force a protocol into the error and the one that made the most sense was OTLP. If we instead included an additional piece of metadata, such as WithProtocol, then all the receivers are left with the work to map from all the different possible protocols to their own. By building in the OTLP protocol, the work is now on the exporter, which only has to do a single mapping of its protocol to OTLP. Then all receivers only have to do a single mapping of OTLP to their protocol. And any component that speaks OTLP gets off easy.

If, in the initial scenario, there is a receiver A that speaks the same protocol, then exporter A could return its own special error type and receiver A could look for that error type. We aren't restricting what types of errors an exporter can send back up the pipeline, only providing a consistent, reliable container that we'll recommend using.

@jmacd
Copy link
Contributor

jmacd commented Sep 17, 2024

@TylerHelmuth I would like to add to the list of considerations that there are a couple of existing places where I think we need a protocol-agnostic error status. Discussed in #11183, there are situations where the exporterhelper's Timeout sender and Retry sender may (IMO) want to abort a request because the context deadline is shorter than the configured timeout or backoff interval. The root cause is that Golang's Context is protocol agnostic, and we have failures induced by the context which are not Canceled and not DeadlineExceeded. I'm looking for an Aborted status code. How would you propose to return protocol-agnostic errors from exporterhelper?

See #11198.

Have we considered using the gRPC error code space, since it was designed as a universal solution, while preserving protocol-specific error values too? For this case discussed above, I would expect to use a codes.Aborted w/ a message--it's a gRPC error code but it's a protocol-agnostic error.

@TylerHelmuth
Copy link
Member Author

@jmacd I may not be understanding the issue well enough, but the intention is that the exporter in question translate its grpc status to the OTLP equivalent grpc status and use NewOTLPGRPCError. That puts the burden of translation on the exporter, who is the most capable to understand both its own protocol and otlp, instead of the receivers.

Have we considered using the gRPC error code space

This is what we do today. It works for the most part, but fails as a transport mechanism for http since translating to and from http status codes and grpc status codes is lossy.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added Stale and removed Stale labels Oct 18, 2024
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 16, 2024
Copy link
Member

@dmitryax dmitryax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good to me. It currently covers only one type of internal collector error Retryable and this is the only one we use at the moment. If we want to add something else in the future, this API doesn't restrict us from doing that. Otherwise, we can just use NewOTLPHTTPStatus for internal collector errors if more specifics is needed.

@bogdandrutu PTAL

return true
}
switch e.httpStatus {
case http.StatusTooManyRequests, http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't 500 Internal Server Error retryable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we follow the OTLP spec for determining whether an error is retryable or non-retryable, and according to the OTLP spec 500 errors aren't retryable.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added Stale and removed Stale labels Dec 10, 2024
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Blocked
Development

Successfully merging this pull request may close these issues.

Investigate how to expose exporterhelper.NewThrottleRetry in the consumererror
6 participants