Update transient errors retry timeout and retryable status codes #1842
Conversation
@@ -289,8 +286,6 @@ def _export(self, data: TypingSequence[SDKDataT]) -> ExportResultT:
            if error.code() in [
                StatusCode.CANCELLED,
                StatusCode.DEADLINE_EXCEEDED,
Aren't cancelled and deadline-exceeded errors raised by the client itself? Not 100% sure about deadline exceeded, but I think cancelled should not be retried, as it means (if I understand correctly) that the client decided to cancel the request, for example because it received some sort of shutdown/exit signal.
It is typically the client, but I believe this is allowed to cover the case where the server does it: https://grpc.io/docs/what-is-grpc/core-concepts/#cancelling-an-rpc.
And I found there is a table that lists which codes are retryable: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlpgrpc-response.
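For reference, a minimal sketch of a retry gate built from that spec table. The status-code set below is my reading of the linked table, and the _is_retryable helper is an illustrative name, not part of the exporter's actual API:

    from grpc import RpcError, StatusCode

    # Codes the OTLP/gRPC spec table marks as retryable. Assumption: this
    # mirrors the linked table; per the spec, RESOURCE_EXHAUSTED is only
    # worth retrying when the server signals that it can recover.
    _RETRYABLE_STATUS_CODES = frozenset([
        StatusCode.CANCELLED,
        StatusCode.DEADLINE_EXCEEDED,
        StatusCode.RESOURCE_EXHAUSTED,
        StatusCode.ABORTED,
        StatusCode.OUT_OF_RANGE,
        StatusCode.UNAVAILABLE,
        StatusCode.DATA_LOSS,
    ])

    def _is_retryable(error: RpcError) -> bool:
        """Return True when the failed export should be retried."""
        return error.code() in _RETRYABLE_STATUS_CODES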
I see. Not a blocker for me then, but I wonder if there is a simple way for the exporter to cancel an in-flight request on shutdown (with or without a timeout)?
Ah, I understand your concern. Anyway, we have some work to do in making shutdown spec compliant; we can address this there.
@@ -262,14 +262,11 @@ def _translate_data(
        pass

    def _export(self, data: TypingSequence[SDKDataT]) -> ExportResultT:

        max_value = 64
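For context on the hunk above: max_value caps the exponential backoff delay between retry attempts, so the wait sequence grows 1, 2, 4, ... seconds and then stops at 64. A rough, self-contained sketch of that pattern (not the exporter's exact code) looks like this:

    import time
    from itertools import count

    from grpc import RpcError

    def _exp_backoff(max_value: int = 64):
        """Yield 1, 2, 4, ... second delays, capped at max_value (illustrative only)."""
        for exponent in count():
            yield min(2 ** exponent, max_value)

    def _export_with_retry(send, data, max_value: int = 64):
        # `send` stands in for the actual gRPC export call (an assumption here).
        for delay in _exp_backoff(max_value):
            try:
                return send(data)
            except RpcError:
                if delay == max_value:  # give up once the backoff cap is reached
                    raise
                time.sleep(delay)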
Would it be possible to make this configurable? For example, as an export_timeout (maximum duration for exporting spans) argument to the exporter?
My use case is CLI applications, where having them hang on exit for 64 seconds because a telemetry backend is down is unacceptable. For comparison, the Go implementation has two separate timeouts: ExportTimeout for the full operation and BatchTimeout for the direct publish.
For now I have a special subclass with a copy of this method to tweak the behavior, but that is a maintenance issue.
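To make the request concrete, here is a hypothetical sketch of an overall deadline wrapped around an exporter. The export_with_deadline helper and the export_timeout parameter are illustrations of the idea, not existing opentelemetry-python API, and the sketch assumes an exporter whose export() call fails quickly rather than doing its own 64-second retry:

    import time

    from opentelemetry.sdk.trace.export import SpanExportResult

    def export_with_deadline(exporter, spans, export_timeout: float = 5.0):
        """Hypothetical: retry exporter.export(), but never run past export_timeout seconds."""
        deadline = time.monotonic() + export_timeout
        delay = 1.0
        while True:
            result = exporter.export(spans)
            # Stop when the export succeeded, or when waiting for the next
            # attempt would push us past the overall deadline.
            if result is SpanExportResult.SUCCESS or time.monotonic() + delay > deadline:
                return result
            time.sleep(delay)
            delay = min(delay * 2, export_timeout)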
I will bring up the topic of supporting configurability in the next SIG meeting. The spec is not clear (at this point) in that regard, and anything we do might become a breaking change in the future. And speaking of ExportTimeout and BatchTimeout, Python also has them, as schedule_delay_millis and export_timeout_millis; I don't know why, but we are only enforcing the export timeout during force flush.
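For anyone following along, those two knobs are constructor arguments on the batch processor. A small usage sketch, with arbitrary example values and a console exporter so it stays self-contained:

    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    processor = BatchSpanProcessor(
        ConsoleSpanExporter(),        # any configured SpanExporter works here
        schedule_delay_millis=5000,   # roughly analogous to Go's BatchTimeout
        export_timeout_millis=30000,  # roughly analogous to Go's ExportTimeout
    )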
Talked about this in the SIG call. This is out of scope for this PR. The consensus was that this is not clear from the spec, so we might end up making a backwards-incompatible change by introducing support for configuring it. One thing we want to explore is making the retry async and non-blocking, and also looking at how the span processor timeout behaves. Please create another issue and we will track it there.
Description
We agreed that the OTLP exporter's transient error retry timeout is way too high, so this changes it to a more reasonable number. It also updates the list of retryable status codes for gRPC: PERMISSION_DENIED and UNAUTHENTICATED do not belong on that list.

Fixes #1670