Store: Cancelled/Aborted GRPC Requests Increment thanos_objstore_bucket_operation_failures_total #3149

Closed
ipstatic opened this issue Sep 10, 2020 · 2 comments · Fixed by #3179

@ipstatic
Contributor

Thanos, Prometheus and Golang version used:
Thanos: v0.13.0
Prometheus: 2.15.2
Golang: 1.14.1

Object Storage Provider:
GCS

What happened:
We noticed a continued increase in the thanos_objstore_bucket_operation_failures_total metric without any corresponding errors in the log file. After looking at other metrics, it appears that when a request from query is timed out, cancelled, or aborted, the thanos_objstore_bucket_operation_failures_total metric increases. We also confirmed this from the GCS side: the rate of CANCELLED API calls rose around the times the bucket operation failures metric increased.

[Screenshot: Screen Shot 2020-09-10 at 10 43 58 AM]

What you expected to happen:
thanos_objstore_bucket_operation_failures_total not to increase.

How to reproduce it (as minimally and precisely as possible):
Run a large query from the querier that will hit its query timeout.

Full logs to relevant components:
I can include logs if desired, but they just show normal block caching operations. Even in debug mode there is nothing about a failure.

@GiedriusS
Member

GiedriusS commented Sep 11, 2020

It seems to me that we need to add something like if errors.Cause(err) != context.Canceled { increaseOperationFailures() }. It would probably be even better to check whether the gRPC request has been aborted. Help wanted!

@ipstatic
Contributor Author

That is what is puzzling me. Query shows the request as cancelled, but store shows the request as aborted, not cancelled. Is there a timer that we could be hitting? Also, I would love to help, but I don't know where in the code this is executed. Mind pointing me in a general direction?
