Thanos, Prometheus and Golang version used:
Thanos: v0.13.0
Prometheus: 2.15.2
Golang: 1.14.1
Object Storage Provider:
GCS
What happened:
We noticed a continued increase in the thanos_objstore_bucket_operation_failures_total metric while not seeing any errors in the log file. Looking at other metrics, it appears that when a request from query is timed out, cancelled, or aborted, the thanos_objstore_bucket_operation_failures_total metric increases. We also confirmed this on the GCS side: the rate of CANCELLED API calls increased around the same times we saw the bucket operation failures metric increase.
What you expected to happen: thanos_objstore_bucket_operation_failures_total not to increase.
How to reproduce it (as minimally and precisely as possible):
Run a large query from querier that will hit its query timeout.
Full logs to relevant components:
I can include logs if desired but they just show normal block caching operations. Even in debug mode there is nothing about a failure.
It seems to me like we need to add something like if errors.Cause(err) != context.Canceled { increaseOperationFailures() }. It would probably be even better if we also checked whether the gRPC request has been aborted. Help wanted!
That is what is puzzling me. Query shows the request as cancelled but store shows the request as aborted, not cancelled. Is there a timer that we could be hitting? Also, I would love to help but I don't know where in the code this is getting executed. Mind pointing me in a general direction?