Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query scheduler: Fix a panic in stats marshaling #8140

Merged
merged 4 commits into from
May 16, 2024

Conversation

seizethedave
Copy link
Contributor

@seizethedave seizethedave commented May 15, 2024

What this PR does

Fixes a panic in Querier services caused by request-processing goroutines that are still running for a short time after their contexts are canceled and are continuing to modify query stats while querier's scheduler attempts to marshal them.

Like so:

  1. scheduler processor invokes the querier handler.
  2. the querier handler (way down the call tree) forks off goroutines to do things. Those goroutines increment query stats. (like here)
  3. the handler context is canceled for whatever reason. (connection to frontend dies, timeout, ...)
  4. scheduler processor invokes QueryResult containing the error, response, and the same stats object those goroutines ^^ are still incrementing because they don't yet know the context has died. So the varint repr of the stats being marshaled are changing/growing during marshaling of the QueryResultRequest and the panic happens.

Which issue(s) this PR fixes or relates to

Fixes #7895

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@seizethedave seizethedave changed the title Querier scheduler: Fix a panic in stats marshaling Query scheduler: Fix a panic in stats marshaling May 15, 2024
@seizethedave seizethedave marked this pull request as ready for review May 15, 2024 04:50
@seizethedave seizethedave requested a review from a team as a code owner May 15, 2024 04:50
Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

How hard would it be to add a test for this case that reliably fails without the fix?

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can even add a test it would be great. But I wouldn't block on it. LGTM

@seizethedave
Copy link
Contributor Author

How hard would it be to add a test for this case that reliably fails without the fix?

I added a test that fails the race detector reliably without making the copy. (AFAICT, our tests in CI run with race detector enabled, right?)

davidgrant@rube mimir % go test -race -count=1 -v -run "^TestSchedulerProcessor_QueryTime" github.com/grafana/mimir/pkg/querier/worker
=== RUN   TestSchedulerProcessor_QueryTime
=== RUN   TestSchedulerProcessor_QueryTime/query_stats_enabled_should_record_queue_time
=== RUN   TestSchedulerProcessor_QueryTime/query_stats_enabled_should_not_trigger_race_detector
==================
WARNING: DATA RACE
Read at 0x00c00003c5b0 by goroutine 48:
  github.com/grafana/mimir/pkg/querier/stats.(*Stats).Size()
      /Users/davidgrant/dev/mimir/pkg/querier/stats/stats.pb.go:367 +0x32c
  github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.(*QueryResultRequest).Size()
      /Users/davidgrant/dev/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:1060 +0x144
  github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.(*QueryResultRequest).Marshal()
      /Users/davidgrant/dev/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:800 +0x28
...

I also rigged up a couple of hacky local tests that I ran 100 times and saw the expected panic. (Not including all of the querier/block store querier machinery, but with a goroutine that just spins and increments stats a million times.) But making those hacky tests deterministic would be... an undertaking. :)

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, the test looks good

@dimitarvdimitrov
Copy link
Contributor

But making those hacky tests deterministic would be... an undertaking. :)

we still track flaky tests and fix them if they become too annoying, so I think these tests are a pretty good addition

@dimitarvdimitrov
Copy link
Contributor

I'll merge before the weekend (& shutdown day)

@dimitarvdimitrov dimitarvdimitrov merged commit 1141091 into main May 16, 2024
29 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the davidgrant/querier-stat-panic branch May 16, 2024 16:42
francoposa pushed a commit that referenced this pull request May 27, 2024
* Querier: Copy the stats before marshaling.

* Add to changelog.

* Reuse Merge

* Add a test for the race detector.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Querier: panics in QueryResultRequest marshaling
3 participants