Update MimirSchedulerQueriesStuck alert and runbook to reflect querier auto-scaling #3223

Merged
pracucci merged 4 commits into main from increase-stuck-query-timeframe
Oct 15, 2022

treid314
Contributor

What this PR does

With our current HPA querier auto-scaling setup, it sometimes takes more than a minute to detect that we need to scale up and to request the additional querier replicas. This change allows more time for that to happen before the alert triggers.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@treid314 requested review from osg-grafana and a team as code owners on October 14, 2022 19:27
Collaborator

@pracucci left a comment

> With our current HPA querier auto-scaling setup, it sometimes takes more than a minute to detect that we need to scale up and to request the additional querier replicas. This change allows more time for that to happen before the alert triggers.

The alert has a for: 5m, so the condition needs to hold for 5 consecutive minutes before the alert fires. The [1m] is just the lookback window. Actually, increasing the [1m] window potentially increases the chances that the alert fires. I don't think this change does what you want.

@treid314
Contributor Author

Even if it's the minimum over that timeframe, not the max? My thought was that the timeframe would cover the period after the queriers have been added and the queues have hopefully dropped back to 0.

@pracucci
Collaborator

> Even if it's the minimum over that timeframe, not the max? My thought was that the timeframe would cover the period after the queriers have been added and the queues have hopefully dropped back to 0.

Right, it's the min. Yeah, it doesn't change anything anyway. I think you want to increase the for duration.
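
To make the point about the window versus the for duration concrete, here is a rough Python sketch rather than the actual alert rule. It simulates an expression of the form "minimum queue length over a lookback window > 0" combined with a for duration. The 15s step, the sample values, and the names `min_over_time_sim` and `alert_fires` are all assumptions made for illustration, and the window handling is simplified compared to how Prometheus selects samples.

```python
from typing import List

STEP_SECONDS = 15  # assumed scrape/evaluation interval (hypothetical)


def min_over_time_sim(samples: List[float], idx: int, window_seconds: int) -> float:
    """Minimum of the samples in the lookback window ending at index idx.

    Simplified: the window holds window_seconds / STEP_SECONDS samples,
    or fewer at the start of the series.
    """
    n = max(1, window_seconds // STEP_SECONDS)
    start = max(0, idx - n + 1)
    return min(samples[start: idx + 1])


def alert_fires(samples: List[float], window_seconds: int, for_seconds: int) -> bool:
    """True if 'min over window > 0' holds continuously for at least for_seconds."""
    pending_since = None
    for idx in range(len(samples)):
        t = idx * STEP_SECONDS
        if min_over_time_sim(samples, idx, window_seconds) > 0:
            if pending_since is None:
                pending_since = t  # condition just became true
            if t - pending_since >= for_seconds:
                return True
        else:
            pending_since = None  # any false evaluation resets the timer
    return False


# Hypothetical incident: the queue stays non-empty for 6 minutes while new
# queriers come up, then drains to 0 for the next 10 minutes.
queue = [10.0] * 24 + [0.0] * 40

print(alert_fires(queue, window_seconds=60, for_seconds=300))   # [1m] window, for: 5m
print(alert_fires(queue, window_seconds=300, for_seconds=300))  # [5m] window, for: 5m
print(alert_fires(queue, window_seconds=60, for_seconds=600))   # [1m] window, for: 10m
```

In this made-up scenario, widening the window leaves the result unchanged because the minimum stays above 0 for as long as the queue is continuously non-empty. Only the longer for duration gives the scale-up time to finish before the alert fires.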

Collaborator

@pracucci left a comment

Thanks!

@pracucci merged commit 485765c into main on Oct 15, 2022
@pracucci deleted the increase-stuck-query-timeframe branch on October 15, 2022 10:08