Add support for HPA in mimir-distributed #3430
Thanks for opening this issue. I would be happy to review a PR for this if you have time. At the moment, only two components have had enough testing with autoscaling to be supported officially: querier and distributor. We're working on autoscaling other components, but need time to properly test their stability in our production environments. You can see the Jsonnet implementation for distributor and querier here:
What do you think about opening a PR to start?
By the way, that jsonnet template is rendered here to plain yaml in case Jsonnet is not something you're familiar with:
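For orientation, autoscaling in the Jsonnet is built on KEDA rather than a bare HPA. Below is a minimal sketch of what a KEDA `ScaledObject` for the querier can look like; the Deployment name, Prometheus address, metric query, and threshold are illustrative assumptions, not the actual rendered manifest:

```yaml
# Illustrative KEDA ScaledObject for a querier Deployment.
# Names, the Prometheus query, and the threshold are assumptions
# for this sketch, not Mimir's actual configuration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: querier
spec:
  scaleTargetRef:
    name: querier            # target Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        # Scale on in-flight query work; the exact query is an assumption.
        query: sum(cortex_query_scheduler_inflight_requests{namespace="mimir"})
        threshold: "6"
```

KEDA then creates and manages the underlying HPA itself, which is part of why ownership of `spec.replicas` matters in the rest of this thread.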
Thanks, I'll consider opening a PR if I can find some time.
What kind of testing is usually done on these? Based on the scaling docs I was under the impression that any component could be scaled up and down safely (besides some exceptions for scaling down alertmanager, ingester, and store-gateway).
Not super familiar with Jsonnet, so this is definitely helpful, thanks.
We will typically run these kinds of things in production at Grafana Cloud for at least a few weeks to uncover anything unexpected. That doesn't mean we can't start experimenting, but it would need to be marked experimental to let users know we haven't run at scale yet.
You're right that most components can be scaled up and down. Specifically, I'm referring to autoscaling, which can mean scaling up or down much more often than a human operator ever would. For example, we've needed to fine-tune shutdown behavior for queriers. It's not a big deal to close a connection once in a while when a human scales down, but if a machine is scaling queriers every few minutes, then that occasional connection-closed error can start to have a significant impact on service availability, even if it doesn't lead to data loss or any permanent issues.
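A common Kubernetes-level mitigation for that kind of scale-down churn is to give pods a drain window before they exit. This is a generic sketch of the pattern, not Mimir's actual shutdown tuning; the sleep duration and grace period are illustrative, and the container image is assumed to ship a `sleep` binary:

```yaml
# Generic drain pattern for a Deployment that is autoscaled frequently.
# Values are illustrative; Mimir's real querier shutdown handling differs.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 180
      containers:
        - name: querier
          lifecycle:
            preStop:
              exec:
                # Keep the pod alive briefly so load balancers and the
                # query path stop routing to it before the process exits.
                command: ["sleep", "40"]
```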
Hi @Logiraptor, I've opened #4229 with some additions to create HPAs for the rest of the components, since #4133 does not fully address this issue. Would love some feedback on this if you have some time to take a look, thanks.
Hi, I would like to ask: if I raise a PR that allows not setting the replicas in the helm chart on components such as distributor, querier, and query-frontend, would it be accepted? There doesn't seem to be any recent work on the linked PR. I tried adding an HPA via extra objects here: https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L3650. But since we use Argo CD, it would revert the changes made to the replicas by the HPA; the same would also happen on every deploy when using plain helm. Allowing the replica count to be left unset in the helm chart would then prevent this from happening.
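For reference, the kind of manifest added via extra objects is a standard `autoscaling/v2` HPA targeting the chart's Deployment. A sketch follows; the Deployment name and thresholds are assumptions for illustration:

```yaml
# Plain HPA deployed alongside the chart; the Deployment name follows the
# chart's usual naming pattern but is an assumption here.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-distributor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mimir-distributor
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

The conflict described above arises because both the chart (via Helm or Argo CD) and the HPA controller then write to `spec.replicas` of the same Deployment, so every sync snaps the replica count back to the templated value.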
that makes sense to me @jmichalek132, I think we had the same problem with HPA migrations in jsonnet. How would this be implemented? Perhaps a special value for replicas -
I think that sounds reasonable. Should I raise a PR for it?
yes, thank you!
So I did some testing. I started with:

[values snippet not preserved in this extract]

but helm complains when it's not null:

[error output not preserved]

So I tried:

[template snippet not preserved]

when replicas is a number, i.e. the default.
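One way such a guard can be written is with Helm's `with` block, which skips rendering when the value is null so the field is omitted and the HPA becomes the sole owner of the replica count. This is a sketch assuming a nullable `distributor.replicas` value, not the wording of the eventual PR:

```yaml
# templates/distributor/distributor-dep.yaml (illustrative location)
spec:
  # Omit spec.replicas entirely when .Values.distributor.replicas is null,
  # leaving the HPA in control. Note that `with` also skips a literal 0.
  {{- with .Values.distributor.replicas }}
  replicas: {{ . }}
  {{- end }}
```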
can you open a draft PR and we can move the discussion there? I think this is somewhat tangential to the issue discussed here. Alternatively, we can also keep the discussion going in another issue.
Hi @jmichalek132, @dimitarvdimitrov, any update on the progress of the autoscaling features? The whole Mimir community is looking forward to them, as they will provide more stability to the components.
#4687 is the furthest we have gone with autoscaling in the helm chart. That PR hasn't progressed a ton lately. This comment has some notes on what I see as the next steps for the PR: #4687 (comment). Help would be appreciated :)
+1 |
Is your feature request related to a problem? Please describe.
I have been running Grafana Mimir in our infrastructure using the Helm chart, and I'm not currently able to automatically scale Mimir in response to changes in load on the servers.
Describe the solution you'd like
It would be very helpful if Horizontal Pod Autoscalers were supported natively in the mimir-distributed helm chart.
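To make the request concrete, native support could expose a values shape along these lines. This is purely hypothetical; none of these keys exist in the chart at the time of this issue:

```yaml
# Hypothetical values.yaml shape for native HPA support in mimir-distributed;
# key names and defaults are illustrative, not an existing chart option.
distributor:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 15
    targetCPUUtilizationPercentage: 70
```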
Describe alternatives you've considered
I've attempted to add this manually, using custom manifests with the chart to create HPAs for each component (Distributor, Querier, etc.). However, this is not an optimal solution, since the `replicas` field on each Deployment and StatefulSet cannot be removed with the current templating. When using a custom HPA with the helm chart, applying any changes to one of the Deployments/StatefulSets causes a scale down to the number specified by the `replicas` field, followed by a scale up once the HPA has updated the desired replicas. In production this is not ideal, since it will potentially terminate active pods and disrupt our cluster's monitoring.

As of now, my best solution is to fork the chart and add templating to omit this field, but it would be nice if this could be officially supported.