Scaling #85

Open
WyriHaximus opened this issue Aug 29, 2023 · 3 comments
@WyriHaximus (Contributor)

How is everyone scaling this chart? I'm looking for ways to scale up when the webserver gets busier, but also when there are a ton of items in the queue and my instance needs more workers to process them.

@abbottmg (Contributor)

Right now I have a manually written autoscaler that watches CPU on the web container. For sidekiq I made an entry in values for each queue, so each queue gets its own ReplicaSet, with default pod count set in values. I then keep a loose eye out for traffic jams and manually scale. That mostly only matters if I catch a burst of dead jobs and retry them all.
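For reference, the CPU-watching part can also be expressed natively as a HorizontalPodAutoscaler instead of a hand-rolled autoscaler. A minimal sketch; the Deployment name `mastodon-web` and the thresholds are illustrative assumptions, not chart defaults:

```yaml
# Sketch: scale the web Deployment on average CPU utilization.
# Names and numbers are assumptions; adjust to your release/values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mastodon-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mastodon-web        # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up above ~70% average CPU
```

This requires the metrics-server (or equivalent) to be installed so the `Resource` metric type has data.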

I am interested in learning whether kubernetes supports custom metrics for autoscaling and if so how I could publish some sidekiq metrics to use there.

@WyriHaximus (Contributor, Author)

> Right now I have a manually written autoscaler that watches CPU on the web container. For sidekiq I made an entry in values for each queue, so each queue gets its own ReplicaSet, with default pod count set in values. I then keep a loose eye out for traffic jams and manually scale. That mostly only matters if I catch a burst of dead jobs and retry them all.

Same, but watching CPU on sidekiq instead since I'm the only user.

> I am interested in learning whether kubernetes supports custom metrics for autoscaling and if so how I could publish some sidekiq metrics to use there.

Kubernetes supports custom metrics, but you might want to have a look at keda.sh as it takes that to a whole new level.
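Since Sidekiq queues are Redis lists, KEDA's `redis` scaler can scale a worker Deployment straight off queue length, with no metrics pipeline in between. A sketch under assumed names (`mastodon-sidekiq-default` Deployment, `redis-master` Service, `queue:default` key; the threshold is illustrative):

```yaml
# Sketch: KEDA ScaledObject scaling Sidekiq workers on Redis list length.
# All names and the 100-item threshold are assumptions for illustration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sidekiq-default
spec:
  scaleTargetRef:
    name: mastodon-sidekiq-default   # hypothetical worker Deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: redis
      metadata:
        address: redis-master:6379   # hypothetical Redis endpoint
        listName: "queue:default"    # Sidekiq stores each queue as a Redis list
        listLength: "100"            # target items per replica
```

KEDA creates and manages the underlying HPA for you, which is why it can feel like "a whole new level" compared to wiring up custom metrics by hand.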

@timetinytim (Contributor)

I can share a bit about what we're doing in production for mastodon.social and mastodon.online.

Sidekiq

In our clusters, we have deployed a prometheus exporter that exports sidekiq queue statistics, and we ingest those into datadog. We have the datadog operator installed on the cluster, which means we can set up a HorizontalPodAutoscaler that scales based on a datadog metric, in this case sidekiq queue latency. If the latency rises past a certain point, we can assume jobs are arriving faster than they can be processed, so we scale up the pods.
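Wired up as an HPA, that pattern uses the `External` metric type served by the Datadog Cluster Agent. A sketch, not our exact config; the Deployment name, metric name, and latency target are all illustrative:

```yaml
# Sketch: HPA scaling Sidekiq workers on an external (Datadog-served) metric.
# Metric/Deployment names and the 30s latency target are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mastodon-sidekiq-default   # hypothetical worker Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: sidekiq.queue.latency   # hypothetical Datadog metric name
          selector:
            matchLabels:
              queue: default
        target:
          type: Value
          value: "30"   # scale up once queue latency exceeds ~30s
```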

I haven't done this myself, but I know there's a prometheus operator available for kubernetes. You should be able to set up a custom APIService to grab metrics from the operator, and set up an HPA off of those in a similar fashion.
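If the prometheus-adapter is what backs that custom APIService, the glue is a rule in its configuration that exposes a scraped series as an external metric the HPA can query. A sketch; the series name depends entirely on which sidekiq exporter you run:

```yaml
# Sketch: prometheus-adapter rule exposing a sidekiq latency series
# as an external metric. The series name is an assumption about the exporter.
rules:
  external:
    - seriesQuery: 'sidekiq_queue_latency_seconds{queue!=""}'
      resources:
        overrides:
          kubernetes_namespace: {resource: "namespace"}
      name:
        as: "sidekiq_queue_latency_seconds"
      metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

An HPA in the same namespace can then reference `sidekiq_queue_latency_seconds` as a `type: External` metric, much like the datadog setup above.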

Web

We've actually been trying to figure out a good way to scale web pods ourselves for a while. At present we don't actually have an autoscaler set up in production.

However, there is an upcoming feature in Mastodon that allows exporting prometheus metrics from both the web and sidekiq pods' ruby processes. This will be officially available in v4.4, but the helm chart is already updated with the relevant configuration, and we are actively testing this in production with nightly builds to see if we can use these metrics to autoscale. Of particular note is the ruby_puma_request_backlog metric, which seems the most promising to scale off of (i.e. if the request backlog is growing, we have insufficient pods and need to scale up). But this requires more testing to know for sure.
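Once `ruby_puma_request_backlog` is available through a custom metrics API, the web HPA could consume it as a per-pod metric. A sketch under stated assumptions (Deployment name and the backlog target are illustrative, and this is untested, per the above):

```yaml
# Sketch: HPA scaling web pods on the per-pod Puma request backlog.
# Assumes a custom metrics API serves ruby_puma_request_backlog;
# the Deployment name and target of 5 queued requests are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mastodon-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mastodon-web   # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: ruby_puma_request_backlog
        target:
          type: AverageValue
          averageValue: "5"   # target average backlog per pod
```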

I'll leave this issue open as we experiment, and will update in the future when we have more info~
