Scaling #85

Open
WyriHaximus opened this issue Aug 29, 2023 · 3 comments
@WyriHaximus (Contributor)

How is everyone scaling this chart? I'm looking for ways to scale up when the webserver gets busier, but also when there are a ton of items in the queue and my instance needs more workers to process them.

@abbottmg (Contributor)

Right now I have a manually written autoscaler that watches CPU on the web container. For sidekiq I made an entry in values for each queue, so each queue gets its own ReplicaSet, with default pod count set in values. I then keep a loose eye out for traffic jams and manually scale. That mostly only matters if I catch a burst of dead jobs and retry them all.
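For reference, the CPU-watching part can also be expressed natively as a HorizontalPodAutoscaler instead of a hand-rolled autoscaler. A minimal sketch; the Deployment name `mastodon-web` and the thresholds are illustrative assumptions, not chart defaults:

```yaml
# Sketch: scale the web Deployment on average CPU utilization.
# Names and numbers are assumptions; adjust to your release/values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mastodon-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mastodon-web        # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up above ~70% average CPU
```

This requires the metrics-server (or equivalent) to be installed so the `Resource` metric type has data.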

I am interested in learning whether kubernetes supports custom metrics for autoscaling and if so how I could publish some sidekiq metrics to use there.

@WyriHaximus (Contributor, Author)

> Right now I have a manually written autoscaler that watches CPU on the web container. For sidekiq I made an entry in values for each queue, so each queue gets its own ReplicaSet, with default pod count set in values. I then keep a loose eye out for traffic jams and manually scale. That mostly only matters if I catch a burst of dead jobs and retry them all.

Same, but watching CPU on sidekiq instead since I'm the only user.

> I am interested in learning whether kubernetes supports custom metrics for autoscaling and if so how I could publish some sidekiq metrics to use there.

Kubernetes supports custom metrics, but you might want to have a look at keda.sh as it takes that to a whole new level.
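Since Sidekiq queues are Redis lists, KEDA's `redis` scaler can scale a worker Deployment straight off queue length, with no metrics pipeline in between. A sketch under assumed names (`mastodon-sidekiq-default` Deployment, `redis-master` Service, `queue:default` key; the threshold is illustrative):

```yaml
# Sketch: KEDA ScaledObject scaling Sidekiq workers on Redis list length.
# All names and the 100-item threshold are assumptions for illustration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sidekiq-default
spec:
  scaleTargetRef:
    name: mastodon-sidekiq-default   # hypothetical worker Deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: redis
      metadata:
        address: redis-master:6379   # hypothetical Redis endpoint
        listName: "queue:default"    # Sidekiq stores each queue as a Redis list
        listLength: "100"            # target items per replica
```

KEDA creates and manages the underlying HPA for you, which is why it can feel like "a whole new level" compared to wiring up custom metrics by hand.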

@timetinytim (Contributor)

I can share a bit about what we're doing in production for mastodon.social and mastodon.online.

Sidekiq

In our clusters, we have deployed a prometheus exporter that exports sidekiq queue statistics, and we ingest those into datadog. We have the datadog operator installed on the cluster, which means we can set up a HorizontalPodAutoscaler that scales based on a datadog metric, in this case sidekiq queue latency. If the latency rises past a certain point, we can assume jobs are arriving faster than they can be processed, so we scale up the pods.
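Wired up as an HPA, that pattern uses the `External` metric type served by the Datadog Cluster Agent. A sketch, not our exact config; the Deployment name, metric name, and latency target are all illustrative:

```yaml
# Sketch: HPA scaling Sidekiq workers on an external (Datadog-served) metric.
# Metric/Deployment names and the 30s latency target are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mastodon-sidekiq-default   # hypothetical worker Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: sidekiq.queue.latency   # hypothetical Datadog metric name
          selector:
            matchLabels:
              queue: default
        target:
          type: Value
          value: "30"   # scale up once queue latency exceeds ~30s
```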

I haven't done this myself, but I know there's a prometheus operator available for kubernetes. You should be able to set up a custom APIService to grab metrics from the operator, and set up an HPA off of those in a similar fashion.
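If the prometheus-adapter is what backs that custom APIService, the glue is a rule in its configuration that exposes a scraped series as an external metric the HPA can query. A sketch; the series name depends entirely on which sidekiq exporter you run:

```yaml
# Sketch: prometheus-adapter rule exposing a sidekiq latency series
# as an external metric. The series name is an assumption about the exporter.
rules:
  external:
    - seriesQuery: 'sidekiq_queue_latency_seconds{queue!=""}'
      resources:
        overrides:
          kubernetes_namespace: {resource: "namespace"}
      name:
        as: "sidekiq_queue_latency_seconds"
      metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

An HPA in the same namespace can then reference `sidekiq_queue_latency_seconds` as a `type: External` metric, much like the datadog setup above.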

Web

We've actually been trying to figure out a good way to scale web pods ourselves for a while. At present we don't actually have an autoscaler set up in production.

However, there is an upcoming feature in Mastodon that allows exporting prometheus metrics from both the web and sidekiq pods' ruby processes. This will be officially available in v4.4, but the helm chart is already updated with the relevant configuration, and we are actively testing this in production with nightly builds to see if we can use these metrics to autoscale. Of particular note is the ruby_puma_request_backlog metric, which seems the most promising to scale off of (i.e. if the request backlog is growing, we have insufficient pods and need to scale up). But this requires more testing to know for sure.
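Once `ruby_puma_request_backlog` is available through a custom metrics API, the web HPA could consume it as a per-pod metric. A sketch under stated assumptions (Deployment name and the backlog target are illustrative, and this is untested, per the above):

```yaml
# Sketch: HPA scaling web pods on the per-pod Puma request backlog.
# Assumes a custom metrics API serves ruby_puma_request_backlog;
# the Deployment name and target of 5 queued requests are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mastodon-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mastodon-web   # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Pods
      pods:
        metric:
          name: ruby_puma_request_backlog
        target:
          type: AverageValue
          averageValue: "5"   # target average backlog per pod
```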

I'll leave this issue open as we experiment, and will update in the future when we have more info~
