KEDA Kafka-Scaler not scaling out anymore if close to max replica count #4791
Comments
Hi!

Very interesting: I tried setting […], did another run without that option, and monitored […]. And as you already mentioned, this makes sense looking at the following lines: https://github.com/kedacore/keda/blob/main/pkg/scalers/kafka_scaler.go#L660. So, if […], the resulting ratio can land within the HPA's tolerance. For example: 22 current replicas * (24k actual lag / 22k desired lag) = 1.09, which is below the HPA tolerance, so no scale-up to 24 replicas happens.

So, if this is the way it's supposed to work, I guess we can close this ticket. Personally speaking, this feels confusing, because I don't want to […]. I know my scenario is probably a very rare edge case; nevertheless, I feel like a hint in the docs about this potential problem with the lag limiting and HPA tolerance could be beneficial.
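The arithmetic above can be sketched as code. This is a simplified model of the HPA decision described in the comment, not KEDA or Kubernetes source; the function name and the hard-coded 0.1 tolerance (the default `--horizontal-pod-autoscaler-tolerance`) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas models the HPA scaling decision sketched in the comment
// above: if the metric/target ratio is within 10% of 1.0, the HPA leaves
// the replica count alone.
func desiredReplicas(current int32, metricValue, metricTarget float64) int32 {
	ratio := metricValue / metricTarget
	if math.Abs(1.0-ratio) <= 0.1 { // default HPA tolerance (assumption: 0.1)
		return current // within tolerance: no scaling
	}
	return int32(math.Ceil(float64(current) * ratio))
}

func main() {
	// 22 replicas, lag capped at 24 partitions * 1000 = 24000,
	// desired total lag = 22 replicas * 1000 = 22000; ratio ~= 1.09.
	fmt.Println(desiredReplicas(22, 24000, 22000)) // 22: stuck below the max
}
```

With the capped lag of 24000 the ratio is 24/22 ≈ 1.09, inside the 10% band, so the scaler never reaches the 24th replica.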
Maybe we can improve the formula somehow to enforce that the difference is always above the tolerance, but I'm not sure if it's doable. Maybe calculating the tolerance somehow 🤔
I have an idea for how this could be solved, and I'm interested in implementing it if there is consensus that the new approach is worthwhile.

Advantages: […]

Disadvantages: […]
My idea is that we could create an interface which scalers could choose to implement, allowing a scaler to provide a […].

Lines 293 to 295 in 9632275 would be changed to ask each trigger's scaler whether its trigger has a maxReplicaCount, and if so, the HPA would be given the lowest value provided (including the configured value from the ScaledObject).
So, for example, imagine a ScaledObject with […]. While it is a little confusing, I find the current method even more confusing, since the actual lag value gets masked when it exceeds partitions * lag threshold. Not to mention the confusion of scenarios like the one reported here, where it won't scale to the number of partitions without some hackery.

(Edited to fix one spot where I said […].)
I think that we will have to wait until the end of the summer because we are partially (or totally) signed off. We can discuss it at the next community standup in the worst case. FYI @kedacore/keda-core-contributors
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. |
Report
I have noticed in various situations that if the KEDA Kafka scaler is already close to the maximum replica count, it seems to stop scaling. It does not scale up to the maximum replica count, even though the lag is way above the target.
Expected Behavior
Given 1 topic, 24 partitions, 24 max replicas, and a lag threshold of 1000 messages: maybe I'm missing something, but shouldn't the scaler hit the maximum of 24 replicas when the lag is at ~55k, given a threshold of 1k messages and 21 existing replicas?
While testing with other scaling mechanisms (e.g. CPU only), I already reached the limit of 24 replicas, and the cluster has > 60% of its resources left, just to rule that out as a possible problem as well.
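The expectation above, written out as arithmetic (a sketch, not KEDA code; the function name is an illustration): with a per-replica threshold of 1000 messages and ~55k total lag, the raw desired replica count is ceil(55000/1000) = 55, which should be clamped to the configured maximum of 24.

```go
package main

import "fmt"

// expectedReplicas computes the naive replica count the reporter expects:
// total lag divided by the per-replica threshold (rounded up), clamped to
// the configured maximum.
func expectedReplicas(totalLag, lagThreshold, maxReplicas int64) int64 {
	desired := (totalLag + lagThreshold - 1) / lagThreshold // integer ceil
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	fmt.Println(expectedReplicas(55000, 1000, 24)) // 24: clamped at the max
}
```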
Actual Behavior
See this example:
At first, the scaler seems to work properly. Around 13:25 the rate of incoming messages increases and lag starts to build up, so the scaler scales up and appears to work correctly. At 13:35, lag again starts to rise and builds up to roughly 55k without any scaling activity, although the scaler should scale up to the maximum of 24 consumer replicas. It stayed at 21 consumers until I killed my producer and the lag decreased towards 0 again.
Looking at the logs, it seems like the autoscaler just stopped working at some point. During the phase of not scaling up anymore, the only logs I could observe were like these:
So to me, there are no obvious errors but also no scaling activity takes place.
In contrast, during a phase where the scaler seems to work properly, I also observe logs like this:
But in the phase of not scaling up anymore, there were none of these "Reconciling" logs at all.
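The silent stall is consistent with the lag capping discussed in the comments above. A simplified sketch of that behaviour (not the actual KEDA source): the metric handed to the HPA never exceeds partitions * lagThreshold, so the HPA never sees the full ~55k lag.

```go
package main

import "fmt"

// reportedLag models the capping the comments describe: the Kafka scaler
// limits the reported lag to partitions * lagThreshold, since more
// replicas than partitions would sit idle anyway.
func reportedLag(totalLag, partitions, lagThreshold int64) int64 {
	maxLag := partitions * lagThreshold
	if totalLag > maxLag {
		return maxLag
	}
	return totalLag
}

func main() {
	// Actual lag ~55k, but the HPA only ever sees 24 * 1000 = 24000.
	fmt.Println(reportedLag(55000, 24, 1000)) // 24000
}
```

With 21 replicas the target total lag is 21000, so the capped value of 24000 yields a ratio of ~1.14 initially, but at 22 replicas the ratio drops inside the HPA's tolerance band and scaling stops without any error being logged.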
Steps to Reproduce the Problem
Suppose the following setup:
1 topic, 24 partitions, a maximum of 24 replicas, a lag threshold of 1000 messages
My scaled object is configured like this:
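The original YAML was not captured in this transcript. A minimal sketch consistent with the parameters above (all names, the bootstrap server, and the consumer group are placeholders) might look like:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-consumer               # placeholder name
spec:
  scaleTargetRef:
    name: example-consumer-deployment  # placeholder deployment
  minReplicaCount: 1
  maxReplicaCount: 24                  # one replica per partition
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092   # placeholder
        consumerGroup: example-group   # placeholder
        topic: example-topic           # placeholder, 24 partitions
        lagThreshold: "1000"
```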
If necessary, I can provide all other charts for this setup.
Logs from KEDA operator
No response
KEDA Version
2.11.1
Kubernetes Version
1.26
Platform
Google Cloud
Scaler Details
Kafka
Anything else?
No response