KEDA Unable to Retrieve Correct Kafka Metrics from ScaledObject on GKE #5730
Could you change the log level to debug in the operator and send the operator logs?
In addition to the log, could you provide the
Sorry for the delay. I still have this issue and created a new GCP/GKE cluster specifically to debug it. I was able to reproduce the issue and captured the controller logs at the exact moment the current metric value switched in the HPA:
The ScaledObject:
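The YAML itself was not preserved in this thread; a minimal sketch of a Kafka-triggered ScaledObject of this shape (all names, the broker address, and the lag threshold are hypothetical):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-scaledobject          # hypothetical name
  namespace: my-namespace
spec:
  scaleTargetRef:
    name: my-workload            # the Deployment/ScaleTarget being scaled
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker:9092   # hypothetical broker address
        consumerGroup: my-consumer-group
        topic: my-topic
        lagThreshold: "50"
```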
Found people with a similar issue: https://kubernetes.slack.com/archives/CKZJ36A5D/p1709761505122509
@SpiritZhou @dttung2905 Yesterday I created an EKS cluster with the same setup/versions as on GCP, and it works perfectly. Can you think of anything that could differ between GCP and AWS, any kind of blocker or anything else that could be causing the issue? I have also followed the k8s events and couldn't find anything bad in GCP.
I did some more tests and I believe the Kafka connection is fine: I was able to produce and consume messages from inside a pod using the Go Sarama library (the same library the Knative Kafka extension uses). I created a Debian pod in Kubernetes/GCP/GKE, attached to it, and ran a check like the sketch below:
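The exact code was not captured in this thread; a minimal sketch of that kind of connectivity check with Sarama, assuming a hypothetical broker address and topic (and the IBM/sarama import path, the maintained fork of the library):

```go
package main

import (
	"fmt"
	"log"

	"github.com/IBM/sarama"
)

func main() {
	brokers := []string{"kafka-broker:9092"} // hypothetical broker address
	topic := "test-topic"                    // hypothetical topic

	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // required by SyncProducer

	// Produce one message to verify the broker is reachable for writes.
	producer, err := sarama.NewSyncProducer(brokers, cfg)
	if err != nil {
		log.Fatalf("producer: %v", err)
	}
	defer producer.Close()
	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: topic,
		Value: sarama.StringEncoder("connectivity check"),
	})
	if err != nil {
		log.Fatalf("send: %v", err)
	}
	fmt.Printf("produced to partition %d at offset %d\n", partition, offset)

	// Consume the same message back to verify reads work too.
	consumer, err := sarama.NewConsumer(brokers, cfg)
	if err != nil {
		log.Fatalf("consumer: %v", err)
	}
	defer consumer.Close()
	pc, err := consumer.ConsumePartition(topic, partition, offset)
	if err != nil {
		log.Fatalf("consume: %v", err)
	}
	defer pc.Close()
	msg := <-pc.Messages()
	fmt.Printf("consumed: %s\n", msg.Value)
}
```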
Another thing I noticed (not sure if it's relevant) is that the metric is not listed here:
but I can query it, as shown above.
You should try to query the specific metric for the ScaledObject; see the examples here: https://keda.sh/docs/2.14/operate/metrics-server/#querying-metrics-exposed-by-keda-metrics-server
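Following that documentation page, the two queries look roughly like this (namespace, metric, and ScaledObject names are placeholders):

```sh
# List all external metrics exposed by the KEDA metrics server
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"

# Query the specific metric for one ScaledObject
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/my-namespace/s0-kafka-my-topic?labelSelector=scaledobject.keda.sh%2Fname%3Dmy-scaledobject"
```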
Thanks @zroubalik, I'm able to query the metric, and I can also see the metric value via OpenTelemetry/Datadog. The metric value is correct (the expected value), not the unexpected one.
This code:

```go
fmt.Println("+Inf", int64(math.Inf(1)))
fmt.Println("-Inf", int64(math.Inf(-1)))
fmt.Println("NaN", int64(math.NaN()))
```

prints

```
+Inf 9223372036854775807
-Inf -9223372036854775808
NaN 0
```

on Apple M1. On the same Apple M1, but when compiling with
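For reference, a self-contained version of the snippet above. The Go spec leaves the result of converting an out-of-range floating-point value (including ±Inf and NaN) to an integer implementation-dependent, which is why the output differs across architectures and why a value like -9223372036854775808 can surface as a metric:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Per the Go spec, converting a float that does not fit the target
	// integer type yields an implementation-dependent value, so these
	// three lines print different results on arm64 vs amd64.
	fmt.Println("+Inf", int64(math.Inf(1)))
	fmt.Println("-Inf", int64(math.Inf(-1)))
	fmt.Println("NaN", int64(math.NaN()))
}
```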
There are places in the HPA controller where such a conversion happens. One case I was investigating is using KEDA to target a custom resource with a given TargetType. On EKS, once we fixed status.replicas on the target resource, things started to work correctly for us. On GKE, we still see the issue.
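For context on the status.replicas fix mentioned above: for a custom resource target, the HPA reads replica counts through the scale subresource declared on the CRD, so the status path must point at a field the resource's controller actually keeps current. A sketch (paths shown are the conventional ones, not taken from this thread):

```yaml
# Excerpt from a CustomResourceDefinition (spec.versions[].subresources):
# the HPA resolves the current replica count via statusReplicasPath.
subresources:
  scale:
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas
```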
I've followed up on this issue. We've seen that the actual calculation of the scaler is correct; however, the HPA displays the values wrong when queried. @pstibrany As you pointed out, the issue seems to be how the usage is calculated. I still have pending to try […].

Looking at the source code of the method, it turns out that even if the number of replicas is 0, the replicas calculation is correct, because in the end the […]. Knowing this is the root cause granted me peace of mind, thank you!

Now the remaining question is why […]. That's assuming that GCP did not deploy their own custom version of the autoscaler that introduced something else that could be messing things up.
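For reference, the replicas calculation the HPA documents (the published formula, not this commenter's code) is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue); a minimal sketch:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the scaling rule documented for the HPA:
// desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
func desiredReplicas(currentReplicas int32, currentValue, targetValue float64) int32 {
	return int32(math.Ceil(float64(currentReplicas) * currentValue / targetValue))
}

func main() {
	// Example: 2 replicas, observed lag 1500, target lag per replica 500 -> 6
	fmt.Println(desiredReplicas(2, 1500, 500))
}
```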
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
I think this issue might be worth documenting.
I opened a case for this / my related discussion #6375.

Case 55415916: HPA scaler has huge delay when coming from 0 replica deployment

Let's see if this gets us somewhere...
@MaxWinterstein did a great job in opening the case and actually getting Google people on it 🚀 😅 He prepared a very good and easy-to-understand reproducer of the problem and was kind enough to allow me to join the conversation with Google 🥇 I am attaching some more details I was able to find out about this. It is reported to Google, so now we have to wait and see whether they are going to do something about it 🤔

I've confirmed that the issue isn't related to KEDA or the metrics delivery pipeline between the Kubernetes API server (via the external.metrics endpoint) and the HPA controller. Instead, the problem appears to originate within the HPA controller itself. Specifically, it takes an unusually long time to recognize changes in the target workload's replica count during transitions between zero and one replica (both 0->1 and 1->0). This is the root of the problem.

The HPA controller's documented behavior is to ignore scaling if the workload's replicas fall below the defined minimum. In this case, the HPA minimum is set to one, while the workload is temporarily scaled down to zero. This is expected and works correctly. Once the workload scales back up to one replica (as triggered by KEDA), the HPA should resume normal scaling operations. However, the controller does not immediately detect that the workload has reached the minimum replica count, delaying further scaling actions.

While this recognition typically occurs within seconds (~10s) in non-GKE environments, on GKE clusters it can take several minutes before the HPA reflects the correct replica count. The attached screenshots illustrate this behavior: the workload clearly shows one replica, yet the HPA still reports zero. Similarly, when scaling down from multiple replicas to zero, the workload is at zero replicas, but the HPA continues to report the old (5) replica count. Once the HPA controller gets the correct number of replicas, the scaling performs as expected.
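One way to observe the discrepancy described above (workload and ScaledObject names are placeholders; KEDA names its HPAs keda-hpa-&lt;scaledobject-name&gt; by default):

```sh
# Watch the workload's actual replica count...
kubectl get deployment my-workload -w

# ...and, in another terminal, what the HPA believes it is.
# On affected GKE clusters the HPA's replica count lags minutes behind.
kubectl get hpa keda-hpa-my-scaledobject -w
```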
Report
KEDA is unable to retrieve metrics correctly from a ScaledObject/ScaleTarget using a Kafka trigger when deployed to a GKE cluster (it works locally).
Expected Behavior
When HPA calculates the current metric value, it should not return `-9223372036854775808m`, but a valid Kafka lag.

Actual Behavior
When the Kafka ScaledObject is deployed to GKE:
Steps to Reproduce the Problem
Logs from KEDA operator
There is no error or warning in the KEDA operator.
KEDA Version
2.13.1
Kubernetes Version
1.27
Platform
Google Cloud
Scaler Details
Kafka
Anything else?
No response