KEDA Unable to Retrieve correct Kafka Metrics from ScaledObject on GKE #5730

converge · 2024-04-23T13:11:13Z

Report

KEDA is unable to retrieve metrics correctly from a ScaledObject/ScaleTarget using a Kafka trigger when deployed to a GKE cluster (It works locally)

Expected Behavior

When HPA calculates the current metric value, it should not return -9223372036854775808m, but a valid Kafka lag.

Actual Behavior

When the Kafka ScaledObject is deployed to GKE:

The HorizontalPodAutoscaler reports a negative metric value (9223372036854775808m)

Steps to Reproduce the Problem

deploy Keda in a GKE cluster
create a ScaledObject with Kafka Source as Scaled Target
kubectl get hpa -A (will show the metric with the negative value)

Logs from KEDA operator

There is no error or warning in the Keda operator.

KEDA Version

2.13.1

Kubernetes Version

1.27

Platform

Google Cloud

Scaler Details

Kafka

Anything else?

No response

The text was updated successfully, but these errors were encountered:

SpiritZhou · 2024-04-24T05:08:29Z

Could you change the log-level to debug in operator and send the operator logs?

dttung2905 · 2024-05-06T21:46:20Z

In addition to the log, could you provide the ScaledObject yaml config too ?

converge · 2024-06-05T19:56:40Z

sorry for the delay, I still have this issue and created a new GCP/GKE cluster specially to debug it. I was able to reproduce the issue and got the logs from the exact moment where the current metric value switch in the HPA from <unknown> to -9223372036854775808m.

the logs from the controller when this switch happened:

2024-06-05T19:43:42Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:43:42Z	DEBUG	scale_handler	Getting metrics and activity from scaler	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:43:42Z","value":"1"}], "activity": true, "scalerError": null}

2024-06-05T19:43:42Z	DEBUG	scale_handler	Scaler for scaledObject is active	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:43:42Z	INFO	scaleexecutor	Successfully updated ScaleTarget	{"scaledobject.Name": "kafka-scaledobject", "scaledObject.Namespace": "app", "scaleTarget.Name": "kafka-source", "Original Replicas Count": 0, "New Replicas Count": 1}

2024-06-05T19:43:42Z	DEBUG	events	Scaled sources.knative.dev/v1beta1.kafkasource app/kafka-source from 0 to 1	{"type": "Normal", "object": {"kind":"ScaledObject","namespace":"app","name":"kafka-scaledobject","uid":"01537a1d-1543-48ea-adce-e7c3069fff86","apiVersion":"keda.sh/v1alpha1","resourceVersion":"314645"}, "reason": "KEDAScaleTargetActivated"}
2024-06-05T19:44:12Z	DEBUG	kafka_scaler	with topic name [knative-demo-topic] the list of topic metadata is [0xc000a1b400]	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:12Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:12Z	DEBUG	scale_handler	Getting metrics and activity from scaler	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:12Z","value":"1"}], "activity": true, "scalerError": null}
2024-06-05T19:44:12Z	DEBUG	scale_handler	Scaler for scaledObject is active	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:44:42Z	DEBUG	kafka_scaler	with topic name [knative-demo-topic] the list of topic metadata is [0xc000ae9400]	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:42Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:42Z	DEBUG	scale_handler	Getting metrics and activity from scaler	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:42Z","value":"1"}], "activity": true, "scalerError": null}
2024-06-05T19:44:42Z	DEBUG	scale_handler	Scaler for scaledObject is active	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:44:52Z	DEBUG	kafka_scaler	with topic name [knative-demo-topic] the list of topic metadata is [0xc000460050]	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:52Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}

2024-06-05T19:44:52Z	DEBUG	scale_handler	Getting metrics from trigger	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "trigger": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:52Z","value":"1"}], "scalerError": null}
2024-06-05T19:44:52Z	DEBUG	grpc_server	Providing metrics	{"scaledObjectName": "kafka-scaledobject", "scaledObjectNamespace": "app", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-knative-group-topics,MetricLabels:map[string]string{},Timestamp:2024-06-05 19:44:52.805581005 +0000 UTC m=+3545.120570672,WindowSeconds:nil,Value:{{1000 -3} {<nil>}  DecimalSI},},},}"}
2024-06-05T19:45:08Z	DEBUG	kafka_scaler	with topic name [knative-demo-topic] the list of topic metadata is [0xc000550c30]	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:45:08Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:45:08Z	DEBUG	scale_handler	Getting metrics from trigger	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "trigger": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:45:08Z","value":"1"}], "scalerError": null}
2024-06-05T19:45:08Z	DEBUG	grpc_server	Providing metrics	{"scaledObjectName": "kafka-scaledobject", "scaledObjectNamespace": "app", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-knative-group-topics,MetricLabels:map[string]string{},Timestamp:2024-06-05 19:45:08.16711765 +0000 UTC m=+3560.482107310,WindowSeconds:nil,Value:{{1000 -3} {<nil>}  DecimalSI},},},}"}

the scaled object:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject
  namespace: app
  annotations:
    scaledobject.keda.sh/transfer-hpa-ownership: "true"
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      name: notify-friends
  scaleTargetRef:
    apiVersion: sources.knative.dev/v1beta1
    name: kafka-source
    kind: kafkasource
  pollingInterval: 30
  minReplicaCount: 1
  maxReplicaCount: 11
  idleReplicaCount: 0
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: pkc-<...>.gcp.confluent.cloud:9092
        consumerGroup: knative-group
        topic:
        lagThreshold: '60'
        offsetResetPolicy: earliest
        scaleToZeroOnInvalidOffset: "false"
        tls: enable
        sasl: plaintext
      authenticationRef:
        name: keda-kafka-credentials

converge · 2024-06-05T21:04:44Z

found people with similar issue https://kubernetes.slack.com/archives/CKZJ36A5D/p1709761505122509

converge · 2024-06-07T13:46:11Z

@SpiritZhou @dttung2905 yesterday I created an EKS cluster with the same GCP setup/versions, and it works perfectly. Can you think of anything that could be different for GCP and AWS? any kind of blocker or anything that could be causing the issue?

I have also followed the k8s events and couldn't find anything bad in GCP.
Wanted to look for horizontal pod autoscaler logs, but couldn't find them.

converge · 2024-06-13T12:40:24Z

I did some more tests, I believe kafka connection is ok, I was able to produce and consume messages inside the pod using Go Sarama library (the same knative kafka extension library).

I created a debian pod in Kubernetes/GCP/GKE, attached to it and:

./kafka-consumer-groups.sh --bootstrap-server pkc-<...>.<...>.gcp.confluent.cloud:9092 --command-config config.properties --describe --group testing-group

Consumer group 'testing-group' has no active members.

# no messages produced

GROUP           TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
testing-group   knative-demo-topic 1          1               1               0               -               -               -
testing-group   knative-demo-topic 8          1               1               0               -               -               -
testing-group   knative-demo-topic 22         15              15              0               -               -               -
testing-group   knative-demo-topic 29         1               1               0               -               -               -

# no messages consumed
GROUP           TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
testing-group   knative-demo-topic 1          1               30              29              -               -               -
testing-group   knative-demo-topic 8          1               36              35              -               -               -
testing-group   knative-demo-topic 22         15              54              39              -               -               -
testing-group   knative-demo-topic 29         1               38              37              -               -               -

# after messages being consumed
GROUP           TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
testing-group   knative-demo-topic 1          30              30              0               -               -               -
testing-group   knative-demo-topic 8          36              36              0               -               -               -
testing-group   knative-demo-topic 22         54              54              0               -               -               -
testing-group   knative-demo-topic 29         38              38              0               -               -               -

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/app/s0-kafka-knative-group-topics?labelSelector=scaledobject.keda.sh%2Fname%3Dkafka-scaledobject"
{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-13T12:26:02Z",
---> "value":"1009"}]} <---

 ~/tmp/sandbox  🏖️ sandbox  k get hpa -n app                                                                                                                                                
NAME                  REFERENCE                                   TARGETS                          MINPODS   MAXPODS      REPLICAS   AGE
notify-friends        kafkasource/kafka-source                    -9223372036854775808m/60 (avg)   1         11           11         3m1s

another thing I noticed/not sure if relevant is that the metric is not listed here:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq                                                                                                  
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "externalmetrics",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

but I can query it, as it shows above.

zroubalik · 2024-06-25T20:44:56Z

You should try to query the specific metric for the ScaledObject, see the examples down below: https://keda.sh/docs/2.14/operate/metrics-server/#querying-metrics-exposed-by-keda-metrics-server

converge · 2024-06-26T11:52:32Z

thanks @zroubalik , Im able to query the metric, and I can also see the metric value via open telemetry/datadog, the metric value is correct/the expected value, and not the unexpected -9223372036854775808m value.

pstibrany · 2024-07-24T11:33:28Z

-9223372036854775808m value is a result of converting +Inf to int64.

This code:

	fmt.Println("+Inf", int64(math.Inf(1)))
	fmt.Println("-Inf", int64(math.Inf(-1)))
	fmt.Println("NaN", int64(math.NaN()))

prints

+Inf 9223372036854775807
-Inf -9223372036854775808
NaN 0

on Apple M1. On same Apple M1, but when compiling with GOARCH=amd64, it prints:

+Inf 9223372036854775807
-Inf -9223372036854775808
NaN -9223372036854775808

pstibrany · 2024-07-24T11:41:19Z

There are places in HPA controller where such conversion is happening. One case that I was investigating is when using KEDA to target custom resource, with TargetType AverageValue. In such case, the following method is responsible for computing usage:

https://github.com/kubernetes/kubernetes/blob/7f805dc9354b5209136d72359e3c019885135ce7/pkg/controller/podautoscaler/replica_calculator.go#L351C1-L376C2

If statusReplicas (read from /status subresource of target resource) is 0, then computed usage will be wrong (-9223372036854775808m): https://github.com/kubernetes/kubernetes/blob/a145f1508d2180c93ba292849f0bf4bf6120fae8/pkg/controller/podautoscaler/replica_calculator.go#L374 (notice conversion from float64 to int64 here)

This usage value is then stored into HPA's metric status: https://github.com/kubernetes/kubernetes/blob/985ef265511370655452157fb5c583d5bf7a45e4/pkg/controller/podautoscaler/horizontal.go#L687-L703

On EKS, when we fixed status.replicas on the target resource, things started to work correctly for us. On GKE, we still see -9223372036854775808m value, and don't understand why.

Zurvarian · 2024-09-06T09:46:08Z

I've followed up on this issue.

We've seen the actual calculation of the scaler to be correct. However HPA is displaying the values wrong when queried.

@pstibrany As you pointed out the issue seems to be how the usage is calculated. I still have pending to try to use Value instead of AverageValue as the former won't do any calculation with the metric gotten from the Kafka Cluster. If then the value is displayed correctly we would have found the reason.

Looking at the source code in the method it turns out that even if number of replicas is 0, the replicas calculation is correct because in the end the statusReplicas is not use to calculate the replicas only to prevent increasing the replica count if the new value grew below the toleration.
And as toleration is calculated by 1-(-9223372036854775808) the result is a positive number that is always over the toleration threshold 🤣.

Knowing this is the root cause granted me peace of mind, thank you!

Now the remaining question is why scale.Status.Replicas is 0 in GCP, but that's another path to investigate.

That's assuming that GCP did not deploy their own custom version of the Autoscaler and that introduced something else that could be messing around.

stale · 2024-11-07T02:06:58Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

zroubalik · 2024-11-11T19:51:08Z

I think this issue might be worth documenting

MaxWinterstein · 2024-11-29T14:54:15Z

I opened a case for this / my related discussion #6375

Case 55415916: HPA scaler has huge delay when coming from 0 replica deployment

Let's see if this get us somewhere...

zroubalik · 2024-12-18T17:47:57Z

@MaxWinterstein did a great job in opening the case and actually getting Google people on it 🚀 😅 He preared a very good and easy to understand reproducer of the problem and was that kind to allow me to join the conversation with Google 🥇
Kudos to him!

I am attaching some more details I was able to find out about this, it is reported to Google, so now we have to wait if they are gonna do something about it 🤔

I’ve confirmed that the issue isn’t related to KEDA or the metrics delivery pipeline between the Kubernetes API server (via the external.metrics endpoint) and the HPA controller.

Instead, the problem appears to originate within the HPA controller itself. Specifically, it takes an unusually long time to recognize changes in the target workload’s replica count during transitions between zero and one replica (both 0->1 and 1->0). This is the root of the problem.

The HPA controller’s documented behavior is to ignore scaling if the workload’s replicas fall below the defined minimum. In this case, the HPA minimum is set to one, while the workload is temporarily scaled down to zero. This is expected and works correctly. Once the workload scales back up to one replica (as triggered by KEDA), the HPA should resume normal scaling operations. However, the controller does not immediately detect that the workload has reached the minimum replica count, delaying further scaling actions. While this recognition typically occurs within seconds (~10s) in non-GKE environments, on GKE clusters it can take several minutes before the HPA reflects the correct replica count.

The attached screenshots illustrate this behavior, the workload clearly shows one replica, yet the HPA still reports zero.

Similarly, when scaling down from multiple replicas to zero, the workload is at zero replicas, but the HPA continues to report the old (5) replica count.

Once the HPA controller get the correct number of replicas the scaling performs as expected.

converge added the bug Something isn't working label Apr 23, 2024

JoelDimbernat mentioned this issue Jun 25, 2024

GCP Pub/Sub Scaler reports negative metric values #5774

Closed

stale bot added the stale All issues that are marked as stale due to inactivity label Nov 7, 2024

zroubalik removed the stale All issues that are marked as stale due to inactivity label Nov 11, 2024

MaxWinterstein mentioned this issue Nov 28, 2024

The scaled workload does not function properly when the minimum replica count is set to 0 #6374

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KEDA Unable to Retrieve correct Kafka Metrics from ScaledObject on GKE #5730

KEDA Unable to Retrieve correct Kafka Metrics from ScaledObject on GKE #5730

converge commented Apr 23, 2024

SpiritZhou commented Apr 24, 2024

dttung2905 commented May 6, 2024

converge commented Jun 5, 2024

converge commented Jun 5, 2024

converge commented Jun 7, 2024

converge commented Jun 13, 2024

zroubalik commented Jun 25, 2024

converge commented Jun 26, 2024

pstibrany commented Jul 24, 2024 •

edited

Loading

pstibrany commented Jul 24, 2024 •

edited

Loading

Zurvarian commented Sep 6, 2024

stale bot commented Nov 7, 2024

zroubalik commented Nov 11, 2024

MaxWinterstein commented Nov 29, 2024

zroubalik commented Dec 18, 2024 •

edited

Loading

KEDA Unable to Retrieve correct Kafka Metrics from ScaledObject on GKE #5730

KEDA Unable to Retrieve correct Kafka Metrics from ScaledObject on GKE #5730

Comments

converge commented Apr 23, 2024

Report

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Logs from KEDA operator

KEDA Version

Kubernetes Version

Platform

Scaler Details

Anything else?

SpiritZhou commented Apr 24, 2024

dttung2905 commented May 6, 2024

converge commented Jun 5, 2024

converge commented Jun 5, 2024

converge commented Jun 7, 2024

converge commented Jun 13, 2024

zroubalik commented Jun 25, 2024

converge commented Jun 26, 2024

pstibrany commented Jul 24, 2024 • edited Loading

pstibrany commented Jul 24, 2024 • edited Loading

Zurvarian commented Sep 6, 2024

stale bot commented Nov 7, 2024

zroubalik commented Nov 11, 2024

MaxWinterstein commented Nov 29, 2024

zroubalik commented Dec 18, 2024 • edited Loading

pstibrany commented Jul 24, 2024 •

edited

Loading

pstibrany commented Jul 24, 2024 •

edited

Loading

zroubalik commented Dec 18, 2024 •

edited

Loading