
KEDA Kafka-Scaler not scaling out anymore if close to max replica count #4791

Closed
jnsrnhld opened this issue Jul 13, 2023 · 10 comments
Labels
bug (Something isn't working) · help wanted (Looking for support from community) · stale (All issues that are marked as stale due to inactivity)

Comments

@jnsrnhld

jnsrnhld commented Jul 13, 2023

Report

I have noticed in various situations that once the KEDA Kafka scaler is already close to the maximum replica count, it seems to stop working: it no longer scales up to the maximum replica count, even though the lag is far above the target.

Expected Behavior

Given 1 topic, 24 partitions, a maximum of 24 replicas and a lag threshold of 1000 messages: maybe I'm missing something, but shouldn't the scaler reach the maximum of 24 replicas when the lag is at ~55k, given a threshold of 1k messages and 21 existing replicas?
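
For reference, a minimal sketch of the calculation I would expect to apply here (this just restates the HPA's standard AverageValue formula with the numbers from my setup; it is not KEDA code):

package main

import (
	"fmt"
	"math"
)

// expectedReplicas applies the plain HPA AverageValue formula:
// desiredReplicas = ceil(totalLag / lagThreshold), capped at maxReplicaCount.
func expectedReplicas(totalLag, lagThreshold, maxReplicaCount int64) int64 {
	desired := int64(math.Ceil(float64(totalLag) / float64(lagThreshold)))
	if desired > maxReplicaCount {
		return maxReplicaCount
	}
	return desired
}

func main() {
	// ~55k lag, 1k threshold, max 24 replicas -> 24 expected.
	fmt.Println(expectedReplicas(55000, 1000, 24))
}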

While testing with other scaling mechanisms (e.g. CPU only) I already reached the limit of 24 replicas, and the cluster still had > 60% of its resources left, just to rule out resource pressure as a possible problem as well.

Actual Behavior

See this example:

[screenshot: messages-in rate, consumer lag and consumer replica count over time]

At first, the scaler seems to work properly. Around 13:25 the messages-in rate is increased and lag starts to build up, so the scaler scales up and appears to work correctly. At 13:35, lag starts to rise again and builds up to roughly 55k without any scaling activity, although the scaler should scale up to the maximum of 24 consumer replicas. It stays at 21 consumers until I killed my producer and the lag decreased towards 0 again.

Looking at the logs, it seems like the autoscaler just stopped working at some point. During the phase of not scaling up anymore, the only logs I could observe were like these:

keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:06Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:48:06.360630599 +0000 UTC m=+4154.130163483,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:21Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 42319, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:21Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:48:21Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:21Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:48:21.414456631 +0000 UTC m=+4169.183989516,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:23Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 43084, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:23Z    DEBUG    scale_handler    Getting metrics and activity from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:48:23Z","value":"23k"}], "activity": true, "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:23Z    DEBUG    scale_handler    Scaler for scaledObject is active    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:36Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 42832, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:36Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:48:36Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:36Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:48:36.449366413 +0000 UTC m=+4184.218899296,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:51Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 43729, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:51Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:48:51Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:51Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:48:51.487414238 +0000 UTC m=+4199.256947124,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:53Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 43697, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:53Z    DEBUG    scale_handler    Getting metrics and activity from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:48:53Z","value":"23k"}], "activity": true, "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:48:53Z    DEBUG    scale_handler    Scaler for scaledObject is active    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:06Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 43734, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:06Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:49:06Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:06Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:49:06.515993357 +0000 UTC m=+4214.285526243,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:21Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 45153, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:21Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:49:21Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:21Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:49:21.560601767 +0000 UTC m=+4229.330134671,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:23Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 45741, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:23Z    DEBUG    scale_handler    Getting metrics and activity from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:49:23Z","value":"23k"}], "activity": true, "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:23Z    DEBUG    scale_handler    Scaler for scaledObject is active    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:36Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 45915, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:36Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:49:36Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:36Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:49:36.591286313 +0000 UTC m=+4244.360819197,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:51Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 47245, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:51Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:49:51Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:51Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:49:51.639204493 +0000 UTC m=+4259.408737377,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:53Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 47508, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:53Z    DEBUG    scale_handler    Getting metrics and activity from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:49:53Z","value":"23k"}], "activity": true, "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:49:53Z    DEBUG    scale_handler    Scaler for scaledObject is active    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:06Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 47756, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:06Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:50:06Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:06Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:50:06.670545366 +0000 UTC m=+4274.440078250,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:21Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 48376, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:21Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:50:21Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:21Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:50:21.730253149 +0000 UTC m=+4289.499786044,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:23Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 48535, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:23Z    DEBUG    scale_handler    Getting metrics and activity from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:50:23Z","value":"23k"}], "activity": true, "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:23Z    DEBUG    scale_handler    Scaler for scaledObject is active    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:36Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 50340, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:36Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:50:36Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:36Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:50:36.77189755 +0000 UTC m=+4304.541430437,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:51Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 52250, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:51Z    DEBUG    scale_handler    Getting metrics from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:50:51Z","value":"23k"}], "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:51Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:50:51.838407301 +0000 UTC m=+4319.607940185,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:53Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 52754, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:53Z    DEBUG    scale_handler    Getting metrics and activity from scaler    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1", "metrics": [{"metricName":"s0-kafka-topic1","metricLabels":null,"timestamp":"2023-07-13T11:50:53Z","value":"23k"}], "activity": true, "scalerError": null}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:53Z    DEBUG    scale_handler    Scaler for scaledObject is active    {"scaledObject.Namespace": "kafka", "scaledObject.Name": "consumerapp-scaled-object", "scaler": "kafkaScaler", "metricName": "s0-kafka-topic1"}
keda-operator-77f866d85b-l7hgz 2023-07-13T11:51:06Z    DEBUG    kafka_scaler    Kafka scaler: Providing metrics based on totalLag 54271, topicPartitions 1, threshold 1000    {"type": "ScaledObject", "namespace": "kafka", "name": "consumerapp-scaled-object"}

So to me there are no obvious errors, but no scaling activity takes place either.
In contrast, during a phase where the scaler works properly, I also observe logs like this:

keda-operator-77f866d85b-l7hgz 2023-07-13T12:03:54Z    INFO    Reconciling ScaledObject    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"consumerapp-scaled-object","namespace":"kafka"}, "namespace": "kafka", "name": "consumerapp- │

But during the phase of not scaling up anymore, there were no Reconciling logs at all.

Steps to Reproduce the Problem

Suppose the following setup:
1 topic, 24 partitions, a maximum of 24 replicas, a lag threshold of 1000 messages

My scaled object is configured like this:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: release-name-consumerapp-scaled-object
  labels:
    helm.sh/chart: consumerapp-0.1.0
    app.kubernetes.io/name: consumerapp
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/version: "0.1.12"
    app.kubernetes.io/managed-by: Helm
  annotations:
    scaledobject.keda.sh/transfer-hpa-ownership: "true"
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: release-name-consumerapp
  pollingInterval:  30
  cooldownPeriod:   300
  idleReplicaCount: 
  minReplicaCount: 1
  maxReplicaCount: 24
  triggers:
    - metadata:
        bootstrapServers: cluster-kafka-bootstrap.kafka:9092
        consumerGroup: group1
        lagThreshold: "1000"
        offsetResetPolicy: latest
        partitionLimitation: 1-24
        topic: topic1
      type: kafka
  advanced:
    horizontalPodAutoscalerConfig:
      name: consumerapp
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 90
          policies:
            - periodSeconds: 45
              type: Pods
              value: 2
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - periodSeconds: 120
              type: Percent
              value: 100
            - periodSeconds: 60
              type: Pods
              value: 4

If necessary, I can provide all other charts for this setup.

Logs from KEDA operator

No response

KEDA Version

2.11.1

Kubernetes Version

1.26

Platform

Google Cloud

Scaler Details

Kafka

Anything else?

No response

@jnsrnhld jnsrnhld added the bug Something isn't working label Jul 13, 2023
@JorTurFer
Member

Hi!
I have the feeling that the issue is related to the partition filtering (partitionLimitation). Are you sure that those are the partition IDs? Could you try removing that parameter?

@jnsrnhld
Author

Hi,
thank you for the fast answer! I already did a run without using this parameter at all; the result is more or less the same, although this time it scaled up to 22 replicas and then stopped again:
[screenshot from 2023-07-14, 20:00: replica count stops at 22 again]

I did another test run with double the number of partitions (48). Now the scaler scaled up to the max replica count of 24 and everything worked as intended.

So it seems like it's somehow related to the number of partitions, and the issue only occurs if the number of replicas is close to the number of partitions?

@JorTurFer
Member

JorTurFer commented Jul 18, 2023

Are you scraping KEDA metrics by chance? If yes, you could see what KEDA is getting from Kafka (keda_scaler_metrics_value). Could you share it too? That's the raw value that KEDA exposes to the HPA controller.
Based on previous logs like this one:

keda-operator-77f866d85b-l7hgz 2023-07-13T11:50:21Z    DEBUG    grpc_server    Providing metrics    {"scaledObjectName": "consumerapp-scaled-object", "scaledObjectNamespace": "kafka", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-topic1,MetricLabels:map[string]string{},Timestamp:2023-07-13 11:50:21.730253149 +0000 UTC m=+4289.499786044,WindowSeconds:nil,Value:{{23000000 -3} {<nil>}  DecimalSI},},},}"}

On the other hand, KEDA is exposing 23000 several times in a row, which is a suspiciously "round" value and looks as if the scaler only sees 23 "available partitions" in Kafka (maybe due to an error getting them), and that's why it limits the reported messages.
Could you give it another try, setting the parameter allowIdleConsumers: true?
[screenshot: allowIdleConsumers parameter description]

If my suspicion is right, it will work now, because the scaler won't limit the result to the available partitions.

@jnsrnhld
Author

Very interesting: I tried setting allowIdleConsumers to true, and it indeed scaled up to the max replicas / max partitions of 24.

I did another run without that option and monitored keda_scaler_metrics_value. The value was always capped when the lag was above 24k messages (see the blue line in the upper chart). The number of replicas was again stuck at 22 or 23.

[screenshot: keda_scaler_metrics_value capped at 24k while the actual lag keeps rising; replicas stuck at 22–23]

And as you already mentioned, this makes sense looking at the following lines: https://github.com/kedacore/keda/blob/main/pkg/scalers/kafka_scaler.go#L660. So, if allowIdleConsumers is disabled, the reported lag gets capped, and because of the HPA's 0.1 tolerance when calculating the number of replicas, no scale-up happens at all once the replica count is already close to the cap.

For example: with 22 current replicas and the lag capped at 24k, the HPA's usage ratio is 24k reported lag / (22 replicas * 1k threshold) ≈ 1.09. That is within the HPA's default tolerance of 0.1, so no scale-up to 24 replicas happens.
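
Here is a minimal sketch of that interaction (my own reconstruction for illustration, not KEDA's actual code: reportedLag mimics the capping from the linked kafka_scaler.go lines, and desiredReplicas mimics the HPA's AverageValue formula with its default 10% tolerance):

package main

import (
	"fmt"
	"math"
)

// Default HPA tolerance: scaling is skipped while the usage ratio is within ±10% of 1.0.
const hpaTolerance = 0.1

// reportedLag mimics the capping: without allowIdleConsumers, the lag handed
// to the HPA is limited to partitionCount * lagThreshold.
func reportedLag(totalLag, partitionCount, lagThreshold int64, allowIdleConsumers bool) int64 {
	if allowIdleConsumers {
		return totalLag
	}
	if upperBound := partitionCount * lagThreshold; totalLag > upperBound {
		return upperBound
	}
	return totalLag
}

// desiredReplicas applies the HPA AverageValue formula with the tolerance check.
func desiredReplicas(current, lag, lagThreshold int64) int64 {
	usageRatio := float64(lag) / float64(current*lagThreshold)
	if math.Abs(usageRatio-1.0) <= hpaTolerance {
		return current // within tolerance: keep the current replica count
	}
	return int64(math.Ceil(usageRatio * float64(current)))
}

func main() {
	// 55k real lag, 24 partitions, 1k threshold, 22 current replicas.
	capped := reportedLag(55000, 24, 1000, false)    // 24000
	fmt.Println(desiredReplicas(22, capped, 1000))   // 22 -> stuck, ratio ~1.09
	uncapped := reportedLag(55000, 24, 1000, true)   // 55000
	fmt.Println(desiredReplicas(22, uncapped, 1000)) // 55, which the HPA then caps at maxReplicaCount
}

With allowIdleConsumers: true, the 55k lag passes through uncapped, the ratio is far above the tolerance, and the HPA scales straight to its maxReplicaCount, which matches what I observed.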

So, if this is the way it's supposed to work, I guess we can close this ticket. Personally, this feels confusing: I don't want to allow idle consumers, nor does it feel right to double the partitions just to prevent this behavior. I'd expect the scaler to at least scale to the maximum number of replicas allowed by the number of partitions, which doesn't seem possible in this scenario.

I know my scenario is probably a very rare edge case; nevertheless, I feel a hint in the docs about this potential problem with the lag capping and the HPA tolerance could be beneficial.

@JorTurFer
Member

Maybe we can improve the formula somehow to ensure that the difference is always above the tolerance, but I'm not sure it's doable. Maybe by taking the tolerance into account in the calculation 🤔
@zroubalik , WDYT?

@joelsmith
Contributor

joelsmith commented Jul 21, 2023

I have an idea for how this could be solved, and I'm interested in implementing it if there is consensus that the new approach is worthwhile.

Advantages:

  • keda_scaler_metrics_value will show the actual value used for scaling rather than a clamped value in the !allowIdleConsumers case (less confusing for users)
  • HPA will be able to scale to the number of partitions (or to the max replica value in the case of other scalers who implement this idea)

Disadvantages:

  • The max replicas shown in the HPA might not match maxReplicaCount on the scaled object, potentially causing confusion (which would need to be addressed in docs)

My idea is that we could create an interface which scalers could choose to implement, allowing a scaler to provide a maxReplicaCount if it wants to. So this code:

if scaledObject.Spec.MaxReplicaCount != nil {
	return *scaledObject.Spec.MaxReplicaCount
}

would be changed to ask each trigger's scaler whether its trigger has a maxReplicaCount; if so, the HPA would be given the lowest value provided (including the configured value from the ScaledObject).

So, for example, imagine a scaled object with maxReplicaCount of 10 and a kafka trigger with a lagThreshold of 1, with allowIdleConsumers: false and aimed at a topic with only 5 partitions. The HPA would get configured with maxReplicaCount of 5 even though the scaled object has a higher value. That way if the lag got really huge (like, say, 1,000), you would see that value propagate through the system and HPA would show that value and that it wants to scale up, but can't because of the maxReplicaCount of 5, which came from the kafka scaler's limit based upon the number of partitions.
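
To make that concrete, here is a rough sketch of the shape such an interface could take (names like ScalerMaxReplicas and effectiveMaxReplicas, and the simplified Scaler interface, are made up for illustration; this is not existing KEDA code):

package main

import (
	"context"
	"fmt"
)

// Scaler is a simplified stand-in for KEDA's scaler interface, just enough for this example.
type Scaler interface {
	Close(ctx context.Context) error
}

// ScalerMaxReplicas is the hypothetical optional interface: a scaler that knows its own
// useful upper bound (e.g. the partition count for Kafka with allowIdleConsumers: false)
// can report it here.
type ScalerMaxReplicas interface {
	MaxReplicaCount(ctx context.Context) (int64, bool)
}

// effectiveMaxReplicas picks the lowest cap among the ScaledObject's configured
// maxReplicaCount and any caps reported by the triggers' scalers.
func effectiveMaxReplicas(ctx context.Context, configured int64, scalers []Scaler) int64 {
	lowest := configured
	for _, s := range scalers {
		if capped, ok := s.(ScalerMaxReplicas); ok {
			if limit, has := capped.MaxReplicaCount(ctx); has && limit < lowest {
				lowest = limit
			}
		}
	}
	return lowest
}

// fakeKafkaScaler stands in for a Kafka scaler pointed at a 5-partition topic.
type fakeKafkaScaler struct{ partitions int64 }

func (f fakeKafkaScaler) Close(ctx context.Context) error { return nil }

func (f fakeKafkaScaler) MaxReplicaCount(ctx context.Context) (int64, bool) {
	return f.partitions, true
}

func main() {
	scalers := []Scaler{fakeKafkaScaler{partitions: 5}}
	// ScaledObject has maxReplicaCount: 10, but the topic only has 5 partitions.
	fmt.Println(effectiveMaxReplicas(context.Background(), 10, scalers)) // 5
}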

While it is a little confusing, I find the current method even more confusing, since the actual lag value gets masked once it exceeds the number of partitions times the lag threshold. Not to mention the confusion in scenarios like the one reported here, where it won't scale to the number of partitions without some hackery.

Edited to fix one spot where I said minReplicaCount instead of maxReplicaCount by mistake

@JorTurFer
Member

I have an idea for how this could be solved, and I'm interested in implementing it if there is consensus that the new approach is worthwhile.

I think we will have to wait until the end of the summer because we are partially (or totally) out of office. We can discuss it at the next community standup in the worst case.

FYI @kedacore/keda-core-contributors

@tomkerkhove tomkerkhove changed the title KEDA Kafka-Scaler not scaling up anymore if close to max replica count KEDA Kafka-Scaler not scaling out anymore if close to max replica count Aug 17, 2023
@tomkerkhove tomkerkhove added the help wanted Looking for support from community label Aug 17, 2023
@stale

stale bot commented Oct 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Oct 16, 2023
@zroubalik zroubalik removed the stale All issues that are marked as stale due to inactivity label Oct 16, 2023

stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Dec 15, 2023

stale bot commented Dec 23, 2023

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Dec 23, 2023