We observe performance degradation when scaling out a large number of deployments together via KEDA. We tested scaling behavior for N ScaledObjects, with N = 100, 200, 500, 1000, 1500, and 2000.
We expect KEDA to scale each deployment from 0 --> 2 replicas during the activation window.
- In the testing below, only the CRON scaler is used, so we can observe scaling from 0 to a constant desired replica count and back.
- When the number of ScaledObjects N is in the range 700 < N < 1250, it takes a significant amount of time (approximately 2.5 hours) for all target deployments to reach the desired replica count. The delay is only in the 1 --> 2 scaling step.
- KEDA takes ~5 minutes to activate all ScaledObjects and bring all deployments from 0 --> 1 replica, but KEDA/HPA then takes a long time to scale replicas from 1 --> 2.
NOTE:
- We have ensured there is ample compute and that all ResourceQuotas have surplus, so this is not a resource crunch.
- We have validated that when N = 1500 or even 2000, all deployments scale up within ~14-15 minutes, which is expected given node scale-up and pods reaching the Running state.
- We only see this anomaly when the number of ScaledObjects/deployments is between 700 and 1250.
Expected Behavior
- Every HPA object should call the KEDA metrics API server every 15s (the default) to fetch metrics, starting from the CRON window start time.
- The KEDA metrics API server logs the request made by the HPA and internally calls the KEDA operator to compute the actual external metric, which is visible in the KEDA operator's gRPC logs.
- Finally, the KEDA metrics API server also logs when the metrics have been successfully calculated by the KEDA operator and exposed.
- Every ScaledObject should be reconciled every 30s by the KEDA operator.
Actual Behavior
- A few of the HPAs only call the KEDA metrics API server to fetch metrics 2h 30m after the CRON window start time.
- We see a latency of around 1 minute in the handshake between the KEDA operator and the KEDA metrics API server while the external metric is generated and exposed.
- We observe pressure on the KEDA operator: each reconciliation/polling pass takes >30s.
Steps to Reproduce the Problem
1. Create the ScaledObject below, targeting a simple deployment with one container.
```yaml
# scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-scaledobjecttxy-10
  namespace: test-ns
spec:
  scaleTargetRef:
    name: app-deployedtxy-10
  minReplicaCount: 0
  advanced:
    restoreToOriginalReplicaCount: true
  triggers:
  - type: cron
    metadata:
      timezone: Asia/Kolkata
      start: 00 14 * * *   # every day at 2:00 pm IST
      end: 00 19 * * *     # every day at 7:00 pm IST
      desiredReplicas: "2"
    name: "cron-sample"
```
2. Create N ScaledObjects/deployments; in this case N = 1050. (We saw this behavior for any value between 700 and 1250, so any such value can be used to reproduce the bug.)
3. Make sure there is no resource crunch while scaling: provision enough compute for all 1050 deployments to scale up to 2 replicas each (worker nodes plus surplus namespace ResourceQuota).
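To generate the N ScaledObject/Deployment pairs for step 2, a loop like the following can be used. This is a minimal sketch: the `app-deployedtxy-*`/`app-scaledobjecttxy-*` names follow the manifest above, while the `nginx` image and the `manifests/` output directory are assumptions for illustration.

```shell
#!/usr/bin/env sh
# Generate N Deployment + ScaledObject manifest pairs into ./manifests,
# following the naming scheme from the example ScaledObject above.
N="${N:-1050}"
mkdir -p manifests
i=1
while [ "$i" -le "$N" ]; do
cat > "manifests/app-${i}.yaml" <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployedtxy-${i}
  namespace: test-ns
spec:
  replicas: 0
  selector:
    matchLabels: {app: app-${i}}
  template:
    metadata:
      labels: {app: app-${i}}
    spec:
      containers:
      - name: app
        image: nginx   # any simple single-container workload
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-scaledobjecttxy-${i}
  namespace: test-ns
spec:
  scaleTargetRef:
    name: app-deployedtxy-${i}
  minReplicaCount: 0
  advanced:
    restoreToOriginalReplicaCount: true
  triggers:
  - type: cron
    metadata:
      timezone: Asia/Kolkata
      start: 00 14 * * *
      end: 00 19 * * *
      desiredReplicas: "2"
EOF
i=$((i + 1))
done
echo "generated $N manifest files"
```

The generated files can then be applied in one shot with `kubectl apply -f manifests/`.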
Logs from KEDA operator
CRON window timing -
Start : 2024-08-05T14:00:00.000+05:30
End : 2024-08-05T19:00:00.000+05:30
The first request for a ScaledObject exhibiting the issue (app-scaledobjecttxy-10) is logged at 2024-08-05T16:33:19.216+05:30.
[keda-operator-reconcile-logs.json](https://github.com/user-attachments/files/16579557/keda-operator-reconcile-logs.json)
[keda-operator-logs.csv](https://github.com/user-attachments/files/16579559/keda-operator-logs.csv)
[keda-metricsapi-server-logs.csv](https://github.com/user-attachments/files/16579560/keda-metricsapi-server-logs.csv)
KEDA Version
2.13.1
Kubernetes Version
1.28
Platform
Amazon Web Services
Scaler Details
CRON
Anything else?
No response
Hello,
At scale, there are 2 configurations that can be affecting you, creating the bottleneck:
- Parallel reconciliations
- Kubernetes client throttling
For the parallel topic, I'd suggest increasing KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES from its current value of 5 to, say, 20 (check whether it improves and solves the issue; if it only improves it, increase further) -> https://keda.sh/docs/2.15/operate/cluster/#configure-maxconcurrentreconciles-for-controllers. This allows more ScaledObjects to be reconciled in parallel (if this is the bottleneck).
For Kubernetes client throttling, you can increase these other parameters -> https://keda.sh/docs/2.15/operate/cluster/#kubernetes-client-parameters
If you are affected by this, you should see log messages announcing the rate limit and the wait time it causes. In that case, I'd recommend doubling the values and monitoring how it performs; if that's not enough, double them again, and so on.
There have also been some improvements related to status handling, so upgrading to v2.15 could improve performance, as it significantly reduces calls to the API server in some cases (if that is the root cause in your case).
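For reference, both knobs from the linked docs are set on the keda-operator Deployment. A minimal sketch of the relevant fragment; the values 40/60/20 are illustrative starting points (double the defaults), not recommendations:

```yaml
# Fragment of the keda-operator Deployment (illustrative values)
spec:
  template:
    spec:
      containers:
      - name: keda-operator
        args:
        - --kube-api-qps=40     # default 20; client-side QPS limit
        - --kube-api-burst=60   # default 30; client-side burst limit
        env:
        - name: KEDA_SCALEDOBJECT_CTRL_MAX_RECONCILES
          value: "20"           # default 5; parallel ScaledObject reconciles
```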