Adjust broker ingress resource requirements to meet default events per second target #1600
Comments
To support the target load, here are proposed values:
Here is the reasoning.

CPU: Between 1000 and 2000 qps, CPU usage ranges from 1800m to 3000m. Event size has very little impact on CPU usage. Setting 2000m CPU allows HPA to kick in near 1000 qps.

Mem: With 1000 qps and 256k event size, stable memory usage ranges between 300Mi and 1Gi, staying below 500Mi most of the time. However, when there is a sudden increase in qps or event size, there is always a surge in memory usage which gradually smooths out. The surge can sometimes reach 2Gi. Setting the memory limit to 2Gi helps prevent OOM (HPA on memory has proven ineffective against memory surges), and it also leaves room for larger events.

Publish buffer: With a 100Mi buffer size (the default value), we started to see significantly more 500s at 800 qps with 256k events. Setting it to 200Mi mitigates this problem (under the same load). I also want to avoid setting a higher buffer size, as it might allow much more memory to accumulate and cause OOM. See comment.

HPA: I want CPU to be the main factor for autoscaling. I'm trying to set the memory target at the upper bound I've seen (with the target load), so hopefully in practice it will never use memory to scale up.
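For concreteness, here is a minimal sketch of what those values could look like when constructed with the Kubernetes API types in Go. The 2000m CPU and 2Gi memory figures, and the equal request/limit split, come from the reasoning above; the package layout and function name are hypothetical, not from this repo.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// ingressResources is a hypothetical helper reflecting the values discussed above:
// requests equal to limits, 2000m CPU (so HPA on CPU kicks in near ~1000 qps) and
// 2Gi memory (headroom for surges and larger events).
func ingressResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2000m"),
			corev1.ResourceMemory: resource.MustParse("2Gi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("2000m"),
			corev1.ResourceMemory: resource.MustParse("2Gi"),
		},
	}
}

func main() {
	fmt.Printf("%+v\n", ingressResources())
}
```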
These numbers make sense to me, thanks for the investigation! Just a couple of followups:
Not sure if it's close to "linear", but under 2000 qps I definitely see that more qps means more CPU, and vice versa.
I was not considering a different memory request/limit; I assumed they have to be the same (as GKE recommends). The numbers you mentioned would probably also work.
Great research, thanks @yolocs! On your memory chart I see a surge in usage to 9GiB. Will this be an issue if we set the memory limit to 2GiB? Seems like we should investigate the reason for that memory usage surge to determine if it can be eliminated or mitigated.
@grantr Sorry for the confusion. That surge was caused by a different setup. I was too lazy to cut that part out of my screenshot.
Ah got it. I still think we should look into why the memory surge is happening :)
@grantr Fair point :) In practice, there are some challenges I have seen so far.
The idea I'm having now is to have two separate sets of tests: 1) test the memory pattern of a bare HTTP server, and 2) test the memory pattern of a Pub/Sub publish client. If one of them shows a similar pattern to what I have seen, then it could be the cause.
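A rough sketch of what the first test could look like (the handler behavior and sampling interval are my own assumptions, not from this issue): a bare HTTP server that drains request bodies, with a goroutine that periodically samples runtime memory stats so the usage pattern can be compared against what the ingress showed.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"runtime"
	"time"
)

func main() {
	// Periodically sample heap usage to observe the memory pattern under load.
	go func() {
		var m runtime.MemStats
		for range time.Tick(5 * time.Second) {
			runtime.ReadMemStats(&m)
			log.Printf("heap alloc: %d MiB, sys: %d MiB", m.HeapAlloc>>20, m.Sys>>20)
		}
	}()

	// Bare handler that drains the request body, loosely mimicking the ingress receive path.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if _, err := io.Copy(io.Discard, r.Body); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```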
In all the follow-up tests, I was not able to reliably reproduce the memory surges. In fact, under gradual qps/size increases I observed no obvious change in memory usage at all, although at a certain point the Pub/Sub publish buffer limit error started to surface more often. One amendment to my proposed values: make the Pub/Sub publish buffer size limit 300Mi. In the follow-up tests, this value more reliably prevented the reaching-buffer-limit error when the load was slightly beyond the target.
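Assuming the knob in question is the Go Pub/Sub client's per-topic publish buffer, the change would look roughly like the sketch below. The project ID and topic name are placeholders, and this does not assert how (or whether) the broker ingress exposes this setting through config.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// "my-project" and "broker-ingress" are placeholders, not names from this issue.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatalf("pubsub.NewClient: %v", err)
	}
	defer client.Close()

	topic := client.Topic("broker-ingress")
	// Raise the publish buffer from the client's ~100MiB default to 300MiB, per the
	// amendment above. When the buffer is full, publishes fail with an overflow error
	// rather than accumulating unbounded memory.
	topic.PublishSettings.BufferedByteLimit = 300 * 1024 * 1024

	result := topic.Publish(ctx, &pubsub.Message{Data: []byte("test event")})
	if _, err := result.Get(ctx); err != nil {
		log.Printf("publish failed: %v", err)
	}
}
```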
Problem
Once we have an events per second target #1596 and a default event payload size #1599, we can measure the resource requirements for a single ingress pod to meet the target. That's our default resource requirement value.
Persona:
User
Exit Criteria
The broker ingress deployment is created with default resource requirements allowing it to meet the default events per second target.
Additional context (optional)
Part of #1552.