Determine resource request/limit for broker data plane components #876
Comments
This is P1 for the requests, but not P1 for the limit values or for repeatable benchmarks.
Clarification on scope: the request values are needed to determine the minimum cluster footprint. Optimal request sizes for performance are not necessary for R1.
Setup
Accumulating long connections and retries in the retry service. Every 2m there is a surge of requests because of the echoes.

Results (all single pod)
k top pods
NAME CPU(cores) MEMORY(bytes)
broker-fanout-76867ffc8-rdpzn 1670m 166Mi
broker-ingress-746f69945-hvvfz 610m 64Mi
broker-retry-7cc4559df5-l72bh 570m 239Mi

tl;dr: Fanout is pretty CPU intensive, while retry is slightly more memory intensive when there are a lot of events to be retried.

Results (more replicas)
Fanout - 3 replicas
Retry - 3 replicas
k top pods
NAME CPU(cores) MEMORY(bytes)
broker-fanout-76867ffc8-jfbrq 1047m 291Mi
broker-fanout-76867ffc8-rdpzn 984m 166Mi
broker-fanout-76867ffc8-rqwzn 722m 277Mi
broker-ingress-746f69945-hvvfz 857m 85Mi
broker-retry-7cc4559df5-4plhb 251m 609Mi
broker-retry-7cc4559df5-l72bh 503m 239Mi
broker-retry-7cc4559df5-r7pt4 335m 530Mi

With increased replicas, fanout/retry was able to distribute the load, and less CPU is used in every pod. However, it doesn't seem to impact memory usage as much. I suspect that's because memory usage has a baseline that depends on the number of pull subscriptions a pod has (and the outstanding bytes for each pull subscription); see the back-of-envelope sketch after this comment. The retry pod with significantly less memory usage (239Mi) seems to be handling more long connections. When the ingress replicas were not increased, I observed some reply delivery failures, which is an indicator that ingress might be reaching its limit.

Conclusion
Proposed resource limit:
Action items:
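As an illustrative aside (not part of the original comment): a back-of-envelope sketch of the memory-floor reasoning above. The subscription count and per-subscription byte cap below are hypothetical numbers, chosen only to show why adding replicas spreads CPU but not the memory baseline.

```go
// Sketch: each pull subscription can hold up to its outstanding-bytes cap of
// unacked messages in memory, and that buffer exists in every replica, so the
// per-pod memory floor does not shrink as replicas are added. Numbers are
// hypothetical, not measured values from this issue.
package main

import "fmt"

func main() {
	const (
		pullSubs            = 20      // hypothetical number of targets served by one pod
		maxOutstandingBytes = 1 << 20 // hypothetical 1MiB cap per subscription
	)
	perPodFloor := pullSubs * maxOutstandingBytes
	fmt.Printf("approximate per-pod buffer floor: %d bytes (%.1f MiB)\n",
		perPodFloor, float64(perPodFloor)/(1<<20))
}
```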
Thanks @yolocs, this is really interesting. Great to see real numbers. Was the load the same in the "single pod" run as in the "more replicas" run? I'm curious why retry was at 239MiB with a single pod but used more per pod with replicas. It seems like, if the retry memory usage were determined by the number of pull subs and outstanding bytes, the single pod would have memory usage greater than or equal to that of each replica.
After almost a day of running, things turned pretty ugly for the retry service.
It's now consuming a lot of memory and has caused a lot of evictions.
Based on the profiling data, it seems more than half of the memory usage was spent on http. @ian-mi, you might have some insights. I'm updating the slow subscribers to be not-slow and expect memory consumption to drop.
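For context, a minimal sketch of the kind of pprof wiring that yields this profiling data, assuming the binary registers the standard net/http/pprof handlers; the port and setup are assumptions, and the actual broker components may enable profiling through their own observability configuration instead.

```go
// Minimal sketch: expose pprof endpoints so heap usage can be inspected with
// `go tool pprof`. This only illustrates the mechanism; it is not the broker's
// actual profiling setup.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the default mux, which now includes the pprof handlers.
	// A heap snapshot can then be fetched with, for example:
	//   go tool pprof http://localhost:8008/debug/pprof/heap
	log.Fatal(http.ListenAndServe(":8008", nil))
}
```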
This behavior further confirms the need to:
Thanks for the numbers and insights! A few questions/comments:
Limiting the outstanding messages/bytes seems to effectively throttle resource usage. Here is a snapshot after I changed outstanding messages to 10 and bytes to 10000.
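For reference, a hedged sketch of what capping outstanding messages/bytes looks like with the Go Pub/Sub client library. The project ID and subscription name are hypothetical, the values simply mirror the ones tried above, and the broker data plane may set these knobs in its own pull wrapper rather than exactly like this.

```go
// Sketch: capping outstanding messages/bytes on a Pub/Sub subscription so that
// buffered-but-unacked messages cannot grow memory without bound.
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("broker-retry-sub") // hypothetical subscription
	// Mirror the values tried in the experiment above.
	sub.ReceiveSettings.MaxOutstandingMessages = 10
	sub.ReceiveSettings.MaxOutstandingBytes = 10000

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Process the event, then ack so the next one can be pulled.
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```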
Proposing new values (per the new guideline: no CPU limits, and memory request = limit); see the sketch after this comment for the shape this takes in a container spec:
What we have observed so far:
What we haven't confirmed:
In summary, there isn't much science behind these proposed values, but they should work if we have HPA to distribute the load.
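A sketch of what the "no CPU limit, memory request == limit" guideline translates to, expressed with the Kubernetes Go API types. The quantities are placeholders for illustration, not the values actually proposed in this issue.

```go
// Sketch: the guideline of no CPU limit (to avoid throttling) and memory
// request equal to memory limit (so scheduling matches actual usage).
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func fanoutResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			// CPU request sizes the minimum cluster footprint and gives the
			// HPA a baseline to scale against; placeholder value.
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("500Mi"),
		},
		Limits: corev1.ResourceList{
			// No CPU limit; memory limit equals the request (placeholder).
			corev1.ResourceMemory: resource.MustParse("500Mi"),
		},
	}
}

func main() {
	r := fanoutResources()
	fmt.Println("cpu request:", r.Requests.Cpu().String())
	fmt.Println("memory request:", r.Requests.Memory().String())
	fmt.Println("memory limit:", r.Limits.Memory().String())
}
```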
Problem
We need to run some benchmark tests to determine reasonable resource requests/limits for the broker data plane.
Persona:
Which persona is this feature for?
Operator
Exit Criteria
A recommended resource request/limit and why.
Time Estimate (optional):
How many developer-days do you think this may take to resolve?
Additional context (optional)