
cAdvisor performance with perfs #2608

Open
wacuuu opened this issue Jul 3, 2020 · 4 comments

wacuuu commented Jul 3, 2020

Hi,

I did some performance measurements to get an insight into how cAdvisor with perf events enabled would behave on a production-like system. A word about my setup:

I have cAdvisor deployed as a daemonset, as it is in the examples; the only differences are that the CPU limit is set to 5 (to avoid a bottleneck there) and that I changed the perf configuration. As for the load, I have a deployment consisting of a single pod with two containers; they exist only to be entities for metric generation. Besides cAdvisor and the load, the machine runs only the containers responsible for keeping the node in the cluster (network manager, internal Kubernetes services, etc.).
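For reference, the relevant deviations from the example daemonset look roughly like this (a sketch, not the exact manifest I used; the config path and the surrounding spec fields are placeholders):

```yaml
# Sketch of the relevant fragment of the cAdvisor daemonset container spec.
# Only the CPU limit and the perf flag differ from the upstream example;
# /etc/config/perf.json is a placeholder path for the perf events config file.
containers:
- name: cadvisor
  args:
  - --perf_events_config=/etc/config/perf.json   # JSON file listing the perf events to measure
  resources:
    limits:
      cpu: "5"   # raised so that cAdvisor itself is not the bottleneck
```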

To measure response time and data volume, I executed the following command from the node running the load and cAdvisor:

time curl localhost:8080/metrics > /dev/null

These are the results:

| Number of containers | Number of enabled perf counters | Time of scraping (s) | Scraped data volume (MB) |
| --- | --- | --- | --- |
| 16 | 0 | 0.336 | 11.5 |
| 16 | 1 | 0.395 | 16 |
| 22 | 1 | 0.45 | 19.1 |
| 40 | 1 | 0.736 | 26.8 |
| 60 | 1 | 5.72 | 38.2 |
| 16 | 5 | 0.4 | 16.4 |
| 22 | 5 | 0.44 | 19.4 |
| 40 | 5 | 0.73 | 27.1 |
| 60 | 5 | 2.62 | 37.8 |
| 16 | 10 | 1.4 | 58.4 |
| 22 | 10 | 1.71 | 73.3 |
| 40 | 10 | 2.65 | 111 |
| 60 | 10 | 8.65 | 156 |

I would like to emphasize that 60 containers on a 40-core machine is not the worst-case scenario that could happen. Also, in terms of data scraping this is an optimistic case, since the scrape does not cross the datacenter network, which would typically also carry traffic from other applications and other nodes.

A production environment would be expected to have hundreds of nodes, each running cAdvisor measuring a couple of perf events, with hundreds of containers per node, all of this scraped every couple of seconds. With numbers like those in the tests, it is highly unlikely that this setup would not brick production with network overload. Therefore there is a need for some optimization in the amount of data served with perf events.
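To make the concern concrete, here is a rough back-of-the-envelope estimate. The node count and scrape interval are assumptions picked only for the sake of the argument; the per-scrape volume is the worst row from the table above:

```
500 nodes x 156 MB per scrape ≈ 78 GB per scrape round
78 GB / 15 s scrape interval  ≈ 5.2 GB/s of sustained monitoring traffic
```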


katarzyna-z commented Jul 3, 2020

60 containers, 40 cores, 10 perf counters gives 24,000 perf metrics (60x40x10); the data volume of the scraped Prometheus perf metrics is about 144.5 MB (156 MB - 11.5 MB, based on the data from the table).

I think that we can try to aggregate perf metrics by "event" and "id" and expose them in this form on the Prometheus endpoint. The aggregated form for 60 containers, 40 cores and 10 perf counters would have 600 perf metrics (60x10), and the estimated data volume for the aggregated perf metrics is about 3.6125 MB (144.5/40), so significantly less. In my opinion we could add an additional runtime parameter to cAdvisor, e.g. --perf_aggregate=true.
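To illustrate the idea (metric and label names below only approximate the current exposition format, and the values are made up): today there is one sample per (container, event, cpu), while the aggregated form would keep one sample per (container, event):

```
# current per-CPU form: ~24 000 samples for 60 containers x 10 events x 40 CPUs
container_perf_events_total{id="/docker/abc",event="instructions",cpu="0"} 12345
container_perf_events_total{id="/docker/abc",event="instructions",cpu="1"} 23456
# ...one line per remaining CPU...

# aggregated form: ~600 samples for 60 containers x 10 events
container_perf_events_total{id="/docker/abc",event="instructions"} 987654
```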

@dashpole what do you think?

katarzyna-z commented:

The idea with aggregation is shown in #2611.


dashpole commented Jul 6, 2020

Could we make perf metrics respect the percpu disable_metrics parameter?

katarzyna-z commented:

I think that we can use the percpu disable_metrics parameter for perf metrics.
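For example, something along these lines (an illustrative invocation; whether --disable_metrics=percpu should also cover perf metrics is exactly the change being discussed here):

```
# run cAdvisor without per-CPU breakdowns; with the proposed change, perf metrics
# would then be exposed only as per-container, per-event totals
cadvisor --perf_events_config=/etc/config/perf.json --disable_metrics=percpu
```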
