
cAdvisor performance with perfs #2608

Open
wacuuu opened this issue Jul 3, 2020 · 4 comments

wacuuu commented Jul 3, 2020

Hi,

I did some performance measurements to get an insight into how cAdvisor with perf events enabled would behave on a production-like system. A word about my setup:

I have cAdvisor deployed as a daemonset, as it is in the examples; the only differences are that the CPU limit is set to 5 (to avoid a bottleneck there) and that I changed the perf configuration. As for the load, I have a deployment consisting of a single pod with two containers; they exist only to be entities for metric generation. Besides cAdvisor and the load, the machine runs only the containers responsible for keeping the node in the cluster (network manager, internal Kubernetes services, etc.).
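For reference, the relevant deviations from the example daemonset look roughly like this (a sketch, not the exact manifest I used; the config path and the surrounding spec fields are placeholders):

```yaml
# Sketch of the relevant fragment of the cAdvisor daemonset container spec.
# Only the CPU limit and the perf flag differ from the upstream example;
# /etc/config/perf.json is a placeholder path for the perf events config file.
containers:
- name: cadvisor
  args:
  - --perf_events_config=/etc/config/perf.json   # JSON file listing the perf events to measure
  resources:
    limits:
      cpu: "5"   # raised so that cAdvisor itself is not the bottleneck
```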

To measure response time and data volume, I executed the following command from the node running the load and cAdvisor:

time curl localhost:8080/metrics > /dev/null

These are the results:

| Number of containers | Number of enabled perf counters | Time of scraping (s) | Scraped data volume (MB) |
| --- | --- | --- | --- |
| 16 | 0 | 0.336 | 11.5 |
| 16 | 1 | 0.395 | 16 |
| 22 | 1 | 0.45 | 19.1 |
| 40 | 1 | 0.736 | 26.8 |
| 60 | 1 | 5.72 | 38.2 |
| 16 | 5 | 0.4 | 16.4 |
| 22 | 5 | 0.44 | 19.4 |
| 40 | 5 | 0.73 | 27.1 |
| 60 | 5 | 2.62 | 37.8 |
| 16 | 10 | 1.4 | 58.4 |
| 22 | 10 | 1.71 | 73.3 |
| 40 | 10 | 2.65 | 111 |
| 60 | 10 | 8.65 | 156 |

I would like to emphasize that 60 containers on a 40-core machine is not the worst-case scenario that could happen. Also, in terms of data scraping this is an optimistic case, since the scrape does not cross the datacenter network, which would typically also carry traffic from other applications and other nodes.

A production environment would be expected to have hundreds of nodes, each running cAdvisor measuring a couple of perf events, with hundreds of containers per node, all of this scraped every couple of seconds. With numbers like those in the tests, it is highly unlikely that this setup would not brick production with network overload. Therefore there is a need for some optimization in the amount of data served with perf events.
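To make the concern concrete, here is a rough back-of-the-envelope estimate. The node count and scrape interval are assumptions picked only for the sake of the argument; the per-scrape volume is the worst row from the table above:

```
500 nodes x 156 MB per scrape ≈ 78 GB per scrape round
78 GB / 15 s scrape interval  ≈ 5.2 GB/s of sustained monitoring traffic
```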


katarzyna-z commented Jul 3, 2020

60 containers, 40 cores, 10 perf counters gives 24,000 perf metrics (60x40x10); the data volume of the scraped Prometheus perf metrics is about 144.5 MB (156 MB - 11.5 MB, based on the data from the table).

I think that we can try to aggregate perf metrics by "event" and "id" and expose them in this form on the Prometheus endpoint. The aggregated form for 60 containers, 40 cores and 10 perf counters would have 600 perf metrics (60x10), and the estimated data volume for the aggregated perf metrics is about 3.6125 MB (144.5/40), so significantly less. In my opinion we could add an additional runtime parameter to cAdvisor, e.g. --perf_aggregate=true.
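To illustrate the idea (metric and label names below only approximate the current exposition format, and the values are made up): today there is one sample per (container, event, cpu), while the aggregated form would keep one sample per (container, event):

```
# current per-CPU form: ~24 000 samples for 60 containers x 10 events x 40 CPUs
container_perf_events_total{id="/docker/abc",event="instructions",cpu="0"} 12345
container_perf_events_total{id="/docker/abc",event="instructions",cpu="1"} 23456
# ...one line per remaining CPU...

# aggregated form: ~600 samples for 60 containers x 10 events
container_perf_events_total{id="/docker/abc",event="instructions"} 987654
```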

@dashpole what do you think?

katarzyna-z commented:

The idea with aggregation is shown in #2611.


dashpole commented Jul 6, 2020

Could we make perf metrics respect the percpu disable_metrics parameter?

katarzyna-z commented:

I think that we can use the percpu disable_metrics parameter for perf metrics.
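For example, something along these lines (an illustrative invocation; whether --disable_metrics=percpu should also cover perf metrics is exactly the change being discussed here):

```
# run cAdvisor without per-CPU breakdowns; with the proposed change, perf metrics
# would then be exposed only as per-container, per-event totals
cadvisor --perf_events_config=/etc/config/perf.json --disable_metrics=percpu
```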
