Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregate perf metrics #2611

Merged
merged 1 commit into from
Aug 17, 2020
Merged

Conversation

katarzyna-z
Copy link
Collaborator

Signed-off-by: Katarzyna Kujawa katarzyna.kujawa@intel.com

I want to test it more and add more unit tests.
Related to #2608

@katarzyna-z katarzyna-z force-pushed the kk_aggregate_perf branch 5 times, most recently from d2eb985 to cb52f40 Compare July 8, 2020 14:09
@katarzyna-z
Copy link
Collaborator Author

Tested with 8 cores, 60 containers, 5 perf events. configuration file:

{
  "core": {
    "events": [
      ["instructions"],
      ["instructions_retired"],
      ["cycles"],
      ["ref-cycles"],
      ["context-switches"]
    ],
    "custom_events": [
      {
        "type": 4,
        "config": [
          "0x5300c0"
        ],
        "name": "instructions_retired"
      }
    ]
  }
}

Aggregated perf events
command: sudo ./cadvisor -perf_events_config=perf/testing/perf.json -disable_metrics="accelerator,disk,diskIO,network,process,hugetlb,resctrl,udp,advtcp,referenced_memory,cpu_topology,tcp,sched,percpu"
response time on /metrics endpoint: 0.419s
data size of response: 5MB

Per CPU perf events
command: sudo ./cadvisor -perf_events_config=perf/testing/perf.json -disable_metrics="accelerator,disk,diskIO,network,process,hugetlb,resctrl,udp,advtcp,referenced_memory,cpu_topology,tcp,sched"
response time on /metrics endpoint: 0.514s
data size of response: 14MB

@wacuuu Could you test this pull request in your environment?

@wacuuu
Copy link

wacuuu commented Jul 9, 2020

baseline for disable_metrics is set to
udp,referenced_memory,resctrl,hugetlb,cpu_topology,tcp,advtcp,sched,process, i only add or remove percpu parameter

To measure the parameters i run the following command on the node that is running cadvisor in container:
time curl localhost:8080/metrics > /dev/null

Number of containers Number of enabled perf counters Time of scraping(s) Scraped data volume(MB) disable percpu
58 5 0.89s 20M yes
58 5 3.07s 97M no
-- -- -- -- --
58 10 1.82s 21.8M yes
58 10 12.92s 168M no
-- -- -- -- --
32 10 0.86s 15.3M yes
32 10 4.81s 21.4M no

So the conclusion would be that percpu significantly improves the situation.

@katarzyna-z katarzyna-z changed the title [WIP] Aggregate perf metrics Aggregate perf metrics Jul 9, 2020
@katarzyna-z
Copy link
Collaborator Author

@dashpole it is ready for review.

- `--disable_metrics="percpu"` - core perf events are aggregated
- `--disable_metrics=""` - core perf events are exposed per CPU.

Aggregated form of core perf events significantly decrease volume of data. For aggregated form of core perf events scaling ratio (`container_perf_metric_scaling ratio`) indicates average scaling ratio for specific event.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO "average" is no useful comparing to the worst case when scaling happened
after aggregation with average equal 90% one doesnt know if the all cpus were scaled with by 10% or most of them were 100% and one was running only 5% and scaled 20 times (quite significant error)

You should add or replace (avg) with min scaling ratio accross all cpus.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so I'll replace average with min

@katarzyna-z
Copy link
Collaborator Author

/retest

@dashpole dashpole self-assigned this Aug 13, 2020
@dashpole
Copy link
Collaborator

@katarzyna-z is it possible to unit test this?

Add documentation about core perf events aggregation

Signed-off-by: Katarzyna Kujawa <katarzyna.kujawa@intel.com>
@@ -44,6 +44,21 @@ func TestPrometheusCollector(t *testing.T) {
testPrometheusCollector(t, reg, "testdata/prometheus_metrics")
}

func TestPrometheusCollectorWithPerfAggregated(t *testing.T) {
Copy link
Collaborator Author

@katarzyna-z katarzyna-z Aug 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dashpole Unit tests are available in this file. Do you think that I should add something?

Copy link
Collaborator

@dashpole dashpole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dashpole dashpole merged commit 65e04ec into google:master Aug 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants