engine.report_pmu() [WIP] #616

lukego · 2015-09-13T21:16:06Z

This PR is an alternative to #615 for tracking performance counter values per app.

This branch counts the performance events during each individual push() and pull() call and accumulates a total for each app. That is, it directly counts the number of cycles, cache misses, etc., that occur during each callback.
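The accumulation scheme can be sketched roughly as follows. This is a minimal sketch, not the code from this branch: the pmu table, read_counters() and measured_call() are hypothetical stand-ins, and a real implementation would read the hardware counters instead of a Lua table.

```lua
-- Hypothetical stand-in for the PMU: a table of free-running counters.
local pmu = { cycles = 0, instructions = 0 }

-- Take a snapshot of the current counter values (assumed helper).
local function read_counters ()
   return { cycles = pmu.cycles, instructions = pmu.instructions }
end

-- Per-app accumulated totals, keyed by app name.
local totals = {}

-- Call app[method](app) and add the counter deltas to the app's totals.
local function measured_call (app, method)
   local before = read_counters()
   app[method](app)
   local after = read_counters()
   local acc = totals[app.name] or {}
   totals[app.name] = acc
   for event, value in pairs(after) do
      acc[event] = (acc[event] or 0) + (value - before[event])
   end
end

-- Toy app whose push() "costs" a fixed number of events per call.
local tee = { name = "Tee" }
function tee:push ()
   pmu.cycles = pmu.cycles + 15
   pmu.instructions = pmu.instructions + 37
end

for _ = 1, 1000 do
   measured_call(tee, "push")
   -- Work done outside the callback is not attributed to the app.
   pmu.cycles = pmu.cycles + 100
end

print(totals.Tee.cycles, totals.Tee.instructions)
```

Only events that occur between the two snapshots are charged to the app, which is why the extra cycles added in the loop body do not show up in the Tee totals.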

The function engine.report_pmu() prints the values for each app and also calculates per-packet values. (The number of packets an app has processed is determined from its links.)
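The per-packet column is simply each total divided by the packet count. Below is a minimal sketch of that arithmetic using two of the Tee totals reported in this PR; the helper names comma() and per_packet() are hypothetical, and how the packet count is read from the links is not shown.

```lua
-- Format a non-negative integer with thousands separators.
local function comma (n)
   local s = string.format("%.0f", n)
   local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
   return (out:gsub("^,", ""))
end

-- Divide every event total by the number of packets processed.
local function per_packet (totals, packets)
   local result = {}
   for event, total in pairs(totals) do
      result[event] = total / packets
   end
   return result
end

-- Totals and packet count taken from the Tee report in this PR.
local totals = { cycles = 14784901928, instructions = 36742279315 }
local packets = 1000359900
local pp = per_packet(totals, packets)

for _, event in ipairs({ "cycles", "instructions" }) do
   print(string.format("%-40s %14s %11.3f",
                       event, comma(totals[event]), pp[event]))
end
```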

The event counting seems to work surprisingly well. I was concerned that the individual callbacks would be too short and so there would be too much noise when trying to count their PMU events. However, the initial results are very consistent with the numbers reported in #615, which are based on long-running averages.

Results for a Tee app in a Source->Tee->Sink network from this branch:

$ sudo taskset -c 0 ./snabb snabbmark basic1 1e9
...
*** Tee
EVENT                                             TOTAL     /packet
cycles                                   14,784,901,928      14.780
ref_cycles                               11,104,338,216      11.100
instructions                             36,742,279,315      36.729
br_misp_retired.all_branches                  3,992,662       0.004
mem_load_uops_retired.l1_hit             12,783,402,186      12.779
mem_load_uops_retired.l2_hit                991,799,311       0.991
mem_load_uops_retired.l3_hit                     61,993       0.000
mem_load_uops_retired.l3_miss                         0       0.000
packet                                    1,000,359,900       1.000

and from #615 with comparable /packet column:

difference from reference to production:
EVENT                                             TOTAL     /packet
cycles                                    1,409,813,298      14.098
ref_cycles                                1,056,440,016      10.564
instructions                              3,858,706,478      38.587
br_misp_retired.all_branches                    361,693       0.004
mem_load_uops_retired.l1_hit              1,324,505,106      13.245
mem_load_uops_retired.l2_hit                109,539,516       1.095
mem_load_uops_retired.l3_hit                    180,222       0.002
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

The overhead is significant, however: the basic1 benchmark loses around 1/3 of its throughput when sampling the PMU counters. So you would only enable this feature when you are willing to sacrifice overall performance to see per-app performance.

Could be a good plan to tidy up this code and merge it in preference to #615.

I am still interested in having an appbench style program that can generate a "datasheet" for an app that estimates how it will perform in different situations (packet size, traffic mix, data in L1/L2/L3/DRAM, etc). However, it should be possible to build that on this code anyway.

Keep the arrays containing machine code alive by storing references to them in a Lua table. Otherwise they will be garbage collected. dynasm uses a GC callback to unmap memory that was used for generated code, so the most likely consequence is a segfault.

Here is how it looks in dmesg:

segfault at 7fe0d50de000 ip 00007fe0d50de000 sp 00007ffcd2c89cb8 error 14

where "error 14" means an error during instruction fetch. This problem triggered immediately when using the pmu library with non-trivial code under test (running an app network).
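As a rough illustration of the anchoring idea, the sketch below uses plain Lua tables as stand-ins for the arrays of generated machine code; it does not call dynasm. The weak-valued observed table and the make_code() helper are hypothetical and exist only so the effect of the collector can be seen.

```lua
-- Weak-valued table: lets us observe objects without keeping them alive.
local observed = setmetatable({}, { __mode = "v" })

-- Strong references: anything stored here survives garbage collection.
local anchors = {}

-- Create a stand-in for an array of generated machine code.
-- When 'keep' is true the object is anchored in a Lua table.
local function make_code (name, keep)
   local code = { name = name }
   observed[name] = code
   if keep then
      anchors[#anchors + 1] = code
   end
end

local function build ()
   make_code("unanchored", false)
   make_code("anchored", true)
end

build()
collectgarbage()
collectgarbage()

-- The unanchored object is eligible for collection; the anchored one is not.
print(observed.unanchored, observed.anchored)
```

With real generated code the collected object would also have its memory unmapped, so a later jump into it would fault rather than merely reading nil.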

Renamed 'ref-cycles' to 'ref_cycles' both for consistency with other counters and to make it easier to use as a Lua table key.

report() now takes a table for input rather than a counterset. This makes it easy to write Lua code that manipulates data (e.g. taking deltas from separate runs) and then call report() to format it.

report() now lexically sorts the counters based on their names, with the exception of the fixed-purpose counters (cycles, instructions, ref_cycles) that are printed first in a fixed order. This is intended to increase consistency and make it slightly easier to compare results by eyeball.
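A minimal sketch of the two ideas, taking deltas between plain tables and ordering the counter names, might look like this. The helpers delta() and ordered_names() are hypothetical and are not the pmu library's API, the counter values are made up for illustration, and the fixed order used here (cycles, ref_cycles, instructions) is the one seen in the reports in this PR.

```lua
-- Fixed-purpose counters, printed first in this order.
local fixed = { "cycles", "ref_cycles", "instructions" }

-- Per-event difference between two counter tables.
local function delta (before, after)
   local d = {}
   for event, value in pairs(after) do
      d[event] = value - (before[event] or 0)
   end
   return d
end

-- Fixed-purpose counters first, then the rest in lexical order.
local function ordered_names (tab)
   local names, is_fixed = {}, {}
   for _, name in ipairs(fixed) do
      is_fixed[name] = true
      if tab[name] ~= nil then names[#names + 1] = name end
   end
   local rest = {}
   for name in pairs(tab) do
      if not is_fixed[name] then rest[#rest + 1] = name end
   end
   table.sort(rest)
   for _, name in ipairs(rest) do names[#names + 1] = name end
   return names
end

-- Illustrative (made-up) counter values from two runs.
local run1 = { cycles = 100, instructions = 250, packet = 10 }
local run2 = { cycles = 400, instructions = 1000, packet = 30,
               ["br_misp_retired.all_branches"] = 2 }

local d = delta(run1, run2)
for _, name in ipairs(ordered_names(d)) do
   print(name, d[name])
end
```

Because the counters are ordinary table keys, the renamed ref_cycles can be written as tab.ref_cycles, while names containing a dot still need the bracket form.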

This is a quick implementation that is hard-coded to be enabled. Prints a report like this:

*** Tee
EVENT                                             TOTAL     /packet
cycles                                   14,872,691,360      14.868
ref_cycles                               11,169,642,000      11.166
instructions                             36,740,872,664      36.728
br_misp_retired.all_branches                  3,976,251       0.004
mem_load_uops_retired.l1_hit             12,770,250,074      12.766
mem_load_uops_retired.l2_hit              1,022,848,683       1.022
mem_load_uops_retired.l3_hit                    130,238       0.000
mem_load_uops_retired.l3_miss                         0       0.000
packet                                    1,000,347,915       1.000

*** Source
EVENT                                             TOTAL     /packet
cycles                                   31,175,791,454      31.165
ref_cycles                               23,407,559,568      23.399
instructions                             60,442,148,650      60.421
br_misp_retired.all_branches                 10,879,179       0.011
mem_load_uops_retired.l1_hit             23,403,995,255      23.396
mem_load_uops_retired.l2_hit                107,486,748       0.107
mem_load_uops_retired.l3_hit                    281,230       0.000
mem_load_uops_retired.l3_miss                         0       0.000
packet                                    1,000,347,915       1.000

*** Sink
EVENT                                             TOTAL     /packet
cycles                                   19,251,176,856      19.244
ref_cycles                               14,454,619,392      14.450
instructions                             45,548,677,978      45.533
br_misp_retired.all_branches                  3,945,214       0.004
mem_load_uops_retired.l1_hit              8,737,428,004       8.734
mem_load_uops_retired.l2_hit              1,016,554,792       1.016
mem_load_uops_retired.l3_hit                    121,803       0.000
mem_load_uops_retired.l3_miss                         0       0.000
packet                                    1,000,347,915       1.000

This is a quick change to make the 'snabbmark basic1' benchmark print per-app PMU counters. It also comments out one of the links on the 'Tee' app to make the basic1 benchmark more directly comparable with the benchmark in snabbco#615.

lukego · 2016-02-07T04:34:03Z

Closing PR: I am not actively developing this branch.


lukego added 4 commits September 13, 2015 11:27

snabbmark basic1: Use app.report_pmu()

4b4c3fa


lukego changed the title ~~engine.report_pmu()~~ engine.report_pmu() [WIP] Sep 13, 2015

lukego mentioned this pull request Sep 28, 2015

Snabb data structures: packets, links, and apps lukego/blog#11

Open

lukego mentioned this pull request Oct 31, 2015

Packet copies: Expensive or cheap? #648

Open

lukego closed this Feb 7, 2016

dpino pushed a commit to dpino/snabb that referenced this pull request Dec 6, 2016

Merge pull request snabbco#616 from Igalia/migrate-configuration-again

ec7cf69

Switch to 1-based indexing in snabb-softwire-v1
