engine.report_pmu() [WIP] #616

Closed
wants to merge 4 commits into from
@lukego lukego commented Sep 13, 2015

This PR is an alternative to #615 for tracking performance counter values per app.

This branch counts the performance events during each individual push() and pull() call and accumulates a total for each app. That is, it directly counts the number of cycles, cache misses, etc., that occur during each callback.
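The accumulation scheme described above can be sketched roughly as follows. This is an illustrative Python model, not the Lua code in this branch: `FakePMU`, `measure`, and the counter values are hypothetical stand-ins for reading the real hardware counters before and after each callback.

```python
from collections import defaultdict


class FakePMU:
    """Hypothetical stand-in for a hardware counter reader."""

    def __init__(self):
        self.counters = {"cycles": 0, "instructions": 0}

    def advance(self, cycles, instructions):
        # Simulate work being done on the CPU.
        self.counters["cycles"] += cycles
        self.counters["instructions"] += instructions

    def read(self):
        # Return a snapshot so later changes do not alias it.
        return dict(self.counters)


def measure(pmu, totals, app_name, callback):
    """Run one callback and add the counter deltas to the app's running total."""
    before = pmu.read()
    callback()
    after = pmu.read()
    for event, value in after.items():
        totals[app_name][event] += value - before[event]


pmu = FakePMU()
totals = defaultdict(lambda: defaultdict(int))

# Two apps whose callbacks "cost" different (made-up) amounts of work.
for _ in range(3):
    measure(pmu, totals, "Source", lambda: pmu.advance(30, 60))
    measure(pmu, totals, "Tee", lambda: pmu.advance(15, 37))

print({app: dict(events) for app, events in totals.items()})
```

Each app only ever accumulates the events that occurred inside its own callbacks, which is what makes the per-app totals separable.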

The function engine.report_pmu() prints the values for each app and also calculates per-packet values. (The number of packets an app has processed is determined from its links.)
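As a worked example of the per-packet calculation, here is a minimal Python sketch using the Tee totals from the report in this PR. The helper name `per_packet` is hypothetical, and how the packet count is derived from the links is not modeled here.

```python
def per_packet(totals, packets):
    """Divide each event total by the number of packets processed."""
    if packets == 0:
        # Avoid dividing by zero for an app that saw no traffic.
        return {event: 0.0 for event in totals}
    return {event: total / packets for event, total in totals.items()}


# Totals copied from the Tee report in this PR.
tee = {
    "cycles": 14_784_901_928,
    "instructions": 36_742_279_315,
    "packet": 1_000_359_900,
}

rates = per_packet(tee, tee["packet"])
for event, rate in rates.items():
    print(f"{event:<40} {tee[event]:>14,} {rate:>11.3f}")
```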

The event counting seems to work surprisingly well. I was concerned that the individual callbacks would be too short and so there would be too much noise when trying to count their PMU events. However, the initial results are very consistent with the numbers reported in #615 based on long running averages.

Results for a Tee app in a Source->Tee->Sink network from this branch:

$ sudo taskset -c 0 ./snabb snabbmark basic1 1e9
...
*** Tee
EVENT                                             TOTAL     /packet
cycles                                   14,784,901,928      14.780
ref_cycles                               11,104,338,216      11.100
instructions                             36,742,279,315      36.729
br_misp_retired.all_branches                  3,992,662       0.004
mem_load_uops_retired.l1_hit             12,783,402,186      12.779
mem_load_uops_retired.l2_hit                991,799,311       0.991
mem_load_uops_retired.l3_hit                     61,993       0.000
mem_load_uops_retired.l3_miss                         0       0.000
packet                                    1,000,359,900       1.000

and from #615 with comparable /packet column:

difference from reference to production:
EVENT                                             TOTAL     /packet
cycles                                    1,409,813,298      14.098
ref_cycles                                1,056,440,016      10.564
instructions                              3,858,706,478      38.587
br_misp_retired.all_branches                    361,693       0.004
mem_load_uops_retired.l1_hit              1,324,505,106      13.245
mem_load_uops_retired.l2_hit                109,539,516       1.095
mem_load_uops_retired.l3_hit                    180,222       0.002
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000
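To put a number on how close the two /packet columns are, the relative differences can be computed as in the Python sketch below. The values are copied from the two reports above; the helper name `relative_difference` is hypothetical.

```python
# /packet values copied from the two reports above.
this_branch = {
    "cycles": 14.780,
    "ref_cycles": 11.100,
    "instructions": 36.729,
    "mem_load_uops_retired.l1_hit": 12.779,
    "mem_load_uops_retired.l2_hit": 0.991,
}
pr_615 = {
    "cycles": 14.098,
    "ref_cycles": 10.564,
    "instructions": 38.587,
    "mem_load_uops_retired.l1_hit": 13.245,
    "mem_load_uops_retired.l2_hit": 1.095,
}


def relative_difference(a, b):
    """Difference of a from b, as a fraction of b."""
    return (a - b) / b


diffs = {
    event: relative_difference(this_branch[event], pr_615[event])
    for event in this_branch
}
for event, diff in diffs.items():
    print(f"{event:<40} {diff:+.1%}")
```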

The overhead is significant, however: the basic1 benchmark loses around 1/3 of its throughput when sampling the PMU counters. So you would only enable this feature when you are willing to sacrifice overall performance to see per-app performance.

Could be a good plan to tidy up this code and merge it in preference to #615.

I am still interested in having an appbench style program that can generate a "datasheet" for an app that estimates how it will perform in different situations (packet size, traffic mix, data in L1/L2/L3/DRAM, etc). However, it should be possible to build that on this code anyway.

Keep the arrays containing machine code alive by storing references to
them in a Lua table. Otherwise they will be garbage collected.

dynasm uses a GC callback to unmap memory that was used for generated
code and so the most likely consequence is a segfault. Here is how it
looks in dmesg:

  segfault at 7fe0d50de000 ip 00007fe0d50de000 sp 00007ffcd2c89cb8 error 14

where "error 14" means an error during instruction fetch.

This problem triggered immediately when using the pmu library with
non-trivial code under test (running an app network).
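The lifetime issue can be illustrated with a small Python analogy. This is not dynasm or the pmu library: `CodeBuffer`, `assemble`, and the `anchor` list are hypothetical, and a `weakref.finalize` callback plays the role of the GC callback that unmaps the code memory.

```python
import gc
import weakref


class CodeBuffer:
    """Hypothetical stand-in for an array holding generated machine code."""

    def __init__(self, name):
        self.name = name


unmapped = []  # names of buffers whose memory has been "unmapped"
anchor = []    # strong references that keep buffers alive


def assemble(name, keep_alive):
    buf = CodeBuffer(name)
    # Runs when the buffer is collected, like the unmap-on-GC callback.
    weakref.finalize(buf, unmapped.append, name)
    if keep_alive:
        anchor.append(buf)
    # Only an opaque "entry point" escapes, like a raw function pointer;
    # it does not keep the buffer alive by itself.
    return id(buf)


assemble("unanchored", keep_alive=False)
assemble("anchored", keep_alive=True)
gc.collect()

print(unmapped)
```

Only the buffer without a strong reference is finalized; calling through its stale entry point is the situation that shows up as the instruction-fetch segfault in the real code.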

Renamed 'ref-cycles' to 'ref_cycles' both for consistency with other
counters and to make it easier to use as a Lua table key.

report() now takes a table for input rather than a counterset. This
makes it easy to write Lua code that manipulates data (e.g. taking
deltas from separate runs) and then call report() to format it.

report() now lexically sorts the counters based on their names, with
the exception of the fixed-purpose counters (cycles, instructions,
ref_cycles) that are printed first in a fixed order. This is intended
to increase consistency and make it slightly easier to compare results
by eyeball.
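A rough Python sketch of the ordering rule and of the "plain table in, formatted report out" idea follows. The names `report_order`, `delta`, and `format_report` are hypothetical, the sample numbers are made up, and the fixed order used here is an assumption taken from the sample reports in this PR (cycles, ref_cycles, instructions).

```python
# Fixed-purpose counters come first, in the order seen in the sample reports.
FIXED_ORDER = ["cycles", "ref_cycles", "instructions"]


def report_order(counters):
    """Fixed-purpose counters first, then the rest sorted by name."""
    fixed = [name for name in FIXED_ORDER if name in counters]
    rest = sorted(name for name in counters if name not in FIXED_ORDER)
    return fixed + rest


def delta(after, before):
    """Subtract one run's counters from another's before reporting."""
    return {name: after[name] - before.get(name, 0) for name in after}


def format_report(counters, packets):
    """Format a plain dict of counters as report lines."""
    lines = [f"{'EVENT':<40} {'TOTAL':>14} {'/packet':>11}"]
    for name in report_order(counters):
        total = counters[name]
        lines.append(f"{name:<40} {total:>14,} {total / packets:>11.3f}")
    return lines


# Made-up sample values for two runs.
reference = {"instructions": 100, "cycles": 50, "packet": 10}
production = {
    "packet": 20,
    "mem_load_uops_retired.l1_hit": 70,
    "instructions": 400,
    "br_misp_retired.all_branches": 2,
    "cycles": 200,
    "ref_cycles": 150,
}

difference = delta(production, reference)
print("\n".join(format_report(difference, difference["packet"])))
```

Because the input is an ordinary table, any manipulation such as `delta` can happen before formatting without the report code needing to know about it.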

This is a quick implementation that is hard-coded to be enabled.

Prints a report like this:

    *** Tee
    EVENT                                             TOTAL     /packet
    cycles                                   14,872,691,360      14.868
    ref_cycles                               11,169,642,000      11.166
    instructions                             36,740,872,664      36.728
    br_misp_retired.all_branches                  3,976,251       0.004
    mem_load_uops_retired.l1_hit             12,770,250,074      12.766
    mem_load_uops_retired.l2_hit              1,022,848,683       1.022
    mem_load_uops_retired.l3_hit                    130,238       0.000
    mem_load_uops_retired.l3_miss                         0       0.000
    packet                                    1,000,347,915       1.000
    *** Source
    EVENT                                             TOTAL     /packet
    cycles                                   31,175,791,454      31.165
    ref_cycles                               23,407,559,568      23.399
    instructions                             60,442,148,650      60.421
    br_misp_retired.all_branches                 10,879,179       0.011
    mem_load_uops_retired.l1_hit             23,403,995,255      23.396
    mem_load_uops_retired.l2_hit                107,486,748       0.107
    mem_load_uops_retired.l3_hit                    281,230       0.000
    mem_load_uops_retired.l3_miss                         0       0.000
    packet                                    1,000,347,915       1.000
    *** Sink
    EVENT                                             TOTAL     /packet
    cycles                                   19,251,176,856      19.244
    ref_cycles                               14,454,619,392      14.450
    instructions                             45,548,677,978      45.533
    br_misp_retired.all_branches                  3,945,214       0.004
    mem_load_uops_retired.l1_hit              8,737,428,004       8.734
    mem_load_uops_retired.l2_hit              1,016,554,792       1.016
    mem_load_uops_retired.l3_hit                    121,803       0.000
    mem_load_uops_retired.l3_miss                         0       0.000
    packet                                    1,000,347,915       1.000

This is a quick change to make the 'snabbmark basic1' benchmark print
per-app PMU counters.

It also comments out one of the links on the 'Tee' app to make the
basic1 benchmark more directly comparable with the benchmark in snabbco#615.

@lukego lukego changed the title from "engine.report_pmu()" to "engine.report_pmu() [WIP]" Sep 13, 2015

lukego commented Feb 7, 2016

Closing PR: I am not actively developing this branch.

@lukego lukego closed this Feb 7, 2016
dpino pushed a commit to dpino/snabb that referenced this pull request Dec 6, 2016
Switch to 1-based indexing in snabb-softwire-v1