appbench: Detailed per-app performance analysis [WIP] #615

Closed
wants to merge 4 commits

Conversation

@lukego lukego (Member) commented Sep 13, 2015

This is an experimental idea to use the CPU PMU support (#597) to measure the performance impact of introducing a new segment into an app network. This proof-of-concept implementation measures the performance of a reference app network (Source->Sink) and then measures the increase caused by extending the app network (Source->[myapp]->Sink).

The printed result estimates the app's impact on metrics such as cycles per packet, L1/L2/L3 cache accesses per packet, and branch mispredictions per packet.
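
Roughly, the measurement works like the sketch below. This is not the actual appbench code: run_and_count is a hypothetical helper standing in for the lib.pmu counter calls from #597, and the link/port names are illustrative.

    -- Sketch only: build the reference and production app networks and
    -- measure each under the PMU. run_and_count is hypothetical.
    local config     = require("core.config")
    local engine     = require("core.app")
    local basic_apps = require("apps.basic.basic_apps")

    -- Reference network: Source -> Sink
    local ref = config.new()
    config.app(ref, "source", basic_apps.Source)
    config.app(ref, "sink",   basic_apps.Sink)
    config.link(ref, "source.output -> sink.input")

    -- Production network: Source -> Tee -> Sink
    local prod = config.new()
    config.app(prod, "source", basic_apps.Source)
    config.app(prod, "tee",    basic_apps.Tee)
    config.app(prod, "sink",   basic_apps.Sink)
    config.link(prod, "source.output -> tee.input")
    config.link(prod, "tee.output -> sink.input")

    -- Hypothetical helper: run one network under the PMU and return a table
    -- of event -> count. The real code enables counters via lib.pmu, calls
    -- engine.main(), and reads the counters back; elided here.
    local function run_and_count (c)
       engine.configure(c)
       local counts = {}
       -- ... lib.pmu setup, engine.main({duration = ...}), fill counts ...
       return counts
    end

    local reference  = run_and_count(ref)
    local production = run_and_count(prod)
    -- The per-app estimate is production minus reference, per event.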

The app to introduce is specified with a minimal command-line syntax:

snabbmark appbench <module> <app> <config> <in-link> <out-link>

For example, here is the full output when measuring the Tee app. Three sets of results are printed: the reference app network (Source->Sink), the production app network (Source->Tee->Sink), and the delta of the two (->Tee->):

$ sudo taskset -c 0 ./snabb snabbmark appbench apps.basic.basic_apps Tee

reference result:
EVENT                                             TOTAL     /packet
cycles                                    4,926,970,014      49.270
ref_cycles                                3,697,570,944      36.976
instructions                             11,313,124,384     113.131
br_misp_retired.all_branches                  1,992,849       0.020
mem_load_uops_retired.l1_hit              3,406,962,877      34.070
mem_load_uops_retired.l2_hit                134,519,317       1.345
mem_load_uops_retired.l3_hit                    589,529       0.006
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

starting production run...
production result:
EVENT                                             TOTAL     /packet
cycles                                    6,336,783,312      63.368
ref_cycles                                4,754,010,960      47.540
instructions                             15,171,830,862     151.718
br_misp_retired.all_branches                  2,354,542       0.024
mem_load_uops_retired.l1_hit              4,731,467,983      47.315
mem_load_uops_retired.l2_hit                244,058,833       2.441
mem_load_uops_retired.l3_hit                    769,751       0.008
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

difference from reference to production:
EVENT                                             TOTAL     /packet
cycles                                    1,409,813,298      14.098
ref_cycles                                1,056,440,016      10.564
instructions                              3,858,706,478      38.587
br_misp_retired.all_branches                    361,693       0.004
mem_load_uops_retired.l1_hit              1,324,505,106      13.245
mem_load_uops_retired.l2_hit                109,539,516       1.095
mem_load_uops_retired.l3_hit                    180,222       0.002
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

Here we see an estimate for the Tee app of 14 cycles per packet, 13 L1 cache accesses per packet, and one L2 cache access per packet.

Here is the abbreviated (delta only) output for a pcap_filter app:

$ sudo taskset -c 0 ./snabb snabbmark appbench apps.packet_filter.pcap_filter PcapFilter '{filter = "not ip6 and not ip and not ether broadcast"}'
...
difference from reference to production:
EVENT                                             TOTAL     /packet
cycles                                    1,701,400,725      17.014
ref_cycles                                1,198,657,392      11.987
instructions                              4,413,841,189      44.138
br_misp_retired.all_branches                    203,686       0.002
mem_load_uops_retired.l1_hit              1,389,742,167      13.897
mem_load_uops_retired.l2_hit                211,134,462       2.111
mem_load_uops_retired.l3_hit                    152,820       0.002
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

The cost here is similar to the Tee app, with roughly one extra L2 cache access per packet. (I wonder what these L2 cache accesses are?)

And for the keyed_ipv6_tunnel app:

$ sudo taskset -c 0 ./snabb snabbmark appbench apps.keyed_ipv6_tunnel.tunnel SimpleKeyedTunnel '{ local_address = "00::2:1", remote_address = "00::2:1", local_cookie = "12345678", remote_cookie = "12345678", default_gateway_MAC = "a1:b2:c3:d4:e5:f6" }' decapsulated encapsulated
difference from reference to production:
EVENT                                             TOTAL     /packet
cycles                                    4,265,246,460      42.652
ref_cycles                                3,169,706,352      31.697
instructions                             12,000,681,785     120.007
br_misp_retired.all_branches                    675,747       0.007
mem_load_uops_retired.l1_hit              3,773,725,036      37.737
mem_load_uops_retired.l2_hit                230,747,398       2.307
mem_load_uops_retired.l3_hit                    236,410       0.002
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

This app is more expensive and shows many more L1 cache accesses per packet (it moves the packet payload in place).

Summary

This is really a proof of concept. I see some good things and some bad things.

Good:

  1. We start to see per-app behavior that can be useful for optimization.
  2. This method should be fairly robust against measurement error because it adds no instrumentation overhead to the code under test.

Bad:

  1. Source->Sink is limited and unrealistic because no I/O is happening. Many apps will perform quite differently depending on whether packet data is resident in L1/L2/L3 cache or DRAM, and this needs to be understood.
  2. The command-line syntax is clunky.

I would actually quite like to experiment with having the engine automatically sample performance counters for every app on each call to pull() and push(). This proof of concept will be useful to support that work because its results can be compared against per-callback counts to estimate measurement error (which I see as the main risk of counting each app callback separately).
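
As a rough illustration of what that could look like in the engine's breathe loop (assuming a counter-set style API in lib.pmu; the function names below are assumptions, not the actual API):

    -- Sketch only: accumulate counters per app by switching counter sets
    -- around each callback. pmu.new_counter_set()/pmu.switch_to() are
    -- assumed names and may differ from the real lib.pmu API.
    local pmu = require("lib.pmu")

    local per_app = {}   -- app name -> counter set

    local function with_counters (name, f)
       per_app[name] = per_app[name] or pmu.new_counter_set()
       pmu.switch_to(per_app[name])   -- attribute counts to this app
       f()
       pmu.switch_to(nil)             -- stop attributing counts to the app
    end

    -- In the breathe loop, instead of calling the callbacks directly:
    --    with_counters(name, function () app:pull() end)
    --    with_counters(name, function () app:push() end)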

This PR is based on musings in #603.

Keep the arrays containing machine code alive by storing references to
them in a Lua table. Otherwise they will be garbage collected.

dynasm uses a GC callback to unmap memory that was used for generated
code and so the most likely consequence is a segfault. Here is how it
looks in dmesg:

  segfault at 7fe0d50de000 ip 00007fe0d50de000 sp 00007ffcd2c89cb8 error 14

where "error 14" means an error during instruction fetch.

This problem triggered immediately when using the pmu library with
non-trivial code under test (running an app network).
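
The fix amounts to anchoring the generated code in a table that is never collected, roughly like this (illustrative sketch, not the exact code in the pmu library):

    -- Illustrative sketch: any object owning dynasm-generated machine code
    -- must stay reachable from Lua, because dynasm unmaps the code from a
    -- GC finalizer and a later call into it would then fault.
    local anchor = {}                  -- module-level; entries stay alive with the module

    local function keep_alive (mcode)  -- mcode: a generated-code object
       anchor[#anchor+1] = mcode
       return mcode
    end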

Renamed 'ref-cycles' to 'ref_cycles' both for consistency with other
counters and to make it easier to use as a Lua table key.

report() now takes a table for input rather than a counterset. This
makes it easy to write Lua code that manipulates data (e.g. taking
deltas from separate runs) and then call report() to format it.

report() now lexically sorts the counters based on their names, with
the exception of the fixed-purpose counters (cycles, instructions,
ref_cycles) that are printed first in a fixed order. This is intended
to increase consistency and make it slightly easier to compare results
by eyeball.
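
For example, taking the delta of two runs and formatting it could look like the sketch below (assumes report(tab, aux) accepts an event -> count table plus an auxiliary table such as {packet = n}; the variable names are illustrative):

    -- Sketch of the intended usage; how the event -> count tables are
    -- obtained from counter sets is elided here.
    local pmu = require("lib.pmu")

    local function delta (production, reference)
       local d = {}
       for event, count in pairs(production) do
          d[event] = count - (reference[event] or 0)
       end
       return d
    end

    -- reference_tab and production_tab are event -> count tables from two
    -- separate measurement runs.
    -- pmu.report(delta(production_tab, reference_tab), {packet = 100e6})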

appbench measures the impact on CPU performance counters when a new
app is introduced to an app network between a Source and a Sink.

This can be used to analyze the behavior of an individual app: for
example, how many cycles per packet it consumes, how many accesses to
each level of the cache hierarchy per packet, and how many branch
mispredictions per packet.

Here is an example analysis of the Tee app processing 100M packets:

    EVENT                                             TOTAL     /packet
    cycles                                    1,409,813,298      14.098
    ref_cycles                                1,056,440,016      10.564
    instructions                              3,858,706,478      38.587
    br_misp_retired.all_branches                    361,693       0.004
    mem_load_uops_retired.l1_hit              1,324,505,106      13.245
    mem_load_uops_retired.l2_hit                109,539,516       1.095
    mem_load_uops_retired.l3_hit                    180,222       0.002
    mem_load_uops_retired.l3_miss                         0       0.000
    packet                                      100,000,000       1.000
lukego added a commit to lukego/snabb that referenced this pull request Sep 13, 2015
This is a quick change to make the 'snabbmark basic1' benchmark print
per-app PMU counters.

It also comments out one of the links on the 'Tee' app to make the
basic1 benchmark more directly comparable with the benchmark in snabbco#615.
@lukego lukego mentioned this pull request Sep 13, 2015
@lukego lukego (Member, Author) commented Feb 7, 2016

Closing PR: I am not actively developing this branch.

@lukego lukego closed this Feb 7, 2016