appbench: Detailed per-app performance analysis [WIP] #615

Closed
wants to merge 4 commits

Conversation

@lukego lukego (Member) commented Sep 13, 2015

This is an experimental idea to use the CPU PMU support (#597) to measure the performance impact of introducing a new segment into an app network. This proof-of-concept implementation measures the performance of a reference app network (Source->Sink) and then measures the increase caused by extending the app network (Source->[myapp]->Sink).

The printed result estimates the app's impact on metrics such as cycles per packet, L1/L2/L3 cache accesses per packet, and branch mispredictions per packet.
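
Roughly, the measurement works like the sketch below. This is not the actual appbench code: run_and_count is a hypothetical helper standing in for the lib.pmu counter calls from #597, and the link/port names are illustrative.

    -- Sketch only: build the reference and production app networks and
    -- measure each under the PMU. run_and_count is hypothetical.
    local config     = require("core.config")
    local engine     = require("core.app")
    local basic_apps = require("apps.basic.basic_apps")

    -- Reference network: Source -> Sink
    local ref = config.new()
    config.app(ref, "source", basic_apps.Source)
    config.app(ref, "sink",   basic_apps.Sink)
    config.link(ref, "source.output -> sink.input")

    -- Production network: Source -> Tee -> Sink
    local prod = config.new()
    config.app(prod, "source", basic_apps.Source)
    config.app(prod, "tee",    basic_apps.Tee)
    config.app(prod, "sink",   basic_apps.Sink)
    config.link(prod, "source.output -> tee.input")
    config.link(prod, "tee.output -> sink.input")

    -- Hypothetical helper: run one network under the PMU and return a table
    -- of event -> count. The real code enables counters via lib.pmu, calls
    -- engine.main(), and reads the counters back; elided here.
    local function run_and_count (c)
       engine.configure(c)
       local counts = {}
       -- ... lib.pmu setup, engine.main({duration = ...}), fill counts ...
       return counts
    end

    local reference  = run_and_count(ref)
    local production = run_and_count(prod)
    -- The per-app estimate is production minus reference, per event.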

The app to introduce is specified with a minimal command-line syntax:

snabbmark appbench <module> <app> <config> <in-link> <out-link>

For example, here is the full output when measuring the Tee app. Three sets of results are printed: the reference app network (Source->Sink), the production app network (Source->Tee->Sink), and the delta of the two (->Tee->):

$ sudo taskset -c 0 ./snabb snabbmark appbench apps.basic.basic_apps Tee

reference result:
EVENT                                             TOTAL     /packet
cycles                                    4,926,970,014      49.270
ref_cycles                                3,697,570,944      36.976
instructions                             11,313,124,384     113.131
br_misp_retired.all_branches                  1,992,849       0.020
mem_load_uops_retired.l1_hit              3,406,962,877      34.070
mem_load_uops_retired.l2_hit                134,519,317       1.345
mem_load_uops_retired.l3_hit                    589,529       0.006
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

starting production run...
production result:
EVENT                                             TOTAL     /packet
cycles                                    6,336,783,312      63.368
ref_cycles                                4,754,010,960      47.540
instructions                             15,171,830,862     151.718
br_misp_retired.all_branches                  2,354,542       0.024
mem_load_uops_retired.l1_hit              4,731,467,983      47.315
mem_load_uops_retired.l2_hit                244,058,833       2.441
mem_load_uops_retired.l3_hit                    769,751       0.008
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

difference from reference to production:
EVENT                                             TOTAL     /packet
cycles                                    1,409,813,298      14.098
ref_cycles                                1,056,440,016      10.564
instructions                              3,858,706,478      38.587
br_misp_retired.all_branches                    361,693       0.004
mem_load_uops_retired.l1_hit              1,324,505,106      13.245
mem_load_uops_retired.l2_hit                109,539,516       1.095
mem_load_uops_retired.l3_hit                    180,222       0.002
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

Here we see an estimate for the Tee app of 14 cycles per packet, 13 L1 cache accesses per packet, and one L2 cache access per packet.

Here is the abbreviated (delta only) output for a pcap_filter app:

$ sudo taskset -c 0 ./snabb snabbmark appbench apps.packet_filter.pcap_filter PcapFilter '{filter = "not ip6 and not ip and not ether broadcast"}'
...
difference from reference to production:
EVENT                                             TOTAL     /packet
cycles                                    1,701,400,725      17.014
ref_cycles                                1,198,657,392      11.987
instructions                              4,413,841,189      44.138
br_misp_retired.all_branches                    203,686       0.002
mem_load_uops_retired.l1_hit              1,389,742,167      13.897
mem_load_uops_retired.l2_hit                211,134,462       2.111
mem_load_uops_retired.l3_hit                    152,820       0.002
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

The cost here is similar to the Tee app, with roughly one extra L2 cache access per packet. (I wonder what these L2 cache accesses are?)

And for the keyed_ipv6_tunnel app:

$ sudo taskset -c 0 ./snabb snabbmark appbench apps.keyed_ipv6_tunnel.tunnel SimpleKeyedTunnel '{ local_address = "00::2:1", remote_address = "00::2:1", local_cookie = "12345678", remote_cookie = "12345678", default_gateway_MAC = "a1:b2:c3:d4:e5:f6" }' decapsulated encapsulated
difference from reference to production:
EVENT                                             TOTAL     /packet
cycles                                    4,265,246,460      42.652
ref_cycles                                3,169,706,352      31.697
instructions                             12,000,681,785     120.007
br_misp_retired.all_branches                    675,747       0.007
mem_load_uops_retired.l1_hit              3,773,725,036      37.737
mem_load_uops_retired.l2_hit                230,747,398       2.307
mem_load_uops_retired.l3_hit                    236,410       0.002
mem_load_uops_retired.l3_miss                         0       0.000
packet                                      100,000,000       1.000

This app is more expensive and shows many more L1 cache accesses per packet (it moves the packet payload in place).

Summary

This is really a proof of concept. I see some good things and some bad things.

Good:

  1. We start to see per-app behavior that can be useful for optimization.
  2. This method should be fairly robust against measurement error because it adds no instrumentation overhead to the code under test.

Bad:

  1. Source->Sink is limited and unrealistic because no I/O is happening. Many apps will perform quite differently depending on whether packet data is resident in L1/L2/L3 cache or DRAM, and this needs to be understood.
  2. The command-line syntax is clunky.

I would actually quite like to experiment with having the engine automatically sample performance counters for every app on each call to pull() and push(). This proof of concept will be useful to support that work because its results can be compared against per-callback counts to estimate measurement error (which I see as the main risk of counting each app callback separately).
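
As a rough illustration of what that could look like in the engine's breathe loop (assuming a counter-set style API in lib.pmu; the function names below are assumptions, not the actual API):

    -- Sketch only: accumulate counters per app by switching counter sets
    -- around each callback. pmu.new_counter_set()/pmu.switch_to() are
    -- assumed names and may differ from the real lib.pmu API.
    local pmu = require("lib.pmu")

    local per_app = {}   -- app name -> counter set

    local function with_counters (name, f)
       per_app[name] = per_app[name] or pmu.new_counter_set()
       pmu.switch_to(per_app[name])   -- attribute counts to this app
       f()
       pmu.switch_to(nil)             -- stop attributing counts to the app
    end

    -- In the breathe loop, instead of calling the callbacks directly:
    --    with_counters(name, function () app:pull() end)
    --    with_counters(name, function () app:push() end)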

This PR is based on musings in #603.

Keep the arrays containing machine code alive by storing references to
them in a Lua table. Otherwise they will be garbage collected.

dynasm uses a GC callback to unmap memory that was used for generated
code and so the most likely consequence is a segfault. Here is how it
looks in dmesg:

  segfault at 7fe0d50de000 ip 00007fe0d50de000 sp 00007ffcd2c89cb8 error 14

where "error 14" means an error during instruction fetch.

This problem triggered immediately when using the pmu library with
non-trivial code under test (running an app network).
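
The fix amounts to anchoring the generated code in a table that is never collected, roughly like this (illustrative sketch, not the exact code in the pmu library):

    -- Illustrative sketch: any object owning dynasm-generated machine code
    -- must stay reachable from Lua, because dynasm unmaps the code from a
    -- GC finalizer and a later call into it would then fault.
    local anchor = {}                  -- module-level; entries stay alive with the module

    local function keep_alive (mcode)  -- mcode: a generated-code object
       anchor[#anchor+1] = mcode
       return mcode
    end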

Renamed 'ref-cycles' to 'ref_cycles' both for consistency with other
counters and to make it easier to use as a Lua table key.

report() now takes a table for input rather than a counterset. This
makes it easy to write Lua code that manipulates data (e.g. taking
deltas from separate runs) and then call report() to format it.

report() now lexically sorts the counters based on their names, with
the exception of the fixed-purpose counters (cycles, instructions,
ref_cycles) that are printed first in a fixed order. This is intended
to increase consistency and make it slightly easier to compare results
by eyeball.
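
For example, taking the delta of two runs and formatting it could look like the sketch below (assumes report(tab, aux) accepts an event -> count table plus an auxiliary table such as {packet = n}; the variable names are illustrative):

    -- Sketch of the intended usage; how the event -> count tables are
    -- obtained from counter sets is elided here.
    local pmu = require("lib.pmu")

    local function delta (production, reference)
       local d = {}
       for event, count in pairs(production) do
          d[event] = count - (reference[event] or 0)
       end
       return d
    end

    -- reference_tab and production_tab are event -> count tables from two
    -- separate measurement runs.
    -- pmu.report(delta(production_tab, reference_tab), {packet = 100e6})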

appbench measures the impact on CPU performance counters when a new
app is introduced to an app network between a Source and a Sink.

This can be used to analyze the behavior of an individual app: for
example, how many cycles per packet it consumes, how many accesses to
each level of the cache hierarchy per packet, and how many branch
mispredictions per packet.

Here is an example analysis of the Tee app processing 100M packets:

    EVENT                                             TOTAL     /packet
    cycles                                    1,409,813,298      14.098
    ref_cycles                                1,056,440,016      10.564
    instructions                              3,858,706,478      38.587
    br_misp_retired.all_branches                    361,693       0.004
    mem_load_uops_retired.l1_hit              1,324,505,106      13.245
    mem_load_uops_retired.l2_hit                109,539,516       1.095
    mem_load_uops_retired.l3_hit                    180,222       0.002
    mem_load_uops_retired.l3_miss                         0       0.000
    packet                                      100,000,000       1.000
lukego added a commit to lukego/snabb that referenced this pull request Sep 13, 2015
This is a quick change to make the 'snabbmark basic1' benchmark print
per-app PMU counters.

It also comments out one of the links on the 'Tee' app to make the
basic1 benchmark more directly comparable with the benchmark in snabbco#615.
@lukego lukego mentioned this pull request Sep 13, 2015
@lukego lukego (Member, Author) commented Feb 7, 2016

Closing PR: I am not actively developing this branch.

@lukego lukego closed this Feb 7, 2016