appbench: Detailed per-app performance analysis [WIP] #615
This is an experimental idea to use the CPU PMU support (#597) to measure the performance impact of introducing a new segment into an app network. This proof-of-concept implementation measures the performance of a reference app network (`Source->Sink`) and then measures the increase caused by extending the app network (`Source->[myapp]->Sink`). The printed result estimates the impact of the app on metrics such as cycles per packet, L1/L2/L3 cache accesses per packet, and branch mispredictions per packet.
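As a rough illustration of the methodology (not the actual appbench code, which reads hardware PMU counters), the per-packet delta computation might be sketched like this in Python, with synthetic counter totals standing in for real runs:

```python
# Illustrative sketch only: counter names and totals are made up to mirror
# the Tee example below; real appbench reads hardware PMU counters.

def per_packet(totals, packets):
    """Convert raw counter totals for a run into per-packet averages."""
    return {name: count / packets for name, count in totals.items()}

def app_delta(reference, production):
    """Attribute the difference between the two runs to the inserted app."""
    return {name: production[name] - reference[name] for name in production}

packets = 100_000_000

# Synthetic totals for the reference (Source->Sink) and production
# (Source->Tee->Sink) runs.
ref_totals = {"cycles": 90 * packets, "l1_accesses": 20 * packets,
              "l2_accesses": 1 * packets}
prod_totals = {"cycles": 104 * packets, "l1_accesses": 33 * packets,
               "l2_accesses": 2 * packets}

# cost holds the estimated per-packet impact of the inserted app:
# 14 cycles, 13 L1 accesses, and 1 L2 access per packet.
cost = app_delta(per_packet(ref_totals, packets),
                 per_packet(prod_totals, packets))
```

The key assumption is that the reference and production runs are otherwise identical, so the whole difference can be attributed to the new segment.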
The app to be introduced is specified with a minimal command-line syntax:
For example, here is the full output when measuring the `Tee` app. Three sets of results are printed: the reference app network (`Source->Sink`), the production app network (`Source->Tee->Sink`), and the delta of the two (`->Tee->`):

Here we see an estimate for the `Tee` app of 14 cycles per packet, 13 L1 cache accesses per packet, and one L2 cache access per packet.

Here is the abbreviated (delta-only) output for a `pcap_filter` app:

The performance here is similar, and we can observe one extra L2 cache access. (I wonder what those L2 cache accesses are?)
And for the `keyed_ipv6_tunnel` app:

This one is more expensive and performs many more L1 cache accesses (moving packet payload in place).
Summary
This is really a proof of concept. I see some good things and some bad things.
Good:
Bad:
`Source->Sink` is limited and unrealistic because no I/O is happening. Many apps will perform quite differently depending on whether packet data is in L1/L2/L3/DRAM, and this needs to be understood.

I would actually quite like to experiment with having the engine automatically sample performance counters for every app on each call to `pull()` and `push()`. This proof of concept will be useful to support that work, because the results could be compared to estimate measurement error (which I see as the main risk in counting each app callback separately).

This PR is based on musings in #603.
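The per-callback sampling idea could be sketched as follows. This is a toy simulation, not the Snabb engine API: `FakeCycleCounter` stands in for a real PMU read (e.g. RDPMC), with a fixed cost per read so the instrumentation error becomes visible:

```python
# Toy simulation of per-callback counter sampling. Reading the counter
# itself costs cycles, so the sum of per-app measurements drifts from the
# whole-run measurement; that gap is the measurement error to estimate.

class FakeCycleCounter:
    """Simulated cycle counter: each read costs a fixed overhead, and
    simulated app work advances it by a known amount."""
    def __init__(self, read_overhead=30):
        self.value = 0
        self.read_overhead = read_overhead
    def read(self):
        self.value += self.read_overhead  # cost of the read itself
        return self.value
    def work(self, cycles):
        self.value += cycles              # stand-in for real processing

counter = FakeCycleCounter()
per_app = {}

def timed_callback(name, cycles):
    """Wrap one app callback (pull/push) in a pair of counter reads."""
    before = counter.read()
    counter.work(cycles)  # stand-in for app.pull() / app.push()
    per_app[name] = per_app.get(name, 0) + (counter.read() - before)

start = counter.read()
for name, cycles in [("Source", 100), ("Tee", 200), ("Sink", 150)]:
    timed_callback(name, cycles)
whole_run = counter.read() - start

# The gap between summed per-app counts and the whole-run count bounds
# the error introduced by instrumenting each callback individually.
error = whole_run - sum(per_app.values())
```

Comparing `sum(per_app.values())` against the whole-run delta measured by this PR's approach is exactly the cross-check described above for estimating that error.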