[profiling] Add profiling instrumentation centered on the eval loop #144
Replies: 13 comments
-
This should mostly be part of the job of the interpreter generator, see #5, IMO. Adding options to the interpreter generator to insert the instrumentation will be a lot more pleasant than dealing with C macros.
-
Disregarding how we instrument the code, are there any other significant boundaries we should consider?
-
As an exercise to better familiarize myself with the knowledge domain, I spent a little time adding some minimal instrumentation (and logging) to the eval loop, as well as a script to analyze the results. See my eval-loop-profiling branch. The instrumentation generates a log file with a minimal format:
For reference, a Python script that has a total runtime of about a second generates roughly 16MB of trace data. By default, the analysis script makes the log more human-friendly:
The script can also provide a summary, sorted either by count or by average elapsed time.
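To make that summary step concrete, here is a minimal Python sketch of the kind of aggregation such an analysis script would perform. The "label elapsed_ns" line format, the `summarize` function, and the sample labels are illustrative assumptions, not the actual format used in the branch:

```python
from collections import defaultdict

def summarize(lines):
    """Aggregate per-label counts and average elapsed times from a
    trace log (hypothetical "label elapsed_ns" line format)."""
    totals = defaultdict(lambda: [0, 0])  # label -> [count, total_ns]
    for line in lines:
        label, elapsed = line.split()
        totals[label][0] += 1
        totals[label][1] += int(elapsed)
    return {label: (count, total / count)
            for label, (count, total) in totals.items()}

trace = ["LOAD_FAST 120", "LOAD_FAST 80", "CALL_FUNCTION 1500"]
summary = summarize(trace)
# "Sorted by Count":
by_count = sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True)
# "Sorted by Average Elapsed Time":
by_avg = sorted(summary.items(), key=lambda kv: kv[1][1], reverse=True)
```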
-
This effort proved to be quite insightful for me. This isn't something to pursue further right now but will be worth revisiting later when we've added code generation for the eval loop. |
Beta Was this translation helpful? Give feedback.
-
I think this is worth revisiting. This is relevant to both startup and general execution. The text summary sounds useful; could we also get flamegraphs? I think they are a great way to visualize where time is being spent. We should also generate results using
-
FWIW, I generated a flamegraph: https://raw.githubusercontent.com/faster-cpython/ideas/main/flamegraphs/vm-freq999.svg. If you download the file locally, you can interact with it in your browser. The commands (after installing dependencies) look something like this:

```shell
# git clone https://github.com/brendangregg/FlameGraph ../FlameGraph
sample_rate_orig=$(cat /proc/sys/kernel/perf_event_max_sample_rate)
max_stack_orig=$(cat /proc/sys/kernel/perf_event_max_stack)
sudo sysctl -w kernel.perf_event_max_sample_rate=100000
sudo sysctl -w kernel.perf_event_max_stack=1000
sudo perf record -F max -a -g -- ./python -c 'pass'
sudo perf script > out.perf
../FlameGraph/stackcollapse-perf.pl out.perf > out.folded
../FlameGraph/flamegraph.pl out.folded > flamegraph.svg
sudo sysctl -w kernel.perf_event_max_sample_rate=$sample_rate_orig
sudo sysctl -w kernel.perf_event_max_stack=$max_stack_orig
```

See https://github.com/faster-cpython/tools/tree/main/scripts/run-perf.sh for the full operation, including installing dependencies.
-
Greetings from the Central European Timezone. :-)
Is there somewhere that describes what this particular flamegraph means? (I've seen explanations of flamegraphs in general, but I'm not familiar with the tool or methodology used to generate this particular one.)
-
I ran what amounts to PyPerformance's bench_expand, bench_integrate, bench_sum, and bench_str benchmarks in a loop for about a minute with a PGO ("PGUpdate") build, and collected results with Visual Studio's Performance Profiler. The hottest 15 functions:
I also got a line-by-line profile of _PyEval_EvalFrameDefault, and VS shows it like this: I couldn't find a way to make VS dump the output as text, but here are 17 (low-quality, but at least mostly legible) screenshots of the results for the whole function.
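The "run benchmarks in a loop for about a minute" step can be approximated with a small driver like the sketch below. `run_for` and the toy workload are hypothetical stand-ins for illustration, not the actual harness used:

```python
import time

def run_for(seconds, fn):
    """Call fn repeatedly until roughly `seconds` have elapsed;
    return the number of iterations completed."""
    deadline = time.perf_counter() + seconds
    n = 0
    while time.perf_counter() < deadline:
        fn()
        n += 1
    return n

# A bench_str-flavored toy workload: repeated string building.
def workload():
    "".join(str(i) for i in range(100))

iterations = run_for(0.1, workload)  # use ~60 seconds for a minute-long profile
```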
-
Flamegraphs are relatively new to me. After a little digging, I found a good resource: https://www.brendangregg.com/flamegraphs.html. This pointed me to https://github.com/brendangregg/FlameGraph, which is what I used to produce the flame graph. As noted above, it involved generating data in a specific way using the Linux "perf" tool and then passing that data through some scripts to produce the SVG file. (Note that the SVG is better/interactive if you download it and open the downloaded copy in your browser.) As to what it means (from the web page):
-
I just uploaded better flame graphs (10 kHz and 100 kHz), which I generated:
-
Observations from the flame graphs:
-
I've opened #75 to dive into runtime startup/shutdown profiling. I also created #76 to do the same for our target workloads.
-
We can circle back later (if we need it) to the specific instrumentation described in the original post above.
-
Here are the key boundaries:
- _PyEval_EvalFrameDefault()
- between _PyEval_EvalFrameDefault() and the top of the loop (setting up)
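A Python analogue of those boundary probes, to show the nesting the C instrumentation would follow. The `boundary` helper and the labels are invented for illustration; the real instrumentation would record timestamps in C, not via a context manager:

```python
import time
from contextlib import contextmanager

@contextmanager
def boundary(log, label):
    """Append (label, elapsed_ns) to `log` when the enclosed region exits."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        log.append((label, time.perf_counter_ns() - start))

log = []
with boundary(log, "eval_frame"):    # around _PyEval_EvalFrameDefault()
    with boundary(log, "setup"):     # from entry to the top of the loop
        pass                         # (stand-in for real frame setup)
    # ... the dispatch loop would run here ...
```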