[profiling] Profile/analyze runtime startup (and shutdown). #75
Comments
Here is the 100,000 Hz flame graph. The observations that follow are based on it (ignoring "perf").
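For anyone reproducing this, a flame graph at that sampling rate can be produced roughly like so (a sketch, not the exact commands from the thread; the `./python` binary and FlameGraph script paths are assumptions):

```
# Sample a bare startup at 100 kHz with call graphs.
perf record -F 100000 -g -- ./python -c pass

# Fold the stacks and render an SVG with Brendan Gregg's FlameGraph scripts.
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > startup.svg
```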
With the "time" command we can see how long
Combining that with the flamegraph info we can approximate clock time spent:
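A minimal sketch of that kind of timing run (the actual output was not preserved here; the binary name is an assumption):

```
# Wall-clock, user, and sys time for a bare startup: "-c pass" initializes
# the interpreter, runs an empty program, and shuts down.
time ./python -c pass
```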
FTR, if we skip the site module, the timing improves and the flame graph is clearly different.
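For reference, skipping site looks roughly like this (a sketch, not the exact command used in the thread):

```
# -S suppresses the automatic "import site" during interpreter initialization.
time ./python -S -c pass
```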
Possible improvements:
Other todo items:
FTR, here are all the symbols with at least one sample, sorted by #samples.
As expected, the results vary with each run, though the outcome is relatively consistent.
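A per-symbol sample count like that can be pulled out of the recorded perf data roughly as follows (the exact report options used here are an assumption):

```
# Aggregate the recorded samples by symbol and print a flat text report.
perf report --stdio --sort symbol
```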
Also, I took a look at what shared objects are loaded:
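Two quick ways to check which shared objects get pulled in (a sketch; the listing from the thread itself was not preserved):

```
# Link-time dependencies of the binary.
ldd ./python

# Files actually opened during a real startup (extension modules and other
# shared objects show up here).
strace -e trace=openat ./python -c pass 2>&1 | grep '\.so'
```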
For completeness, here's strace info:
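The kind of strace run meant here might look like this (the exact flags are assumptions):

```
# Per-syscall counts and aggregate time for a bare startup.
strace -c ./python -c pass

# Full trace with timestamps, following child processes, written to a file.
strace -f -tt -o startup.strace ./python -c pass
```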
FWIW, we can use a loop to stabilize the perf results:
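Something along these lines (a sketch; the iteration counts are arbitrary):

```
# Many startups under a single perf session give far more samples than one ~20 ms run.
perf record -F 10000 -g -- sh -c 'for i in $(seq 1000); do ./python -c pass; done'

# Alternatively, let perf repeat the run and report mean/stddev of the counters.
perf stat -r 100 ./python -c pass
```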
I've also used uftrace to get some data (for "python -c pass"): all calls ordered by usage, i.e. sorted by "self" time (not counting children), with percentages and totals, and the call graph (also with percentages and totals).
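Roughly how such uftrace data can be collected (an assumption; the actual invocation and build requirements used in the thread are not shown):

```
# Record function entry/exit times; -P. asks uftrace to dynamically patch
# every function it can (this has build requirements on the target binary).
uftrace record -P. ./python -c pass

# Flat report sorted by self time (excluding children), with totals.
uftrace report -s self

# Call graph with accumulated times.
uftrace graph
```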
Note that the total time is much smaller than expected, given that normal un-traced execution took 17ms. Possibilities:
Regardless, the results remain informative.
Searching the 100 kHz flame graph for "r_object" (the workhorse of marshal.loads) I see that quite a large fraction of time is spent there (it's in several separate piles). I didn't add up the percentages, but it looks close enough to match @nascheme's estimate of 27%. Of course, the initialization doesn't have a lot of unused code (I saw a little fluff in site.py), so making marshal.loads lazy won't help much. (OTOH, making it faster, e.g. via Experiment B #65, would pay off.) The scheme we discussed this morning (issue TBD) might be even more effective.

OTOH, for some low-hanging fruit, @iritkatriel first remarked that it's spending a lot of time in GC. This effect seems strongest in the pile on top of get_frozen_object() -- it spends 60% of its time in PyTuple_New(), and most of that is in gc_collect_main(). Now, get_frozen_object() only represents 2.62% of the total (main() is 48%). But still. Maybe the GC just isn't well-tuned for quickly loading code objects? Or maybe it isn't tuned at all?
@ericsnowcurrently Would a flamegraph with more samples/sec (e.g. 10,000 Hz) be useful? I'm worried that we have only ~2000 samples in total.
We did 100,000 Hz. We could certainly go higher. (I was hoping the data generated by uftrace would be exhaustive, and thus give us the full picture, but it's not giving me the results I'm expecting.)
FYI I meant e.g. 1,000,000 Hz. 10,000 Hz would be lower resolution, and result in only 200 samples. Flame graphs are hard!
Profiling is hard to do. I think there are some problems with increasing the frequency too high. Definitely you are influencing the results due to the extra overhead of saving the profile stats. To get more samples, I created a loop that ran the startup hundreds or thousands of times. That has its own problems since the effect of various caches is changed on repeated runs like that (i.e. in a tight loop).
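One concrete ceiling worth knowing about when pushing the frequency up (my note, not from the thread): the kernel clamps perf's sampling rate, so very large -F values get silently reduced.

```
# Current cap on samples per second; perf throttles itself when it hits this.
cat /proc/sys/kernel/perf_event_max_sample_rate

# Raising the cap (root required) permits higher -F, at the cost of more overhead.
sudo sysctl kernel.perf_event_max_sample_rate=1000000
```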
Once deep-freeze lands it would be nice to regenerate the flame graph to see if there are now new bottlenecks that we could attack. |
We now have profiling information: https://github.com/faster-cpython/benchmarking-public/tree/main/profiling |
The per-benchmark profiles for the "startup" benchmarks should give us the information we need. |
There's little guidance about what the data files found there mean, or even how one would visualize them. |
Fine-grained profile info for startup will be helpful for knowing where to focus effort. Steps: