Loading a trace is much slower in 1.2 than in 1.1 #192
Comments
Is recording (i.e. running perf) or analyzing (i.e. running hotspot / hotspot-perfparser) slower? You should only need to record once and then analyze the same data file - do you still see the performance degradation? Please share the file with me (either here publicly, or through private means) - please also include the dependent DSOs, i.e. bundle them together with the perf.data file and then send me the resulting archive.
Just investigated this further and it turns out that the trace generated with 1.1 was missing a lot of symbols for some reason, which made it faster to load but also less useful. Sorry for the noise!
Reopening, as there are situations where perfparser is excruciatingly slow - this was also reported by someone trying to profile Firefox.
The idea is to find (and cache) the Dwarf_Die's with DW_TAG_subprogram and then look for inline frames in them directly. Previously, we relied on `dwarf_getscopes{,_die}` for this, but that potentially needs to traverse the full DWARF die tree and can thus be quite time consuming.

The following shows the performance impact of this patch for a perf.data file with about 6M samples. Many frames in the callstacks point to a self-compiled libclang.so with debug symbols. That library alone is roughly 600MB large. This makes finding inline frames quite slow.

Before:

```
Performance counter stats for '/home/milian/projects/kdab/rnd/hotspot/build/lib/libexec/hotspot-perfparser --input perf.data.in --output /dev/null':

        80.159,75 msec task-clock                #    0,984 CPUs utilized
            4.075      context-switches          #    0,051 K/sec
                1      cpu-migrations            #    0,000 K/sec
          152.257      page-faults               #    0,002 M/sec
  346.071.892.881      cycles                    #    4,317 GHz                      (83,33%)
    1.940.060.936      stalled-cycles-frontend   #    0,56% frontend cycles idle     (83,33%)
   38.399.679.774      stalled-cycles-backend    #   11,10% backend cycles idle      (83,34%)
  999.298.133.335      instructions              #    2,89  insn per cycle
                                                 #    0,04  stalled cycles per insn  (83,31%)
  239.561.868.424      branches                  # 2988,556 M/sec                    (83,34%)
    1.163.589.915      branch-misses             #    0,49% of all branches          (83,34%)

     81,497496973 seconds time elapsed

     79,554970000 seconds user
      0,404933000 seconds sys
```

After:

```
Performance counter stats for '/home/milian/projects/kdab/rnd/hotspot/build/lib/libexec/hotspot-perfparser --input perf.data.in --output /dev/null':

        24.283,03 msec task-clock                #    1,000 CPUs utilized
               94      context-switches          #    0,004 K/sec
                0      cpu-migrations            #    0,000 K/sec
          168.432      page-faults               #    0,007 M/sec
  105.147.091.148      cycles                    #    4,330 GHz                      (83,33%)
      957.491.830      stalled-cycles-frontend   #    0,91% frontend cycles idle     (83,33%)
   10.502.770.200      stalled-cycles-backend    #    9,99% backend cycles idle      (83,34%)
  295.684.447.780      instructions              #    2,81  insn per cycle
                                                 #    0,04  stalled cycles per insn  (83,33%)
   71.359.066.393      branches                  # 2938,639 M/sec                    (83,34%)
      337.918.049      branch-misses             #    0,47% of all branches          (83,33%)

     24,285108368 seconds time elapsed

     23,901378000 seconds user
      0,318950000 seconds sys
```

So this is looking quite promising. The biggest offender after this patch is then the dwarf_getscopes_die call in prependScopeNames. We can probably add a better caching there too, i.e. we should be able to cache the scope name while building the subprogram range mapping. A follow up patch will look into that.

Relates-To: KDAB/hotspot#192
Change-Id: I0669cc3aad886b22165eaf1d0836a56e5183898d
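The commit message above describes the core trick only in prose, so here is a minimal, hypothetical C++ sketch of the same idea written directly against the elfutils libdw API (dwarf_getfuncs, dwarf_haspc, dwarf_child, dwarf_siblingof). It is not the actual perfparser code; the SubprogramCache and collectInlineFrames names, and the one-cache-per-CU layout, are assumptions made for illustration:

```cpp
// Sketch only: cache the DW_TAG_subprogram DIEs of a CU once, then restrict
// the search for inlined frames to the matching subprogram's subtree instead
// of letting dwarf_getscopes{,_die} walk the whole CU DIE tree per sample.
#include <elfutils/libdw.h>
#include <dwarf.h>
#include <optional>
#include <vector>

struct SubprogramCache
{
    std::vector<Dwarf_Die> subprograms; // one entry per DW_TAG_subprogram DIE

    // Walk the CU a single time; dwarf_getfuncs visits the subprogram DIEs.
    void build(Dwarf_Die *cuDie)
    {
        dwarf_getfuncs(cuDie, [](Dwarf_Die *die, void *arg) -> int {
            static_cast<SubprogramCache *>(arg)->subprograms.push_back(*die);
            return DWARF_CB_OK;
        }, this, 0);
    }

    // Find the cached subprogram DIE whose PC range covers addr, if any.
    std::optional<Dwarf_Die> findSubprogram(Dwarf_Addr addr) const
    {
        for (auto die : subprograms)
            if (dwarf_haspc(&die, addr) > 0)
                return die;
        return {};
    }
};

// Collect DW_TAG_inlined_subroutine DIEs covering addr, descending only into
// scopes (inlined subroutines, lexical blocks) that actually contain addr.
static void collectInlineFrames(Dwarf_Die *parent, Dwarf_Addr addr,
                                std::vector<Dwarf_Die> *frames)
{
    Dwarf_Die child;
    if (dwarf_child(parent, &child) != 0)
        return;
    do {
        const int tag = dwarf_tag(&child);
        if ((tag == DW_TAG_inlined_subroutine || tag == DW_TAG_lexical_block)
            && dwarf_haspc(&child, addr) > 0) {
            if (tag == DW_TAG_inlined_subroutine)
                frames->push_back(child);
            collectInlineFrames(&child, addr, frames); // nested inlining
        }
    } while (dwarf_siblingof(&child, &child) == 0);
}
```

The gain comes from walking each CU DIE tree once instead of once per sample; a production version would presumably also sort the cached subprograms by address so that findSubprogram() can binary-search rather than scan linearly.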
Implement an alternative approach to resolving inline frames and cache subprogram die names.

The problem with dwarf_getscopes and dwarf_getscopes_die is that it often has to traverse a large part of the DIE tree within a CU DIE to find its result. For larger DSOs this repeated tree walking can consume a lot of time.

The idea behind this patch is the following: We find (and cache) the Dwarf_Die's with DW_TAG_subprogram and then look for inline frames in that subtree directly. Additionally, we cache the scope names within a CU DIE more aggressively.

I hope that in the future this area can be improved even further, as the remaining calls to dwarf_getscopes_die still make up a large fraction of the overall analysis cost.

The following shows the performance impact of this patch for a perf.data file with about 6M samples. Many frames in the callstacks point to a self-compiled libclang.so with debug symbols. That library alone is roughly 600MB large. This makes finding inline frames quite slow.

Before:

```
        80.159,75 msec task-clock                #    0,984 CPUs utilized
            4.075      context-switches          #    0,051 K/sec
                1      cpu-migrations            #    0,000 K/sec
          152.257      page-faults               #    0,002 M/sec
  346.071.892.881      cycles                    #    4,317 GHz                      (83,33%)
    1.940.060.936      stalled-cycles-frontend   #    0,56% frontend cycles idle     (83,33%)
   38.399.679.774      stalled-cycles-backend    #   11,10% backend cycles idle      (83,34%)
  999.298.133.335      instructions              #    2,89  insn per cycle
                                                 #    0,04  stalled cycles per insn  (83,31%)
  239.561.868.424      branches                  # 2988,556 M/sec                    (83,34%)
    1.163.589.915      branch-misses             #    0,49% of all branches          (83,34%)

     81,497496973 seconds time elapsed

     79,554970000 seconds user
      0,404933000 seconds sys
```

After:

```
        15.558,09 msec task-clock                #    1,000 CPUs utilized
               99      context-switches          #    0,006 K/sec
                1      cpu-migrations            #    0,000 K/sec
          151.446      page-faults               #    0,010 M/sec
   67.961.461.389      cycles                    #    4,368 GHz                      (83,32%)
      759.299.629      stalled-cycles-frontend   #    1,12% frontend cycles idle     (83,32%)
    7.369.116.441      stalled-cycles-backend    #   10,84% backend cycles idle      (83,34%)
  187.648.727.850      instructions              #    2,76  insn per cycle
                                                 #    0,04  stalled cycles per insn  (83,34%)
   45.231.315.052      branches                  # 2907,254 M/sec                    (83,34%)
      200.377.846      branch-misses             #    0,44% of all branches          (83,33%)

     15,560370834 seconds time elapsed

     15,230305000 seconds user
      0,285776000 seconds sys
```

Which means we are now roughly 5x faster than before, which is a pretty significant gain.

Relates-To: KDAB/hotspot#192
Change-Id: I0669cc3aad886b22165eaf1d0836a56e5183898d
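The second half of the change, caching scope names within a CU DIE more aggressively, can be pictured as a memoization layer keyed by DIE offset. Again a hypothetical sketch on top of libdw rather than the code that actually landed; ScopeNameCache and the "??" fallback are illustrative assumptions:

```cpp
// Sketch only: remember the name computed for a scope DIE by its offset in
// the debug info, so that repeated samples hitting the same scope do not
// redo any DWARF work.
#include <elfutils/libdw.h>
#include <string>
#include <unordered_map>

class ScopeNameCache
{
public:
    std::string nameFor(Dwarf_Die *scopeDie)
    {
        const Dwarf_Off offset = dwarf_dieoffset(scopeDie);
        const auto it = m_names.find(offset);
        if (it != m_names.end())
            return it->second; // cache hit: no libdw calls at all

        const char *name = dwarf_diename(scopeDie);
        std::string result = name ? name : "??";
        m_names.emplace(offset, result); // remember for the next lookup
        return result;
    }

private:
    // DIE offsets are stable for a given debug-info file, which makes them
    // a cheap and unambiguous cache key.
    std::unordered_map<Dwarf_Off, std::string> m_names;
};
```

Since a handful of hot scopes typically account for most samples, such a cache turns per-sample name resolution into a hash lookup after the first hit.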
This should significantly reduce the time required to analyze perf.data files for applications with large amounts of DWARF debug data. In my local test, perfparser is roughly 5x faster now.

Relates-To: #192
I recorded a trace doing the same actions with 1.1 and 1.2 - loading it takes roughly five seconds with 1.1, while 1.2 has been at it for a few minutes and is only at 10%.
The trace is relatively large (700 megabytes).