Loading a trace is much slower in 1.2 than in 1.1 #192
Comments
Is recording (i.e. running perf) or analyzing (i.e. running hotspot / hotspot-perfparser) slower? You should only need to record once and then analyze the same data file - do you still see the performance degradation? Please share the file with me (either here publicly, or through private means) - please also include the dependent DSOs, i.e. bundle them together with the perf.data file and then send me the resulting archive.
Just investigated this further and it turns out that the trace generated with 1.1 was missing a lot of symbols for some reason, which made it faster to load but also less useful. Sorry for the noise!
Reopening, as there are situations where perfparser is excruciatingly slow - this was also reported by someone trying to profile Firefox.
The idea is to find (and cache) the Dwarf_Die's with DW_TAG_subprogram and then look for inline frames in them directly. Previously, we relied on `dwarf_getscopes{,_die}` for this, but that potentially needs to traverse the full DWARF die tree and can thus be quite time consuming.

The following shows the performance impact of this patch for a perf.data file with about 6M samples. Many frames in the callstacks point to a self-compiled libclang.so with debug symbols. That library alone is roughly 600MB large. This makes finding inline frames quite slow.

Before:

```
Performance counter stats for '/home/milian/projects/kdab/rnd/hotspot/build/lib/libexec/hotspot-perfparser --input perf.data.in --output /dev/null':

        80.159,75 msec task-clock                #    0,984 CPUs utilized
            4.075      context-switches          #    0,051 K/sec
                1      cpu-migrations            #    0,000 K/sec
          152.257      page-faults               #    0,002 M/sec
  346.071.892.881      cycles                    #    4,317 GHz                      (83,33%)
    1.940.060.936      stalled-cycles-frontend   #    0,56% frontend cycles idle     (83,33%)
   38.399.679.774      stalled-cycles-backend    #   11,10% backend cycles idle      (83,34%)
  999.298.133.335      instructions              #    2,89  insn per cycle
                                                 #    0,04  stalled cycles per insn  (83,31%)
  239.561.868.424      branches                  # 2988,556 M/sec                    (83,34%)
    1.163.589.915      branch-misses             #    0,49% of all branches          (83,34%)

     81,497496973 seconds time elapsed

     79,554970000 seconds user
      0,404933000 seconds sys
```

After:

```
Performance counter stats for '/home/milian/projects/kdab/rnd/hotspot/build/lib/libexec/hotspot-perfparser --input perf.data.in --output /dev/null':

        24.283,03 msec task-clock                #    1,000 CPUs utilized
               94      context-switches          #    0,004 K/sec
                0      cpu-migrations            #    0,000 K/sec
          168.432      page-faults               #    0,007 M/sec
  105.147.091.148      cycles                    #    4,330 GHz                      (83,33%)
      957.491.830      stalled-cycles-frontend   #    0,91% frontend cycles idle     (83,33%)
   10.502.770.200      stalled-cycles-backend    #    9,99% backend cycles idle      (83,34%)
  295.684.447.780      instructions              #    2,81  insn per cycle
                                                 #    0,04  stalled cycles per insn  (83,33%)
   71.359.066.393      branches                  # 2938,639 M/sec                    (83,34%)
      337.918.049      branch-misses             #    0,47% of all branches          (83,33%)

     24,285108368 seconds time elapsed

     23,901378000 seconds user
      0,318950000 seconds sys
```

So this is looking quite promising. The biggest offender after this patch is then the dwarf_getscopes_die call in prependScopeNames. We can probably add a better caching there too, i.e. we should be able to cache the scope name while building the subprogram range mapping. A follow up patch will look into that.

Relates-To: KDAB/hotspot#192
Change-Id: I0669cc3aad886b22165eaf1d0836a56e5183898d
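The commit message above describes the core trick only in prose, so here is a minimal, hypothetical C++ sketch of the same idea written directly against the elfutils libdw API (dwarf_getfuncs, dwarf_haspc, dwarf_child, dwarf_siblingof). It is not the actual perfparser code; the SubprogramCache and collectInlineFrames names, and the one-cache-per-CU layout, are assumptions made for illustration:

```cpp
// Sketch only: cache the DW_TAG_subprogram DIEs of a CU once, then restrict
// the search for inlined frames to the matching subprogram's subtree instead
// of letting dwarf_getscopes{,_die} walk the whole CU DIE tree per sample.
#include <elfutils/libdw.h>
#include <dwarf.h>
#include <optional>
#include <vector>

struct SubprogramCache
{
    std::vector<Dwarf_Die> subprograms; // one entry per DW_TAG_subprogram DIE

    // Walk the CU a single time; dwarf_getfuncs visits the subprogram DIEs.
    void build(Dwarf_Die *cuDie)
    {
        dwarf_getfuncs(cuDie, [](Dwarf_Die *die, void *arg) -> int {
            static_cast<SubprogramCache *>(arg)->subprograms.push_back(*die);
            return DWARF_CB_OK;
        }, this, 0);
    }

    // Find the cached subprogram DIE whose PC range covers addr, if any.
    std::optional<Dwarf_Die> findSubprogram(Dwarf_Addr addr) const
    {
        for (auto die : subprograms)
            if (dwarf_haspc(&die, addr) > 0)
                return die;
        return {};
    }
};

// Collect DW_TAG_inlined_subroutine DIEs covering addr, descending only into
// scopes (inlined subroutines, lexical blocks) that actually contain addr.
static void collectInlineFrames(Dwarf_Die *parent, Dwarf_Addr addr,
                                std::vector<Dwarf_Die> *frames)
{
    Dwarf_Die child;
    if (dwarf_child(parent, &child) != 0)
        return;
    do {
        const int tag = dwarf_tag(&child);
        if ((tag == DW_TAG_inlined_subroutine || tag == DW_TAG_lexical_block)
            && dwarf_haspc(&child, addr) > 0) {
            if (tag == DW_TAG_inlined_subroutine)
                frames->push_back(child);
            collectInlineFrames(&child, addr, frames); // nested inlining
        }
    } while (dwarf_siblingof(&child, &child) == 0);
}
```

The gain comes from walking each CU DIE tree once instead of once per sample; a production version would presumably also sort the cached subprograms by address so that findSubprogram() can binary-search rather than scan linearly.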
Implement an alternative approach to resolving inline frames and cache subprogram die names.

The problem with dwarf_getscopes and dwarf_getscopes_die is that it often has to traverse a large part of the DIE tree within a CU DIE to find its result. For larger DSOs this repeated tree walking can consume a lot of time.

The idea behind this patch is the following: We find (and cache) the Dwarf_Die's with DW_TAG_subprogram and then look for inline frames in that subtree directly. Additionally, we cache the scope names within a CU DIE more aggressively.

I hope that in the future this area can be improved even further, as the remaining calls to dwarf_getscopes_die still make up a large fraction of the overall analysis cost.

The following shows the performance impact of this patch for a perf.data file with about 6M samples. Many frames in the callstacks point to a self-compiled libclang.so with debug symbols. That library alone is roughly 600MB large. This makes finding inline frames quite slow.

Before:

```
        80.159,75 msec task-clock                #    0,984 CPUs utilized
            4.075      context-switches          #    0,051 K/sec
                1      cpu-migrations            #    0,000 K/sec
          152.257      page-faults               #    0,002 M/sec
  346.071.892.881      cycles                    #    4,317 GHz                      (83,33%)
    1.940.060.936      stalled-cycles-frontend   #    0,56% frontend cycles idle     (83,33%)
   38.399.679.774      stalled-cycles-backend    #   11,10% backend cycles idle      (83,34%)
  999.298.133.335      instructions              #    2,89  insn per cycle
                                                 #    0,04  stalled cycles per insn  (83,31%)
  239.561.868.424      branches                  # 2988,556 M/sec                    (83,34%)
    1.163.589.915      branch-misses             #    0,49% of all branches          (83,34%)

     81,497496973 seconds time elapsed

     79,554970000 seconds user
      0,404933000 seconds sys
```

After:

```
        15.558,09 msec task-clock                #    1,000 CPUs utilized
               99      context-switches          #    0,006 K/sec
                1      cpu-migrations            #    0,000 K/sec
          151.446      page-faults               #    0,010 M/sec
   67.961.461.389      cycles                    #    4,368 GHz                      (83,32%)
      759.299.629      stalled-cycles-frontend   #    1,12% frontend cycles idle     (83,32%)
    7.369.116.441      stalled-cycles-backend    #   10,84% backend cycles idle      (83,34%)
  187.648.727.850      instructions              #    2,76  insn per cycle
                                                 #    0,04  stalled cycles per insn  (83,34%)
   45.231.315.052      branches                  # 2907,254 M/sec                    (83,34%)
      200.377.846      branch-misses             #    0,44% of all branches          (83,33%)

     15,560370834 seconds time elapsed

     15,230305000 seconds user
      0,285776000 seconds sys
```

Which means we are now roughly 5x faster than before, which is a pretty significant gain.

Relates-To: KDAB/hotspot#192
Change-Id: I0669cc3aad886b22165eaf1d0836a56e5183898d
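The second half of the change, caching scope names within a CU DIE more aggressively, can be pictured as a memoization layer keyed by DIE offset. Again a hypothetical sketch on top of libdw rather than the code that actually landed; ScopeNameCache and the "??" fallback are illustrative assumptions:

```cpp
// Sketch only: remember the name computed for a scope DIE by its offset in
// the debug info, so that repeated samples hitting the same scope do not
// redo any DWARF work.
#include <elfutils/libdw.h>
#include <string>
#include <unordered_map>

class ScopeNameCache
{
public:
    std::string nameFor(Dwarf_Die *scopeDie)
    {
        const Dwarf_Off offset = dwarf_dieoffset(scopeDie);
        const auto it = m_names.find(offset);
        if (it != m_names.end())
            return it->second; // cache hit: no libdw calls at all

        const char *name = dwarf_diename(scopeDie);
        std::string result = name ? name : "??";
        m_names.emplace(offset, result); // remember for the next lookup
        return result;
    }

private:
    // DIE offsets are stable for a given debug-info file, which makes them
    // a cheap and unambiguous cache key.
    std::unordered_map<Dwarf_Off, std::string> m_names;
};
```

Since a handful of hot scopes typically account for most samples, such a cache turns per-sample name resolution into a hash lookup after the first hit.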
This should significantly reduce the time required to analyze perf.data files for applications with large amounts of DWARF debug data. In my local test, perfparser is roughly 5x faster now.

Relates-To: #192
I recorded a trace doing the same actions with 1.1 and 1.2 - loading it takes roughly five seconds with 1.1, while 1.2 has been at it for a few minutes and is only at 10%.
The trace is relatively large (700 megabytes).