Loading a trace is much slower in 1.2 than in 1.1 #192

Closed
jcelerier opened this issue May 23, 2019 · 3 comments

@jcelerier

I recorded a trace doing the same actions with 1.1 and 1.2 - loading it takes roughly five seconds with 1.1, while 1.2 has been at it for a few minutes and is only at 10%.

The trace is relatively large (700 megabytes).

@milianw
Member

milianw commented May 27, 2019

Is recording (i.e. running perf) or analyzing (i.e. running hotspot / hotspot-perfparser) slower?

You should only need to record once and then analyze the same data file - do you still see the performance degradation?

Please share the file with me (either here publicly, or through private means), and please also include the dependent DSOs, i.e. do something like this:

perf archive   # bundles the DSOs/symbols referenced by perf.data into perf.data.tar.bz2
tar -cjvf perf_data_and_symbols.tar.bz2 perf.data.tar.bz2 perf.data

Then send me perf_data_and_symbols.tar.bz2.

@milianw milianw added the bug label May 27, 2019
@jcelerier
Author

I just investigated this further, and it turns out that the trace generated with 1.1 was missing a lot of symbols for some reason, which made it faster to load but also less useful. Sorry for the noise!

@milianw
Member

milianw commented Jun 13, 2020

Reopening, as there are situations where perfparser is excruciatingly slow - this was also reported by someone trying to profile Firefox.

@milianw milianw reopened this Jun 13, 2020
@milianw milianw self-assigned this Jun 13, 2020
milianw added a commit to KDAB/perfparser that referenced this issue Jun 14, 2020
The idea is to find (and cache) the Dwarf_Die's with DW_TAG_subprogram
and then look for inline frames in them directly. Previously, we
relied on `dwarf_getscopes{,_die}` for this, but that potentially
needs to traverse the full DWARF DIE tree and can thus be quite
time-consuming.
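
For illustration, the core of this approach could look roughly like the
minimal libdw sketch below. This is not the perfparser patch itself: the
names `SubprogramRange`, `buildSubprogramCache` and `findSubprogram` are
made up, and subprograms that are only described via DW_AT_ranges (rather
than DW_AT_low_pc/DW_AT_high_pc) are ignored for brevity.

```cpp
#include <elfutils/libdw.h>
#include <dwarf.h>

#include <algorithm>
#include <vector>

struct SubprogramRange
{
    Dwarf_Addr low;
    Dwarf_Addr high;
    Dwarf_Die die; // the DW_TAG_subprogram DIE itself
};

// Walk the direct children of a CU DIE once and remember every subprogram
// together with its PC range, sorted by start address.
std::vector<SubprogramRange> buildSubprogramCache(Dwarf_Die *cuDie)
{
    std::vector<SubprogramRange> cache;
    Dwarf_Die child;
    if (dwarf_child(cuDie, &child) != 0)
        return cache; // no children, or error

    do {
        if (dwarf_tag(&child) != DW_TAG_subprogram)
            continue;
        Dwarf_Addr low = 0, high = 0;
        if (dwarf_lowpc(&child, &low) == 0 && dwarf_highpc(&child, &high) == 0)
            cache.push_back({low, high, child});
    } while (dwarf_siblingof(&child, &child) == 0);

    std::sort(cache.begin(), cache.end(),
              [](const SubprogramRange &a, const SubprogramRange &b) {
                  return a.low < b.low;
              });
    return cache;
}

// Resolve an address against the cache via binary search. Inline frames can
// then be searched only within the returned DIE's subtree, instead of letting
// dwarf_getscopes() walk the whole CU DIE tree for every sample.
Dwarf_Die *findSubprogram(std::vector<SubprogramRange> &cache, Dwarf_Addr addr)
{
    auto it = std::upper_bound(cache.begin(), cache.end(), addr,
                               [](Dwarf_Addr a, const SubprogramRange &r) {
                                   return a < r.low;
                               });
    if (it == cache.begin())
        return nullptr;
    --it;
    return (addr >= it->low && addr < it->high) ? &it->die : nullptr;
}
```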

The following shows the performance impact of this patch for a
perf.data file with about 6M samples. Many frames in the callstacks
point to a self-compiled libclang.so with debug symbols. That library
alone is roughly 600MB large. This makes finding inline frames quite
slow.

Before:

```
 Performance counter stats for '/home/milian/projects/kdab/rnd/hotspot/build/lib/libexec/hotspot-perfparser --input perf.data.in --output /dev/null':

         80.159,75 msec task-clock                #    0,984 CPUs utilized
             4.075      context-switches          #    0,051 K/sec
                 1      cpu-migrations            #    0,000 K/sec
           152.257      page-faults               #    0,002 M/sec
   346.071.892.881      cycles                    #    4,317 GHz                      (83,33%)
     1.940.060.936      stalled-cycles-frontend   #    0,56% frontend cycles idle     (83,33%)
    38.399.679.774      stalled-cycles-backend    #   11,10% backend cycles idle      (83,34%)
   999.298.133.335      instructions              #    2,89  insn per cycle
                                                  #    0,04  stalled cycles per insn  (83,31%)
   239.561.868.424      branches                  # 2988,556 M/sec                    (83,34%)
     1.163.589.915      branch-misses             #    0,49% of all branches          (83,34%)

      81,497496973 seconds time elapsed

      79,554970000 seconds user
       0,404933000 seconds sys
```

After:
```
 Performance counter stats for '/home/milian/projects/kdab/rnd/hotspot/build/lib/libexec/hotspot-perfparser --input perf.data.in --output /dev/null':

         24.283,03 msec task-clock                #    1,000 CPUs utilized
                94      context-switches          #    0,004 K/sec
                 0      cpu-migrations            #    0,000 K/sec
           168.432      page-faults               #    0,007 M/sec
   105.147.091.148      cycles                    #    4,330 GHz                      (83,33%)
       957.491.830      stalled-cycles-frontend   #    0,91% frontend cycles idle     (83,33%)
    10.502.770.200      stalled-cycles-backend    #    9,99% backend cycles idle      (83,34%)
   295.684.447.780      instructions              #    2,81  insn per cycle
                                                  #    0,04  stalled cycles per insn  (83,33%)
    71.359.066.393      branches                  # 2938,639 M/sec                    (83,34%)
       337.918.049      branch-misses             #    0,47% of all branches          (83,33%)

      24,285108368 seconds time elapsed

      23,901378000 seconds user
       0,318950000 seconds sys
```

So this is looking quite promising. The biggest offender after this
patch is the dwarf_getscopes_die call in prependScopeNames.
We can probably add better caching there too, i.e. we should be
able to cache the scope name while building the subprogram range
mapping. A follow-up patch will look into that.
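
Such a scope-name cache could, hypothetically, be as simple as memoizing the
name per DIE offset. The sketch below is only an illustration of that idea,
not the follow-up patch; `ScopeNameCache` and the use of dwarf_diename are
assumptions.

```cpp
#include <elfutils/libdw.h>

#include <string>
#include <unordered_map>

// Memoize the name computed for a scope DIE, keyed by its global DIE offset,
// so the potentially expensive lookup happens only once per scope.
class ScopeNameCache
{
public:
    const std::string &nameFor(Dwarf_Die *die)
    {
        const Dwarf_Off offset = dwarf_dieoffset(die); // stable key for this DIE
        const auto it = m_names.find(offset);
        if (it != m_names.end())
            return it->second; // cache hit: no repeated DWARF work

        const char *name = dwarf_diename(die); // may be null, e.g. for anonymous scopes
        return m_names.emplace(offset, name ? name : "").first->second;
    }

private:
    std::unordered_map<Dwarf_Off, std::string> m_names;
};
```

In the real code the cached value would presumably be the fully scoped name
that prependScopeNames builds, but the memoization pattern is the point here.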

Relates-To: KDAB/hotspot#192
Change-Id: I0669cc3aad886b22165eaf1d0836a56e5183898d
milianw added a commit to KDAB/perfparser that referenced this issue Jun 14, 2020
Implement an alternative approach to resolving inline frames and
cache subprogram die names.

The problem with dwarf_getscopes and dwarf_getscopes_die is that they
often have to traverse a large part of the DIE tree within a CU DIE
to find their result.
consume a lot of time.

The idea behind this patch is the following: We find (and cache)
the Dwarf_Die's with DW_TAG_subprogram and then look for inline frames
in that subtree directly.

Additionally, we cache the scope names within a CU DIE more
aggressively. I hope that in the future this area can be improved even
further, as the remaining calls to dwarf_getscopes_die still make up
a large fraction of the overall analysis cost.

The following shows the performance impact of this patch for a
perf.data file with about 6M samples. Many frames in the callstacks
point to a self-compiled libclang.so with debug symbols. That library
alone is roughly 600MB large. This makes finding inline frames quite
slow.

Before:

```
         80.159,75 msec task-clock                #    0,984 CPUs utilized
             4.075      context-switches          #    0,051 K/sec
                 1      cpu-migrations            #    0,000 K/sec
           152.257      page-faults               #    0,002 M/sec
   346.071.892.881      cycles                    #    4,317 GHz                      (83,33%)
     1.940.060.936      stalled-cycles-frontend   #    0,56% frontend cycles idle     (83,33%)
    38.399.679.774      stalled-cycles-backend    #   11,10% backend cycles idle      (83,34%)
   999.298.133.335      instructions              #    2,89  insn per cycle
                                                  #    0,04  stalled cycles per insn  (83,31%)
   239.561.868.424      branches                  # 2988,556 M/sec                    (83,34%)
     1.163.589.915      branch-misses             #    0,49% of all branches          (83,34%)

      81,497496973 seconds time elapsed

      79,554970000 seconds user
       0,404933000 seconds sys
```

After:

```
         15.558,09 msec task-clock                #    1,000 CPUs utilized
                99      context-switches          #    0,006 K/sec
                 1      cpu-migrations            #    0,000 K/sec
           151.446      page-faults               #    0,010 M/sec
    67.961.461.389      cycles                    #    4,368 GHz                      (83,32%)
       759.299.629      stalled-cycles-frontend   #    1,12% frontend cycles idle     (83,32%)
     7.369.116.441      stalled-cycles-backend    #   10,84% backend cycles idle      (83,34%)
   187.648.727.850      instructions              #    2,76  insn per cycle
                                                  #    0,04  stalled cycles per insn  (83,34%)
    45.231.315.052      branches                  # 2907,254 M/sec                    (83,34%)
       200.377.846      branch-misses             #    0,44% of all branches          (83,33%)

      15,560370834 seconds time elapsed

      15,230305000 seconds user
       0,285776000 seconds sys
```

This means we are now roughly 5x faster than before, which is a
pretty significant gain.

Relates-To: KDAB/hotspot#192
Change-Id: I0669cc3aad886b22165eaf1d0836a56e5183898d
milianw added a commit that referenced this issue Jun 14, 2020
This should significantly reduce the time required to analyze
perf.data files for applications with large amounts of DWARF debug
data. In my local test, perfparser is roughly 5x faster now.

Relates-To: #192
milianw added a commit to KDAB/perfparser that referenced this issue Jun 14, 2020
Implement an alternative approach to resolving inline frames and
cache subprogram die names.

@milianw milianw closed this as completed in fc12536 Jul 1, 2020
milianw added a commit to KDAB/perfparser that referenced this issue Dec 29, 2020
Implement an alternative approach to resolving inline frames and
cache subprogram die names.

Reviewed-by: Ulf Hermann <ulf.hermann@qt.io>