[enhancement] cut 60% time for PerfParser::filterResults #574

Merged
1 commit merged into KDAB:master on Dec 19, 2023

Conversation

MuchToMyDelight
Contributor

When changing to "aggregate cost by symbol", filterResults spends a lot of time doing string comparisons. This patch reduces it dramatically.

The base version (origin/master): notice that operator== takes 50.6% of the time; total time is 24.x seconds.
(screenshot)

The optimized version: the cost of operator== is gone, and the total time is reduced to 10.x seconds.
(screenshot)

@MuchToMyDelight
Contributor Author

@milianw
Compared to #574, this PR achieves the same improvement with very little change.
On the other hand, making the symbol struct read-only also makes sense.

@GitMensch
Contributor

That's quite nice, I'd prefer this one (but I'm only a mere user).

@MuchToMyDelight
Contributor Author

> That's quite nice, I'd prefer this one (but I'm only a mere user).

Making symbols read-only has some advantages as well; the first is that parallel processing becomes much easier.

@GitMensch
Contributor

@lievenhey can you please approve the workflow run? That would allow me to easily test the results.

@milianw
Member

milianw commented Dec 13, 2023

this would break aggregation for inlined symbols, no? by including the relative address we cannot aggregate the same function appearing at different locations.

we really do want to do symbol based aggregation here after all... can you share some more information on how to reproduce this issue please? for example, why do you suffer so much from the string comparison, do you have so many symbols of the same size that share a long prefix?

@MuchToMyDelight
Contributor Author

> this would break aggregation for inlined symbols, no? by including the relative address we cannot aggregate the same function appearing at different locations.
>
> we really do want to do symbol based aggregation here after all... can you share some more information on how to reproduce this issue please? for example, why do you suffer so much from the string comparison, do you have so many symbols of the same size that share a long prefix?

Hello @milianw! This shouldn't break anything, since I didn't add relAddr to the comparison, I only reordered it. In the previous implementation the relative address was already part of the comparison: if symbol a == symbol b, then their symbol/binary/path/relAddr are all compared. I just moved relAddr to the front to speed things up.
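A minimal sketch of the reordering being described, using the std::tie-based comparison style mentioned later in this thread; the field names are taken from this conversation and simplified, not copied from hotspot's Data::Symbol:

```cpp
#include <QString>
#include <tuple>

struct Symbol
{
    QString symbol;
    QString binary;
    QString path;
    quint64 relAddr = 0;
};

// Before: the string members are compared first, so every lookup pays for
// QString comparisons even when the relative addresses already differ.
bool equalsOld(const Symbol& lhs, const Symbol& rhs)
{
    return std::tie(lhs.symbol, lhs.binary, lhs.path, lhs.relAddr)
        == std::tie(rhs.symbol, rhs.binary, rhs.path, rhs.relAddr);
}

// After: the cheap integer comes first; the strings are only touched when
// the relative addresses already match. The member layout stays untouched.
bool equalsNew(const Symbol& lhs, const Symbol& rhs)
{
    return std::tie(lhs.relAddr, lhs.symbol, lhs.binary, lhs.path)
        == std::tie(rhs.relAddr, rhs.symbol, rhs.binary, rhs.path);
}
```

Because std::tuple's operator== compares elements in order and stops at the first mismatch, the QString comparisons are skipped whenever the addresses already differ.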

I'm doing system-wide profiling over periods of 10 seconds to 1 minute, and the perf.data is roughly 200MB. I haven't examined the characteristics of these string comparisons, but I don't think this is an extreme case; it's actually quite common for people doing system-wide profiling.

@milianw
Member

milianw commented Dec 14, 2023

> This shouldn't break anything, since I didn't add relAddr to the comparison, I only reordered it.

Oh boy, sorry for my misunderstanding yesterday, I blame my tiredness - it was late! You are totally right, this won't change anything.

Can you please do me a favor: do not reorder the members in memory, only change the order of comparison by moving the numeric entries up front in the std::tie? Then this can go in!

I am still mystified why you get such large costs in symbol string comparison though... With qDebug/debugging/etc., can you get a feeling for how many strings are being checked here? A QString operator== should be plenty fast, assuming the strings actually differ at some point. But your case reminds me of a scenario where (in a different project) I compared strings that were large and shared a long prefix and only differed in the last few bytes... Is that the situation for you? Maybe lots of functions returning e.g. std::string, which gets expanded and then makes up the prefix for all these symbols, or something like that?

But one way or another, your suggestion of reordering operator== is really good!

Thanks

@MuchToMyDelight
Contributor Author

Tested three versions: 1. base: origin/master, 2. reorder .relAddr in operator== only, 3. reorder both operator== and the data members
(screenshot)
Between 2 and 3 there is a difference, but it is not very big. The variance is bigger than the perf gap: the best result of version 2 (round 3: 6374.52ms) is almost as good as the best result of version 3 (round 4: 6363.06ms). To measure it accurately, a multiple-run micro-benchmark is needed (something like google/benchmark).
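For illustration, a minimal google/benchmark sketch of what such a multi-run micro-benchmark could look like; the Symbol struct below is a simplified assumption for this example, not the real hotspot type:

```cpp
#include <benchmark/benchmark.h>
#include <QString>
#include <tuple>

struct Symbol
{
    QString symbol;
    QString binary;
    QString path;
    quint64 relAddr = 0;
};

// Two unresolved symbols ("??") that differ only in relAddr, mimicking the
// system-wide profile described in this thread.
static Symbol makeSymbol(quint64 relAddr)
{
    return {QStringLiteral("??"), QStringLiteral("??"), QString(), relAddr};
}

static void compareStringsFirst(benchmark::State& state)
{
    const Symbol a = makeSymbol(0x1000);
    const Symbol b = makeSymbol(0x2000);
    for (auto _ : state) {
        benchmark::DoNotOptimize(std::tie(a.symbol, a.binary, a.path, a.relAddr)
                                 == std::tie(b.symbol, b.binary, b.path, b.relAddr));
    }
}
BENCHMARK(compareStringsFirst);

static void compareRelAddrFirst(benchmark::State& state)
{
    const Symbol a = makeSymbol(0x1000);
    const Symbol b = makeSymbol(0x2000);
    for (auto _ : state) {
        benchmark::DoNotOptimize(std::tie(a.relAddr, a.symbol, a.binary, a.path)
                                 == std::tie(b.relAddr, b.symbol, b.binary, b.path));
    }
}
BENCHMARK(compareRelAddrFirst);

BENCHMARK_MAIN();
```

Since a and b differ only in the address, the second variant never has to touch the strings at all.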

I believe reordering the data members does no harm, and in theory it should be faster: the struct Symbol is not serialized or deserialized, which means there are no data/ABI compatibility concerns. Also, changing it doesn't break anything at the source level.

In terms of why there are so many string comparisons, I think the reason is that quite a few symbols are unresolved and there is no information about their binary path. In the screenshot below, the top symbols are ??.
(screenshot)

I noticed that even with this PR the time is still quite long, because most of the work is single-threaded. I found there is a TODO to parallelize it in filterResults. Is there anything blocking that? If not, I could work on another PR to improve it.

@GitMensch
Contributor

Thank you for your tests.

I've also seen considerably more time needed when more debug symbols are missing. Maybe it would be possible to add a simple boolean missingDebuginfo and use it as a shortcut.

Is the string comparison done in the adjusted function or in other places?

It would be quite nice if you find the time to check the option to do that in parallel.

@MuchToMyDelight
Contributor Author

Checking missingDebuginfo would complicate things, since comparing an integer (relAddr) first costs very little and works for both the missing and non-missing cases. In terms of efficiency, checking a boolean and comparing two integers cost almost the same. So a shortcut based on an extra flag makes the code more complex without making it faster.

Yes, all the string comparisons happen in filterResults -> "add event data to cpus, bottom up and caller callee sets" -> addEvent -> entryForSymbol.
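As a purely hypothetical illustration (not the hotspot sources) of why that call chain makes Symbol::operator== so hot: an entryForSymbol-style lookup probes already-collected entries with symbol equality for every incoming sample, so the comparison runs roughly samples x probed-entries times:

```cpp
#include <QString>
#include <QVector>

struct Symbol
{
    QString symbol;      // expensive to compare when many entries look alike
    QString binary;
    QString path;
    quint64 relAddr = 0; // cheap to compare, so it is checked first below

    bool operator==(const Symbol& rhs) const
    {
        return relAddr == rhs.relAddr && symbol == rhs.symbol
            && binary == rhs.binary && path == rhs.path;
    }
};

struct Entry
{
    Symbol symbol;
    quint64 cost = 0;
};

// Every sample probes the entries collected so far, so the per-comparison
// cost is multiplied by the number of samples times the entries probed.
Entry& entryForSymbol(QVector<Entry>& entries, const Symbol& symbol)
{
    for (Entry& entry : entries) {
        if (entry.symbol == symbol)
            return entry;
    }
    entries.append({symbol, 0});
    return entries.last();
}
```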

I use hotspot a lot and it's a fantastic tool, easy and convenient to use. Also, releasing an AppImage is a brilliant idea! I've used it on various Linux distributions and even WSL. It has saved me a lot of time and effort! Let's merge this PR, and I'll find the time to submit another one for the parallelization.

@milianw
Member

milianw commented Dec 18, 2023

Sorry, but I'm still of the opinion that the data members don't need to be reordered. I dislike that the offset + size are now not contiguous in memory anymore. As your measurement has shown - the data layout in memory doesn't matter for this - so can you please follow my subjective choice and only change the behavior/order in operator==?

Regarding the multithreading: Nothing should really block this, outside of the inherent complexity in parallelizing stuff. You are more than welcome in giving that a go!

@milianw
Member

milianw commented Dec 18, 2023

And I realize I forgot to write one more thing: Missing debug symbols shouldn't in theory explain this. They have empty symbols after all, which are fast to compare. That then leaves the DSO name, which should only contain the library name which should be easy and fast to compare.

Are we maybe breaking / not leveraging the COW/implicit sharing nature of same strings for the DSO name? I would really like to get my hands on such a perf.data file, can you share it with me maybe?
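One possible direction for the implicit-sharing question, offered only as an assumption about a fix and not as how hotspot currently works: funnel the DSO names through a small interning pool so that equal names keep sharing one QString data block:

```cpp
#include <QSet>
#include <QString>

// Hypothetical helper: every DSO name is funnelled through intern(), so all
// equal names share the same QString data (Qt implicit sharing) instead of
// carrying independent copies around.
class StringInterner
{
public:
    QString intern(const QString& name)
    {
        // QSet::insert() returns an iterator to the stored element; copying
        // that element shares its data with every other interned copy.
        return *m_pool.insert(name);
    }

private:
    QSet<QString> m_pool;
};
```

This mainly helps when the hot comparisons are between names that are actually equal; genuinely different names still have to be compared character by character at some point.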

When changing to "aggregate cost by symbol", filterResults spends a lot of time doing string comparisons. This patch reduces it dramatically.

Tested on a 240MB perf.data with a 6-core CPU. The time is reduced from 50.x seconds to 19.x seconds.

Signed-off-by: Jason L <5919474+MuchToMyDelight@users.noreply.github.com>
@MuchToMyDelight
Contributor Author

Updated the PR to reorder the comparison only.

Regarding the perf.data, it can be regenerated like this: build hotspot itself in parallel (any big repo should be fine, I use hotspot itself as an example), and in the meantime run perf: "sudo perf record -ag -- sleep 30" to generate the perf.data. I didn't set anything special, so it should be easy to reproduce.

@milianw
Member

milianw commented Dec 19, 2023

thanks, I'll see if I can find the time to replicate this setup. just to confirm: are you also on a distro like mine where framepointers are disabled? meaning the above command is broken as it will try to use that and then just create totally bogus backtraces.

or are you on e.g. the latest fedora with framepointers?

@milianw milianw merged commit e4fa78b into KDAB:master Dec 19, 2023
12 checks passed
@MuchToMyDelight
Contributor Author

Yes, I think frame pointers are omitted, since I'm using Arch and didn't change the default settings. I guess the latest Fedora is the only one with frame pointers?
