[enhancement] cut 60% time for PerfParser::filterResults #574

Merged
1 commit merged into KDAB:master on Dec 19, 2023

Conversation

MuchToMyDelight
Contributor

When changing to "aggregate cost by symbol", filterResults spends a lot of time doing string comparisons. This patch reduces it dramatically.

The base version (origin/master): notice that operator== takes 50.6% of the time; total time is 24.x seconds.
(screenshot)

The optimized version: the cost of operator== is gone, and the total time is reduced to 10.x seconds.
(screenshot)

@MuchToMyDelight
Contributor Author

@milianw
Compared to #574, this PR achieves the same improvement with very little change.
On the other hand, making the symbol struct read-only also makes sense.

@GitMensch
Contributor

That's quite nice, I'd prefer this one (but I'm only a mere user).

@MuchToMyDelight
Contributor Author

> That's quite nice, I'd prefer this one (but I'm only a mere user).

Making symbols read-only has some advantages as well; the first is that parallel processing becomes much easier.

@GitMensch
Contributor

@lievenhey can you please approve the workflow run? That would allow me to easily test the results.

@milianw
Member

milianw commented Dec 13, 2023

this would break aggregation for inlined symbols, no? by including the relative address we cannot aggregate the same function appearing at different locations.

we really do want to do symbol based aggregation here after all... can you share some more information on how to reproduce this issue please? for example, why do you suffer so much from the string comparison, do you have so many symbols of the same size that share a long prefix?

@MuchToMyDelight
Contributor Author

> this would break aggregation for inlined symbols, no? by including the relative address we cannot aggregate the same function appearing at different locations.
>
> we really do want to do symbol based aggregation here after all... can you share some more information on how to reproduce this issue please? for example, why do you suffer so much from the string comparison, do you have so many symbols of the same size that share a long prefix?

Hello @milianw! This shouldn't break anything, since I didn't add relAddr to the comparison, I only reordered it. In the previous implementation the relative address was already part of the comparison: if symbol a == symbol b, then their symbol/binary/path/relAddr are all compared. I just moved relAddr to the front to speed things up.
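A minimal sketch of the reordering being described, using the std::tie-based comparison style mentioned later in this thread; the field names are taken from this conversation and simplified, not copied from hotspot's Data::Symbol:

```cpp
#include <QString>
#include <tuple>

struct Symbol
{
    QString symbol;
    QString binary;
    QString path;
    quint64 relAddr = 0;
};

// Before: the string members are compared first, so every lookup pays for
// QString comparisons even when the relative addresses already differ.
bool equalsOld(const Symbol& lhs, const Symbol& rhs)
{
    return std::tie(lhs.symbol, lhs.binary, lhs.path, lhs.relAddr)
        == std::tie(rhs.symbol, rhs.binary, rhs.path, rhs.relAddr);
}

// After: the cheap integer comes first; the strings are only touched when
// the relative addresses already match. The member layout stays untouched.
bool equalsNew(const Symbol& lhs, const Symbol& rhs)
{
    return std::tie(lhs.relAddr, lhs.symbol, lhs.binary, lhs.path)
        == std::tie(rhs.relAddr, rhs.symbol, rhs.binary, rhs.path);
}
```

Because std::tuple's operator== compares elements in order and stops at the first mismatch, the QString comparisons are skipped whenever the addresses already differ.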

I'm doing system-wide profiling over periods of 10 seconds to 1 minute, and the perf.data is roughly 200MB. I haven't examined the characteristics of these string comparisons, but I don't think this is an extreme case; it's actually quite common for people doing system-wide profiling.

@milianw
Member

milianw commented Dec 14, 2023

> This shouldn't break anything, since I didn't add relAddr to the comparison, I only reordered it.

Oh boy, sorry for my misunderstanding yesterday, I blame my tiredness - it was late! You are totally right, this won't change anything.

Can you please do me a favor: do not reorder the members in memory, only change the order of comparison by moving the numeric entries up front in the std::tie? Then this can go in!

I am still mystified why you get such large costs in symbol string comparison though... With qDebug/debugging/etc., can you get a feeling for how many strings are being checked here? A QString operator== should be plenty fast, assuming the strings actually differ at some point. But your case reminds me of a scenario where (in a different project) I compared strings that were large and shared a long prefix and only differed in the last few bytes... Is that the situation for you? Maybe lots of functions returning e.g. std::string, which gets expanded and then makes up the prefix for all these symbols, or something like that?

But one way or another, your suggestion of reordering operator== is really good!

Thanks

@MuchToMyDelight
Contributor Author

Tested three versions: 1. base: origin/master, 2. reorder .relAddr in operator== only, 3. reorder both operator== and the data members
(screenshot)
Between 2 and 3 there is a difference, but it is not very big. The variance is bigger than the perf gap: the best result of version 2 (round 3: 6374.52ms) is almost as good as the best result of version 3 (round 4: 6363.06ms). To measure it accurately, a multiple-run micro-benchmark is needed (something like google/benchmark).
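For illustration, a minimal google/benchmark sketch of what such a multi-run micro-benchmark could look like; the Symbol struct below is a simplified assumption for this example, not the real hotspot type:

```cpp
#include <benchmark/benchmark.h>
#include <QString>
#include <tuple>

struct Symbol
{
    QString symbol;
    QString binary;
    QString path;
    quint64 relAddr = 0;
};

// Two unresolved symbols ("??") that differ only in relAddr, mimicking the
// system-wide profile described in this thread.
static Symbol makeSymbol(quint64 relAddr)
{
    return {QStringLiteral("??"), QStringLiteral("??"), QString(), relAddr};
}

static void compareStringsFirst(benchmark::State& state)
{
    const Symbol a = makeSymbol(0x1000);
    const Symbol b = makeSymbol(0x2000);
    for (auto _ : state) {
        benchmark::DoNotOptimize(std::tie(a.symbol, a.binary, a.path, a.relAddr)
                                 == std::tie(b.symbol, b.binary, b.path, b.relAddr));
    }
}
BENCHMARK(compareStringsFirst);

static void compareRelAddrFirst(benchmark::State& state)
{
    const Symbol a = makeSymbol(0x1000);
    const Symbol b = makeSymbol(0x2000);
    for (auto _ : state) {
        benchmark::DoNotOptimize(std::tie(a.relAddr, a.symbol, a.binary, a.path)
                                 == std::tie(b.relAddr, b.symbol, b.binary, b.path));
    }
}
BENCHMARK(compareRelAddrFirst);

BENCHMARK_MAIN();
```

Since a and b differ only in the address, the second variant never has to touch the strings at all.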

I believe reordering the data members does no harm, and in theory it should be faster: the struct Symbol is not serialized or deserialized, which means there are no data/ABI compatibility concerns. Also, changing it doesn't break anything at the source level.

In terms of why there are so many string comparisons, I think the reason is that quite a few symbols are unresolved and there is no information about their binary path. In the screenshot below, the top symbols are ??.
(screenshot)

I noticed that even with this PR the time is still quite long, because most of the work is single-threaded. I found there is a TODO to parallelize it in filterResults. Is there anything blocking that? If not, I could work on another PR to improve it.

@GitMensch
Contributor

Thank you for your tests.

I've also seen considerably more time needed when more debug symbols are missing. Maybe it would be possible to add a simple boolean missingDebuginfo and use it as a shortcut.

Is the string comparison done in the adjusted function or in other places?

It would be quite nice if you find the time to check the option to do that in parallel.

@MuchToMyDelight
Contributor Author

Checking missingDebuginfo would complicate things, since comparing an integer (relAddr) first costs very little and works for both the missing and non-missing cases. In terms of efficiency, checking a boolean and comparing two integers cost almost the same. So a shortcut based on an extra flag makes the code more complex without making it faster.

Yes, all the string comparisons happen in filterResults -> "add event data to cpus, bottom up and caller callee sets" -> addEvent -> entryForSymbol.
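As a purely hypothetical illustration (not the hotspot sources) of why that call chain makes Symbol::operator== so hot: an entryForSymbol-style lookup probes already-collected entries with symbol equality for every incoming sample, so the comparison runs roughly samples x probed-entries times:

```cpp
#include <QString>
#include <QVector>

struct Symbol
{
    QString symbol;      // expensive to compare when many entries look alike
    QString binary;
    QString path;
    quint64 relAddr = 0; // cheap to compare, so it is checked first below

    bool operator==(const Symbol& rhs) const
    {
        return relAddr == rhs.relAddr && symbol == rhs.symbol
            && binary == rhs.binary && path == rhs.path;
    }
};

struct Entry
{
    Symbol symbol;
    quint64 cost = 0;
};

// Every sample probes the entries collected so far, so the per-comparison
// cost is multiplied by the number of samples times the entries probed.
Entry& entryForSymbol(QVector<Entry>& entries, const Symbol& symbol)
{
    for (Entry& entry : entries) {
        if (entry.symbol == symbol)
            return entry;
    }
    entries.append({symbol, 0});
    return entries.last();
}
```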

I use hotspot a lot and it's a fantastic tool, easy and convenient to use. Also, releasing an AppImage is a brilliant idea! I've used it on various Linux distributions and even WSL. It has saved me a lot of time and effort! Let's merge this PR, and I'll find the time to submit another one for the parallelization.

@milianw
Member

milianw commented Dec 18, 2023

Sorry, but I'm still of the opinion that the data members don't need to be reordered. I dislike that the offset + size are now not contiguous in memory anymore. As your measurement has shown - the data layout in memory doesn't matter for this - so can you please follow my subjective choice and only change the behavior/order in operator==?

Regarding the multithreading: Nothing should really block this, outside of the inherent complexity in parallelizing stuff. You are more than welcome in giving that a go!

@milianw
Member

milianw commented Dec 18, 2023

And I realize I forgot to write one more thing: Missing debug symbols shouldn't in theory explain this. They have empty symbols after all, which are fast to compare. That then leaves the DSO name, which should only contain the library name which should be easy and fast to compare.

Are we maybe breaking / not leveraging the COW/implicit sharing nature of same strings for the DSO name? I would really like to get my hands on such a perf.data file, can you share it with me maybe?
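One possible direction for the implicit-sharing question, offered only as an assumption about a fix and not as how hotspot currently works: funnel the DSO names through a small interning pool so that equal names keep sharing one QString data block:

```cpp
#include <QSet>
#include <QString>

// Hypothetical helper: every DSO name is funnelled through intern(), so all
// equal names share the same QString data (Qt implicit sharing) instead of
// carrying independent copies around.
class StringInterner
{
public:
    QString intern(const QString& name)
    {
        // QSet::insert() returns an iterator to the stored element; copying
        // that element shares its data with every other interned copy.
        return *m_pool.insert(name);
    }

private:
    QSet<QString> m_pool;
};
```

This mainly helps when the hot comparisons are between names that are actually equal; genuinely different names still have to be compared character by character at some point.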

When changing to "aggregate cost by symbol", filterResults spends a lot of time doing string comparisons. This patch reduces it dramatically.

Tested on a 240MB perf.data with a 6-core CPU. The time is reduced from 50.x seconds to 19.x seconds.

Signed-off-by: Jason L <5919474+MuchToMyDelight@users.noreply.github.com>
@MuchToMyDelight
Contributor Author

Updated the PR to reorder the comparison only.

Regarding the perf.data, it can be regenerated like this: build hotspot itself in parallel (any big repo should be fine, I use hotspot itself as an example), and in the meantime run perf: "sudo perf record -ag -- sleep 30" to generate the perf.data. I didn't set anything special, so it should be easy to reproduce.

@milianw
Member

milianw commented Dec 19, 2023

thanks, I'll see if I can find the time to replicate this setup. just to confirm: are you also on a distro like mine where framepointers are disabled? meaning the above command is broken as it will try to use that and then just create totally bogus backtraces.

or are you on e.g. the latest fedora with framepointers?

@milianw milianw merged commit e4fa78b into KDAB:master Dec 19, 2023
12 checks passed
@MuchToMyDelight
Contributor Author

Yes, I think frame pointers are omitted, since I'm using Arch and didn't change the default settings. I guess the latest Fedora is the only one with frame pointers?
