-
Notifications
You must be signed in to change notification settings - Fork 255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[enhancement] cut 60% time for PerfParser::filterResults #574
Conversation
bb770a3
to
2225c74
Compare
That's quite nice, I'd prefer this one (but I'm only a mere user). |
Making symbols read-only has some advantages as well, the first one is that parallel processing becomes much easier. |
@lievenhey can you please approve the workflow run? That would allow me to easily test the results. |
this would break aggregation for inlined symbols, no? by including the relative address we cannot aggregate the same function appearing at different locations. we really do want to do symbol based aggregation here after all... can you share some more information on how to reproduce this issue please? for example, why do you suffer so much from the string comparison, do you have so many symbols of the same size that share a long prefix? |
Hello @milianw ! This shouldn't break anything since I didn't add realAddr to the comparison but reorder it. In previous implementation the relative address is already used in the comparison: if symbol a == symbol b, then their symbol/binary/path/relAddr should all be compared. I just reorder it to the first to speed it up. I'm doing system-wide profiling in the period of 10s to 1 minute and the perf.data is roughly 200MB. I haven't checked the features of these string comparison, but I think this is not a extreme case, actually I think it's quite common for people doing system-wide profiling. |
Oh boy, sorry for my misunderstanding yesterday, I blame my tiredness - it was late! You are totally right, this won't change anything. Can you please do me a favor: do not reorder the members in memory, only change the order of comparison by moving the numeric entries up front in the I still am mystified why you get such large costs in symbol string comparison though... With qdebug/debugging/etc. - can you get a feeling for how many strings are being checked here? A But one way or another, your suggestion of reordering Thanks |
Thank you for your tests. I've also seen quite more time needed if more debug symbols are missing. Maybe it would be possible to add a simple Is the string comparison done in the adjusted function or in other places? It would be quite nice if you find the time to check the option to do that in parallel. |
Checking missingDebuginfo would make things complicated since comparing an integer(relAddr) first costs very tiny and it works for both missing or not missing cases. In terms of efficiency, the cost of checking a boolean value and checking two integers are almost the same. So using a shortcut based on an extra flag makes code more complex and inefficient. Yes, all string comparison happens in filterReulsts -> "add event data to cpus, bottom up and caller callee sets" -> addEvent -> entryForSymbol. I use hotspot a lot and tt's a fantastic tool, easy & convenient to use. Aslo, releasing an .AppImage is a brilliant idea! I used it on various Linux distribution and even WSL. It saved me a lot time and effort! Let's merge this PR and I'll find the time to submit another one about the parallel issue. |
Sorry, but I'm still of the opinion that the data members don't need to be reordered. I dislike that the offset + size are now not contiguous in memory anymore. As your measurement has shown - the data layout in memory doesn't matter for this - so can you please follow my subjective choice and only change the behavior/order in Regarding the multithreading: Nothing should really block this, outside of the inherent complexity in parallelizing stuff. You are more than welcome in giving that a go! |
And I realize I forgot to write one more thing: Missing debug symbols shouldn't in theory explain this. They have empty symbols after all, which are fast to compare. That then leaves the DSO name, which should only contain the library name which should be easy and fast to compare. Are we maybe breaking / not leveraging the COW/implicit sharing nature of same strings for the DSO name? I would really like to get my hands on such a |
When change to "aggregate cost by symbol", filterResults spends a lot of time doing string comparison. This patch will reduce it dramatically. Test on a 240MB perf.data/6 core CPU. The time is reduced from 50.x seconds to 19.x seconds. Signed-off-by: Jason L <5919474+MuchToMyDelight@users.noreply.github.com>
2225c74
to
9262fee
Compare
updated as PR to reorder the the comparison only. Regarding the perf.data, it's can be re-generated like this: building the hotspot itself in parallel(any big repo should be fine, I use hotspot itself as an example), and in the meantime run perf: "sudo perf record -ag -- sleep 30" to generate the perf.data. I didn't set anything special, so it should be easy to reproduce it. |
thanks, I'll see if I can find the time to replicate this setup. just to confirm: are you also on a distro like mine where framepointers are disabled? meaning the above command is broken as it will try to use that and then just create totally bogus backtraces. or are you on e.g. the latest fedora with framepointers? |
Yes, I think frame pointers are omitted since I'm using Arch and doesn't change the default settings. I guess the latest fedora is the only one with frame pointers? |
When change to "aggregate cost by symbol", filterResults spends a lot of time doing string comparison. This patch will reduce it dramatically.
The base version: origin/master, notice that operator== takes 50.6% time. total time 24.x seconds
The optimized version: the cost of operator== has gone, the time is reduced to 10.x seconds