std.sort.PartitionPoint: Improve performance #21419
Conversation
No problem at all, I can take a look tomorrow (like T-12 hours) :)
I'm not sure how to summarize the speedup in one number in a way that gives an accurate impression. Right now it's printing the geometric mean of old_time/new_time over all the sizes, but I'm pretty sure that's almost completely meaningless.
Given the nature of this function, I recommend creating three benchmarks: one where the predicate becomes false at the first element, another where that point is in the middle of the list, and the last where it's at the end. I believe that would give the best impression of the performance change.
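As an illustration, here is a minimal sketch of such a benchmark in Zig, assuming the current `std.sort.partitionPoint` API. The predicate name `belowPivot`, the array size, the pivot positions, and the iteration count are all arbitrary choices for the sketch, not values from this PR:

```zig
const std = @import("std");

// Hypothetical predicate: true for all elements below the pivot, so on a
// sorted 0..n-1 slice the partition point is the pivot's own index.
fn belowPivot(pivot: u32, item: u32) bool {
    return item < pivot;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const n: u32 = 1 << 20;
    const items = try allocator.alloc(u32, n);
    defer allocator.free(items);
    for (items, 0..) |*item, i| item.* = @intCast(i);

    const iterations = 1000;
    // Partition point at the start, middle, and end of the slice.
    for ([_]u32{ 0, n / 2, n }) |pivot| {
        var timer = try std.time.Timer.start();
        var sink: usize = 0; // keeps the calls from being optimized away
        for (0..iterations) |_| {
            sink +%= std.sort.partitionPoint(u32, items, pivot, belowPivot);
        }
        const ns_per_call = timer.read() / iterations;
        std.debug.print("pivot {d}: ~{d} ns/call (sink {d})\n", .{ pivot, ns_per_call, sink });
    }
}
```

Note that this sketch reuses the same pivot every iteration, so branches stay predictable and the array stays cached, which is exactly the caveat discussed below.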
Seems to be around 30% slower if the branches are all perfectly predictable.
Could you post your benchmark? I don't believe that partitioning (you know what I mean) 10^8 items takes... 20 ns.
Of course! I'll update the repo. Just to be clear, that benchmark is for the same query over and over, which should be perfectly predictable
and which will be cached, since I'm not evicting the whole array after every query, as that makes the benchmark take too long.
(force-pushed from 4858a71 to 97b0288)
So I spent some time and rewrote your benchmarks a bit. You can find them here: https://github.com/Rexicon226/partition_point_optimization. I see that the performance of this new version is pretty consistently slower. I'll leave it up to you to figure out why and how to improve it :). I hope this benchmark helps.
Yes, it's because you are giving it the same input over and over, which means the branchy version is perfectly predictable.
Calling binary search with the same key in a loop seems to me to be farther from a typical use case than calling it with inputs that are distributed across the whole input slice.
That does not matter. It's flushed before running it.
My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.
Also the branch at the end.*
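For context, here is a minimal sketch of the branchless shape being described, assuming `u32` elements; it is not the exact code from this PR. The range is halved with arithmetic on the predicate result rather than a taken/not-taken branch, so the only branches left are the loop condition and the final comparison:

```zig
/// Sketch (not this PR's exact code) of a branchless partition point
/// for a predicate that is true on a prefix of `items`. Inside the loop,
/// `base` advances by `half * 0` or `half * 1`, so the body has no
/// data-dependent branch; it lowers to arithmetic or a conditional move.
fn branchlessPartitionPoint(
    items: []const u32,
    context: u32,
    comptime pred: fn (u32, u32) bool,
) usize {
    if (items.len == 0) return 0;
    var base: usize = 0;
    var len: usize = items.len;
    while (len > 1) { // the loop condition: the only branch in the loop
        const half = len / 2;
        base += half * @intFromBool(pred(context, items[base + half - 1]));
        len -= half;
    }
    // "the branch at the end": one final test on the last candidate
    return base + @intFromBool(pred(context, items[base]));
}
```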
The function scales linearly, so it doesn't matter. I am confident that the performance where the point is at
I'm not sure what you're arguing here. The branch predictor has little to do here, and the fact that the same input is used on multiple iterations affects almost nothing. Also, please note that you can edit messages on GitHub; there is no need to send three at a time.
@Rexicon226 apologies if I was rude or inconsiderate; could you take another look?
Sorry, you weren't rude at all. I am simply a bit busy with work at the moment and can't look deeply enough into your benchmarks to give a good reply yet :) I will certainly get back to this PR on the weekend!
Alright, sounds good!
Wrote a new version which runs a branchy implementation for a few iterations to recover performance on big arrays; a sketch of the idea follows.
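A hedged sketch of that hybrid idea (not the PR's actual code): run an ordinary branchy binary search for the first few halving steps, where speculative execution can overlap the cache misses that dominate on big arrays, then switch to the branchless loop once the remaining range is small. The cutoff of 4 steps is an arbitrary placeholder:

```zig
fn hybridPartitionPoint(
    items: []const u32,
    context: u32,
    comptime pred: fn (u32, u32) bool,
) usize {
    var base: usize = 0;
    var len: usize = items.len;

    // Branchy phase: on a large range every probe is likely a cache miss,
    // and a predicted branch lets the CPU speculate ahead and issue the
    // next load early. The cutoff of 4 steps is purely illustrative.
    var steps: usize = 0;
    while (len > 0 and steps < 4) : (steps += 1) {
        const half = len / 2;
        if (pred(context, items[base + half])) {
            base += half + 1;
            len -= half + 1;
        } else {
            len = half;
        }
    }

    // Branchless phase: the remaining range is small (and likely cached),
    // so avoiding mispredictions wins.
    while (len > 1) {
        const half = len / 2;
        base += half * @intFromBool(pred(context, items[base + half - 1]));
        len -= half;
    }
    if (len == 1) base += @intFromBool(pred(context, items[base]));
    return base;
}
```

Whether the branchy warm-up pays off depends on how the array size compares to the cache, so the cutoff would need tuning against real benchmarks.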
Updated the benchmark to have a percentage of queries be unpredictable instead of having all of them fall in the same range.
Perf on Intel i7 12700: (data at https://github.com/JonathanHallstrom/partition_point_optimization/tree/main/12700)
Perf on Ryzen 5 5600X: (data at https://github.com/JonathanHallstrom/partition_point_optimization/tree/main/5600x)
Fairly large speedup for small sizes compared to the old implementation on my machine, and basically the same for large arrays. More benchmarking is needed, though, and any feedback is welcome.
@Rexicon226 could you take a look? (Sorry for the ping, but you were very helpful last time I made a PR.)
[Plot of performance for u32s at time of writing on my machine]