std.sort.PartitionPoint: Improve performance #21419

Draft · wants to merge 1 commit into base: master

Conversation

@JonathanHallstrom (Contributor) commented Sep 15, 2024

Fairly large speedup for small sizes compared to the old implementation on my machine, and basically the same for large arrays. More benchmarking is needed, though, and any feedback is welcome.
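
For context, a minimal sketch of the branchless shape being discussed, assuming a plain key comparison rather than `std.sort.partitionPoint`'s context/predicate API (a sketch, not the commit's code):

```zig
const std = @import("std");

// Branchless lower bound: the loop always runs ~log2(n) iterations, and the
// compiler can usually lower the conditional to a conditional move, so there
// is no data-dependent branch for the predictor to guess.
fn lowerBoundBranchless(comptime T: type, items: []const T, key: T) usize {
    var base: usize = 0;
    var len: usize = items.len;
    while (len > 1) {
        const half = len / 2;
        if (items[base + half - 1] < key) base += half;
        len -= half;
    }
    return base + @intFromBool(len == 1 and items[base] < key);
}

test "lower bound on a small example" {
    const items = [_]u32{ 1, 3, 5, 7, 9 };
    try std.testing.expectEqual(@as(usize, 3), lowerBoundBranchless(u32, &items, 6));
}
```

The trade-off discussed below follows directly from this shape: the loop does the same number of probes whether or not the answer is easy to predict.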

@Rexicon226 Could you take a look? (Sorry for the ping, but you were very helpful last time I made a PR.)

Plot of performance for u32s at the time of writing, on my machine:
[image]

@Rexicon226 (Contributor) commented Sep 15, 2024

> @Rexicon226 Could you take a look? (Sorry for the ping, but you were very helpful last time I made a PR.)

No problem at all, I can take a look tomorrow (like T-12 hours) :)

@JonathanHallstrom (Contributor, Author)

I'm not sure how to summarize the speedup in one number in a way that gives an accurate impression. Right now it's printing the geometric mean of old_time/new_time over all the sizes, but I'm pretty sure that's almost completely meaningless.
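
For reference, the statistic being printed is the geometric mean of the per-size ratios r_i = old_time_i / new_time_i, i.e. (r_1 · r_2 · … · r_n)^(1/n). A sketch of how it might be computed (the function is illustrative, not from the benchmark repo):

```zig
// Geometric mean of per-size speedup ratios (old_time / new_time), computed
// in log space to avoid overflow when multiplying many ratios together.
fn geometricMean(ratios: []const f64) f64 {
    var log_sum: f64 = 0;
    for (ratios) |r| log_sum += @log(r);
    return @exp(log_sum / @as(f64, @floatFromInt(ratios.len)));
}
```

It weights every size equally, which is exactly why it says little unless the benchmarked sizes match real workloads.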

@JonathanHallstrom (Contributor, Author) commented Sep 15, 2024

[image: graph]

New graph; same code being benchmarked, just better benchmarking and plotting (fewer bugs).

@Rexicon226 (Contributor)

> I'm not sure how to summarize the speedup in one number in a way that gives an accurate impression. Right now it's printing the geometric mean of old_time/new_time over all the sizes, but I'm pretty sure that's almost completely meaningless.

Given the nature of this function, I recommend creating three benchmarks: one where the predicate becomes false at the first element, another where that item is in the middle of the list, and the last where it's at the end.

I believe that would give the best impression of the performance change.
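
For illustration, a sketch of those three cases against `std.sort.partitionPoint`, assuming sorted, distinct `u32` items and an `item < threshold` predicate (the helper and its names are not from the PR's benchmark):

```zig
const std = @import("std");

// Three partition-point positions: start, middle, and end of the slice.
// Assumes `items` is sorted ascending with distinct values.
fn runCases(items: []const u32) [3]usize {
    const n = items.len;
    const thresholds = [3]u32{
        items[0], // predicate false at the first element -> point 0
        items[n / 2], // predicate becomes false in the middle -> point n/2
        items[n - 1] + 1, // predicate never false -> point n (sketch: assumes no overflow)
    };
    var points: [3]usize = undefined;
    for (thresholds, 0..) |threshold, i| {
        points[i] = std.sort.partitionPoint(u32, items, threshold, struct {
            fn lessThan(ctx: u32, item: u32) bool {
                return item < ctx;
            }
        }.lessThan);
    }
    return points;
}
```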

@JonathanHallstrom (Contributor, Author)

Queries uniformly distributed over the whole range:
[image]

Queries uniformly distributed over the first 16 elements of the range:
[image]

@JonathanHallstrom (Contributor, Author)

Queries all one past the end of the array:
[image]

@JonathanHallstrom (Contributor, Author)

Seems to be around 30% slower if the branches are all perfectly predictable.

@Rexicon226 (Contributor) commented Sep 16, 2024

Could you post your benchmark? I don't believe that partitioning (you know what I mean) 10^8 items takes... 20 ns.

@JonathanHallstrom (Contributor, Author)

Of course! I'll update the repo. And just to be clear, that benchmark is for the same query over and over, which should be perfectly predictable.

@JonathanHallstrom (Contributor, Author)

> Of course! I'll update the repo. And just to be clear, that benchmark is for the same query over and over, which should be perfectly predictable.

…and which will be cached, since I'm not evicting the whole array after every query, as that makes the benchmark take too long.

@JonathanHallstrom (Contributor, Author) commented Sep 17, 2024

The force push was to pull in 812557b, which seemed relevant; CI should pass again.

@Rexicon226 (Contributor)

So I spent some time and rewrote your benchmarks a bit. You can find the result here: https://github.com/Rexicon226/partition_point_optimization

[image]

I see that the performance of this new version is pretty consistently slower. I'll leave it up to you to figure out why and how to improve it :). I hope this benchmark helps.

@JonathanHallstrom (Contributor, Author)

Yes, it's because you're giving it the same input over and over, which means the branchy version is perfectly predictable.

@JonathanHallstrom (Contributor, Author)

Calling binary search with the same key in a loop seems to me to be farther from a typical use case than calling it with inputs that are distributed across the whole input slice.

@Rexicon226 (Contributor)

> Yes, it's because you're giving it the same input over and over, which means the branchy version is perfectly predictable.

That does not matter. It's flushed before running it.

@JonathanHallstrom (Contributor, Author)

My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.

@JonathanHallstrom (Contributor, Author)

> My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.

Also the branch at the end.*

@Rexicon226 (Contributor)

> Calling binary search with the same key in a loop seems to me to be farther from a typical use case than calling it with inputs that are distributed across the whole input slice.

The function scales linearly, so it doesn't matter. I am confident that the performance where the point is at N / 2 will be extremely similar to that at (N / 2) - 1. You can add more data points along the slice if you'd like, but I'm sure it'll give you the same answer.

> My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.

I'm not sure what you're arguing here. The branch predictor has little to do here, and the fact that the same input is used on multiple iterations affects almost nothing.

Also, please note that you can edit messages on GitHub; there is no need to send three at a time.

@JonathanHallstrom (Contributor, Author) commented Sep 17, 2024

> Also, please note that you can edit messages on GitHub; there is no need to send three at a time.

Sorry about the multiple messages; I should've taken more time to write out a full response.

> The function scales linearly, so it doesn't matter.

Not sure what you mean; I'm probably missing something here. Could you elaborate?

>> My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.
>
> I'm not sure what you're arguing here. The branch predictor has little to do here, and the fact that the same input is used on multiple iterations affects almost nothing.

The branch predictor having little to do in your benchmark is exactly my point: I don't think consecutive queries should be identical in the benchmark. If consecutive queries are identical, then I agree that my implementation is much slower (I measured ~30%, as noted earlier).

If the queries instead all fall randomly within a small range, I would expect the branch predictor to struggle to predict the last few branches, leading to a slowdown. This is supported by my benchmark with the queries uniformly distributed over the first 16 elements showing a speedup.

To see more clearly the effect of more and more branches becoming unpredictable in the old implementation, I ran a series of benchmarks with an increasing range of keys being queried.
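
(For reference, a sketch of how such queries can be drawn, assuming the haystack is a sorted array of distinct integers, so that keys uniform on [0, 2^k) leave exactly k hard-to-predict decisions; the names are illustrative.)

```zig
const std = @import("std");

// Draw a query uniformly from [0, 2^k).
fn randomQuery(rng: std.Random, k: u5) u32 {
    return rng.uintLessThan(u32, @as(u32, 1) << k);
}

pub fn main() void {
    var prng = std.Random.DefaultPrng.init(0);
    const rng = prng.random();
    // k = 0 reproduces the [0, 1) case below; k = 4 the [0, 16) case.
    std.debug.print("sample query: {}\n", .{randomQuery(rng, 4)});
}
```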

Queries on [0, 1) (no unpredictable branches):
[image]

Queries on [0, 2) (one unpredictable branch):
[image]

Queries on [0, 4) (two unpredictable branches):
[image]

Queries on [0, 8) (three unpredictable branches):
[image]

Queries on [0, 16) (four unpredictable branches):
[image]

The pattern seems pretty clear to me: the old version gets a bit slower with each doubling, but the new branchless version stays the same. From a memory perspective all these benchmarks should be identical, so I think the differences come down to the branch predictor struggling successively more on the old version.

@JonathanHallstrom (Contributor, Author) commented Sep 19, 2024

@Rexicon226 Apologies if I was rude or inconsiderate; could you take another look?

@Rexicon226 (Contributor)

> @Rexicon226 Apologies if I was rude or inconsiderate; could you take another look?

Sorry, you weren't rude at all. I am simply a bit busy with work at the moment and can't look deeply enough into your benchmarks to give a good reply yet :) I will certainly get back to this PR over the weekend!

@JonathanHallstrom (Contributor, Author)

Alright, sounds good!

@JonathanHallstrom (Contributor, Author)

Plot from a Ryzen 5 5600X:
[image]

Going to try implementing prefetching / doing a branchy solution for the first few iterations to see if I can recover the performance on the largest arrays.

@JonathanHallstrom (Contributor, Author) commented Sep 21, 2024

https://github.com/JonathanHallstrom/partition_point_optimization/blob/70d340ffd547dac0504e5c5d71da2ec34572d33b/src/main.zig

Wrote a new version which runs a branchy implementation for a few iterations to recover performance on big arrays.
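
A sketch of that hybrid shape, with the branchless loop from the earlier sketch inlined as the second phase (key-based lower bound; the 4-step cutoff is illustrative, and the prefetching idea is omitted):

```zig
fn lowerBoundHybrid(comptime T: type, items: []const T, key: T) usize {
    var lo: usize = 0;
    var hi: usize = items.len;
    // Phase 1: a few classic, data-dependent (branchy) bisection steps.
    // On huge arrays these early probes are dominated by cache misses anyway.
    var steps: usize = 4; // illustrative cutoff
    while (hi - lo > 1 and steps > 0) : (steps -= 1) {
        const mid = lo + (hi - lo) / 2;
        if (items[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    // Phase 2: finish branchlessly on the remaining window (fixed trip
    // count; the conditional can be lowered to a conditional move).
    var base: usize = lo;
    var len: usize = hi - lo;
    while (len > 1) {
        const half = len / 2;
        if (items[base + half - 1] < key) base += half;
        len -= half;
    }
    return base + @intFromBool(len == 1 and items[base] < key);
}
```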

Performance on an AMD Ryzen 5 5600X:
[image: graph7]

Performance on an Intel i7-12700:
[image: graph7]

@JonathanHallstrom (Contributor, Author)

Updated the benchmark to have a percentage of queries be unpredictable, instead of having all of them fall in the same range.
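
(A sketch of that mixing, assuming the predictable queries reuse one fixed key; the names and parameters are illustrative.)

```zig
const std = @import("std");

// With probability `p`, draw an unpredictable key uniformly from
// [0, max_key); otherwise return a fixed, perfectly predictable key.
fn mixedQuery(rng: std.Random, p: f32, fixed_key: u32, max_key: u32) u32 {
    if (rng.float(f32) < p) return rng.uintLessThan(u32, max_key);
    return fixed_key;
}
```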

Perf on an Intel i7-12700 (data at https://github.com/JonathanHallstrom/partition_point_optimization/tree/main/12700):
[image: perf_12700]

Perf on a Ryzen 5 5600X (data at https://github.com/JonathanHallstrom/partition_point_optimization/tree/main/5600x):
[image: perf_5600x]
