std.sort.PartitionPoint: Improve performance #21419

Draft · wants to merge 1 commit into base: master

Conversation

@JonathanHallstrom (Contributor) commented Sep 15, 2024

Fairly large speedup for small sizes compared to the old implementation on my machine, and basically the same for large arrays. More benchmarking is needed, though, and any feedback is welcome.
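
For context, a minimal sketch of the branchless shape being discussed, assuming a plain key comparison rather than `std.sort.partitionPoint`'s context/predicate API (a sketch, not the commit's code):

```zig
const std = @import("std");

// Branchless lower bound: the loop always runs ~log2(n) iterations, and the
// compiler can usually lower the conditional to a conditional move, so there
// is no data-dependent branch for the predictor to guess.
fn lowerBoundBranchless(comptime T: type, items: []const T, key: T) usize {
    var base: usize = 0;
    var len: usize = items.len;
    while (len > 1) {
        const half = len / 2;
        if (items[base + half - 1] < key) base += half;
        len -= half;
    }
    return base + @intFromBool(len == 1 and items[base] < key);
}

test "lower bound on a small example" {
    const items = [_]u32{ 1, 3, 5, 7, 9 };
    try std.testing.expectEqual(@as(usize, 3), lowerBoundBranchless(u32, &items, 6));
}
```

The trade-off discussed below follows directly from this shape: the loop does the same number of probes whether or not the answer is easy to predict.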

@Rexicon226 Could you take a look? (Sorry for the ping, but you were very helpful last time I made a PR.)

Plot of performance for u32s at the time of writing, on my machine:
[image]

@Rexicon226 (Contributor) commented Sep 15, 2024

> @Rexicon226 Could you take a look? (Sorry for the ping, but you were very helpful last time I made a PR.)

No problem at all, I can take a look tomorrow (like T-12 hours) :)

@JonathanHallstrom (Contributor, Author)

I'm not sure how to summarize the speedup in one number in a way that gives an accurate impression. Right now it's printing the geometric mean of old_time/new_time over all the sizes, but I'm pretty sure that's almost completely meaningless.
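
For reference, the statistic being printed is the geometric mean of the per-size ratios r_i = old_time_i / new_time_i, i.e. (r_1 · r_2 · … · r_n)^(1/n). A sketch of how it might be computed (the function is illustrative, not from the benchmark repo):

```zig
// Geometric mean of per-size speedup ratios (old_time / new_time), computed
// in log space to avoid overflow when multiplying many ratios together.
fn geometricMean(ratios: []const f64) f64 {
    var log_sum: f64 = 0;
    for (ratios) |r| log_sum += @log(r);
    return @exp(log_sum / @as(f64, @floatFromInt(ratios.len)));
}
```

It weights every size equally, which is exactly why it says little unless the benchmarked sizes match real workloads.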

@JonathanHallstrom (Contributor, Author) commented Sep 15, 2024

[image: graph]

New graph; same code being benchmarked, just better benchmarking and plotting (fewer bugs).

@Rexicon226 (Contributor)

> I'm not sure how to summarize the speedup in one number in a way that gives an accurate impression. Right now it's printing the geometric mean of old_time/new_time over all the sizes, but I'm pretty sure that's almost completely meaningless.

Given the nature of this function, I recommend creating three benchmarks: one where the predicate becomes false at the first element, another where that item is in the middle of the list, and the last where it's at the end.

I believe that would give the best impression of the performance change.
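
For illustration, a sketch of those three cases against `std.sort.partitionPoint`, assuming sorted, distinct `u32` items and an `item < threshold` predicate (the helper and its names are not from the PR's benchmark):

```zig
const std = @import("std");

// Three partition-point positions: start, middle, and end of the slice.
// Assumes `items` is sorted ascending with distinct values.
fn runCases(items: []const u32) [3]usize {
    const n = items.len;
    const thresholds = [3]u32{
        items[0], // predicate false at the first element -> point 0
        items[n / 2], // predicate becomes false in the middle -> point n/2
        items[n - 1] + 1, // predicate never false -> point n (sketch: assumes no overflow)
    };
    var points: [3]usize = undefined;
    for (thresholds, 0..) |threshold, i| {
        points[i] = std.sort.partitionPoint(u32, items, threshold, struct {
            fn lessThan(ctx: u32, item: u32) bool {
                return item < ctx;
            }
        }.lessThan);
    }
    return points;
}
```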

@JonathanHallstrom (Contributor, Author)

Queries uniformly distributed over the whole range:
[image]

Queries uniformly distributed over the first 16 elements of the range:
[image]

@JonathanHallstrom (Contributor, Author)

Queries all one past the end of the array:
[image]

@JonathanHallstrom (Contributor, Author)

Seems to be around 30% slower if the branches are all perfectly predictable.

@Rexicon226 (Contributor) commented Sep 16, 2024

Could you post your benchmark? I don't believe that partitioning (you know what I mean) 10^8 items takes... 20 ns.

@JonathanHallstrom (Contributor, Author)

Of course! I'll update the repo. And just to be clear, that benchmark is for the same query over and over, which should be perfectly predictable.

@JonathanHallstrom (Contributor, Author)

> Of course! I'll update the repo. And just to be clear, that benchmark is for the same query over and over, which should be perfectly predictable.

…and which will be cached, since I'm not evicting the whole array after every query, as that makes the benchmark take too long.

@JonathanHallstrom (Contributor, Author) commented Sep 17, 2024

The force push was to pull in 812557b, which seemed relevant; CI should pass again.

@Rexicon226 (Contributor)

So I spent some time and rewrote your benchmarks a bit. You can find the result here: https://github.com/Rexicon226/partition_point_optimization

[image]

I see that the performance of this new version is pretty consistently slower. I'll leave it up to you to figure out why and how to improve it :). I hope this benchmark helps.

@JonathanHallstrom (Contributor, Author)

Yes, it's because you're giving it the same input over and over, which means the branchy version is perfectly predictable.

@JonathanHallstrom (Contributor, Author)

Calling binary search with the same key in a loop seems to me to be farther from a typical use case than calling it with inputs that are distributed across the whole input slice.

@Rexicon226 (Contributor)

> Yes, it's because you're giving it the same input over and over, which means the branchy version is perfectly predictable.

That does not matter. It's flushed before running it.

@JonathanHallstrom (Contributor, Author)

My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.

@JonathanHallstrom (Contributor, Author)

> My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.

Also the branch at the end.*

@Rexicon226 (Contributor)

> Calling binary search with the same key in a loop seems to me to be farther from a typical use case than calling it with inputs that are distributed across the whole input slice.

The function scales linearly, so it doesn't matter. I am confident that the performance where the point is at N / 2 will be extremely similar to that at (N / 2) - 1. You can add more data points along the slice if you'd like, but I'm sure it'll give you the same answer.

> My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.

I'm not sure what you're arguing here. The branch predictor has little to do here, and the fact that the same input is used on multiple iterations affects almost nothing.

Also, please note that you can edit messages on GitHub; there is no need to send three at a time.

@JonathanHallstrom (Contributor, Author) commented Sep 17, 2024

> Also, please note that you can edit messages on GitHub; there is no need to send three at a time.

Sorry about the multiple messages; I should've taken more time to write out a full response.

> The function scales linearly, so it doesn't matter.

Not sure what you mean; I'm probably missing something here. Could you elaborate?

>> My version of the code only has the loop condition as a branch, not the body of the loop, so it seems relevant to me.
>
> I'm not sure what you're arguing here. The branch predictor has little to do here, and the fact that the same input is used on multiple iterations affects almost nothing.

The branch predictor having little to do in your benchmark is exactly my point: I don't think consecutive queries should be identical in the benchmark. If consecutive queries are identical, then I agree that my implementation is much slower (I measured ~30%, as noted earlier).

If the queries instead all fall randomly within a small range, I would expect the branch predictor to struggle to predict the last few branches, leading to a slowdown. This is supported by my benchmark with the queries uniformly distributed over the first 16 elements showing a speedup.

To see more clearly the effect of more and more branches becoming unpredictable in the old implementation, I ran a series of benchmarks with an increasing range of keys being queried.
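
(For reference, a sketch of how such queries can be drawn, assuming the haystack is a sorted array of distinct integers, so that keys uniform on [0, 2^k) leave exactly k hard-to-predict decisions; the names are illustrative.)

```zig
const std = @import("std");

// Draw a query uniformly from [0, 2^k).
fn randomQuery(rng: std.Random, k: u5) u32 {
    return rng.uintLessThan(u32, @as(u32, 1) << k);
}

pub fn main() void {
    var prng = std.Random.DefaultPrng.init(0);
    const rng = prng.random();
    // k = 0 reproduces the [0, 1) case below; k = 4 the [0, 16) case.
    std.debug.print("sample query: {}\n", .{randomQuery(rng, 4)});
}
```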

Queries on [0, 1) (no unpredictable branches):
[image]

Queries on [0, 2) (one unpredictable branch):
[image]

Queries on [0, 4) (two unpredictable branches):
[image]

Queries on [0, 8) (three unpredictable branches):
[image]

Queries on [0, 16) (four unpredictable branches):
[image]

The pattern seems pretty clear to me: the old version gets a bit slower with each doubling, but the new branchless version stays the same. From a memory perspective all these benchmarks should be identical, so I think the differences come down to the branch predictor struggling successively more on the old version.

@JonathanHallstrom (Contributor, Author) commented Sep 19, 2024

@Rexicon226 Apologies if I was rude or inconsiderate; could you take another look?

@Rexicon226 (Contributor)

> @Rexicon226 Apologies if I was rude or inconsiderate; could you take another look?

Sorry, you weren't rude at all. I am simply a bit busy with work at the moment and can't look deeply enough into your benchmarks to give a good reply yet :) I will certainly get back to this PR over the weekend!

@JonathanHallstrom (Contributor, Author)

Alright, sounds good!

@JonathanHallstrom (Contributor, Author)

Plot from a Ryzen 5 5600X:
[image]

Going to try implementing prefetching / doing a branchy solution for the first few iterations to see if I can recover the performance on the largest arrays.

@JonathanHallstrom (Contributor, Author) commented Sep 21, 2024

https://github.com/JonathanHallstrom/partition_point_optimization/blob/70d340ffd547dac0504e5c5d71da2ec34572d33b/src/main.zig

Wrote a new version which runs a branchy implementation for a few iterations to recover performance on big arrays.
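
A sketch of that hybrid shape, with the branchless loop from the earlier sketch inlined as the second phase (key-based lower bound; the 4-step cutoff is illustrative, and the prefetching idea is omitted):

```zig
fn lowerBoundHybrid(comptime T: type, items: []const T, key: T) usize {
    var lo: usize = 0;
    var hi: usize = items.len;
    // Phase 1: a few classic, data-dependent (branchy) bisection steps.
    // On huge arrays these early probes are dominated by cache misses anyway.
    var steps: usize = 4; // illustrative cutoff
    while (hi - lo > 1 and steps > 0) : (steps -= 1) {
        const mid = lo + (hi - lo) / 2;
        if (items[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    // Phase 2: finish branchlessly on the remaining window (fixed trip
    // count; the conditional can be lowered to a conditional move).
    var base: usize = lo;
    var len: usize = hi - lo;
    while (len > 1) {
        const half = len / 2;
        if (items[base + half - 1] < key) base += half;
        len -= half;
    }
    return base + @intFromBool(len == 1 and items[base] < key);
}
```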

Performance on an AMD Ryzen 5 5600X:
[image: graph7]

Performance on an Intel i7-12700:
[image: graph7]

@JonathanHallstrom (Contributor, Author)

Updated the benchmark to have a percentage of queries be unpredictable, instead of having all of them fall in the same range.
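
(A sketch of that mixing, assuming the predictable queries reuse one fixed key; the names and parameters are illustrative.)

```zig
const std = @import("std");

// With probability `p`, draw an unpredictable key uniformly from
// [0, max_key); otherwise return a fixed, perfectly predictable key.
fn mixedQuery(rng: std.Random, p: f32, fixed_key: u32, max_key: u32) u32 {
    if (rng.float(f32) < p) return rng.uintLessThan(u32, max_key);
    return fixed_key;
}
```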

Perf on an Intel i7-12700 (data at https://github.com/JonathanHallstrom/partition_point_optimization/tree/main/12700):
[image: perf_12700]

Perf on a Ryzen 5 5600X (data at https://github.com/JonathanHallstrom/partition_point_optimization/tree/main/5600x):
[image: perf_5600x]
