
ARROW-12343: [Rust] Support auto-vectorization for min/max #10002

Closed
Wants to merge 4 commits

@Dandandan (Contributor) commented Apr 12, 2021

This is a PoC to get some more SIMD instructions in Arrow without requiring the nightly compiler or producing non-portable binaries, inspired by this article shared by @jorgecarleitao:
https://www.nickwilcox.com/blog/autovec2/

This allows dispatching on CPU features, letting the compiler auto-vectorize the code without losing portability or introducing unsafe blocks. Here we use the multiversion crate to easily add more compiled versions with different CPU features.
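
As a rough illustration of the approach, here is a minimal sketch (not the code in this PR; the exact attribute syntax varies between releases of the multiversion crate): the kernel stays plain, safe, scalar Rust and the macro generates the feature-specific clones plus the runtime dispatch.

```rust
use multiversion::multiversion;

// Hypothetical example kernel; the PR applies the same idea to the existing
// Arrow min/max aggregation kernels.
#[multiversion]
#[clone(target = "x86_64+avx")]
fn min_i32(values: &[i32]) -> Option<i32> {
    // Plain scalar code: each clone is compiled with the target features
    // listed above so LLVM can auto-vectorize it, and the best supported
    // clone is picked at runtime via CPU feature detection.
    values.iter().copied().min()
}
```

Callers just call `min_i32` as an ordinary function; the generated wrapper handles the feature detection and dispatch.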

On my machine:

min 512                 time:   [957.16 ns 958.57 ns 960.23 ns]                     
                        change: [-24.964% -24.524% -24.093%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

max 512                 time:   [947.74 ns 949.47 ns 952.04 ns]                     
                        change: [-23.380% -22.955% -22.562%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

@Dandandan changed the title from "Support auto-vectorization for min/max" to "ARROW-12343: [Rust] Support auto-vectorization for min/max" on Apr 12, 2021
@jorgecarleitao (Member) commented:

holy moly, 6 lines of code. Brilliant.

How does it compare with our SIMD implementation? (evaluating whether they are needed at all :P)

@Dandandan (Contributor, Author) commented Apr 12, 2021

@jorgecarleitao

It's definitely not as good as compiling with the simd feature, which is about 4x as fast as this version (and ~5x as fast as the master branch). This is with --features simd (and the nightly compiler, but that doesn't matter for these benchmarks):

min 512                 time:   [232.68 ns 233.24 ns 233.83 ns]                    
                        change: [-75.848% -75.631% -75.432%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

max 512                 time:   [229.37 ns 229.71 ns 230.13 ns]                    
                        change: [-75.484% -75.392% -75.303%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  11 (11.00%) high mild
  7 (7.00%) high severe

@alamb (Contributor) left a comment

I think this is a great idea @Dandandan

I ran the benchmark tests on my laptop and I saw similar improvement numbers (master is 30% slower than this branch).

I vote we :shipit:

Given how effective this is, perhaps we can add a similar thing to other kernels (I could file a ticket if so)

FYI @andygrove and @nevi-me

@alamb (Contributor) commented Apr 14, 2021

> holy moly, 6 lines of code. Brilliant.

I 100% agree with this sentiment.

@Dandandan (Contributor, Author) commented Apr 14, 2021

> I think this is a great idea @Dandandan
>
> I ran the benchmark tests on my laptop and I saw similar improvement numbers (master is 30% slower than this branch).
>
> I vote we :shipit:
>
> Given how effective this is, perhaps we can add a similar thing to other kernels (I could file a ticket if so)
>
> FYI @andygrove and @nevi-me

I think it can be useful to check which kernels / functions could benefit from some more SIMD instructions. We might also need to play a bit with things like inline attributes: a non-inlined function is compiled only once and reused by both code paths, so it might not benefit from the same optimizations. Also, auto-vectorization doesn't always kick in, and as we see here it is not nearly as effective as the "manual" SIMD implementation (which is still ~4x faster).
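
To make the inlining point concrete, here is a hand-rolled sketch in stable Rust (hypothetical names, not code from this PR or generated by the crate): the shared body has to be inlinable so that each clone can be re-optimized with its own target features.

```rust
// Shared scalar body. If this were a separately compiled, non-inlined
// function, it would be built once with the baseline features (e.g. sse2)
// and the AVX2 clone below would gain nothing.
#[inline(always)]
fn min_impl(values: &[i32]) -> Option<i32> {
    values.iter().copied().min()
}

// Baseline version, compiled with the crate's default target features.
fn min_default(values: &[i32]) -> Option<i32> {
    min_impl(values)
}

// AVX2 clone: `min_impl` is inlined here and re-optimized (potentially
// auto-vectorized) with AVX2 enabled. On stable Rust a `#[target_feature]`
// function has to be `unsafe`; this is the boilerplate the multiversion
// macro hides behind safe generated wrappers.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn min_avx2(values: &[i32]) -> Option<i32> {
    min_impl(values)
}
```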

As we are using packed_simd2, which also only uses the standard instruction set from the default target features (i.e. sse2), I believe the same idea could be used there without resorting to emitting instructions unconditionally (like we currently do for the avx_512 feature).

@alamb (Contributor) commented Apr 15, 2021

@nevi-me / @jorgecarleitao any concerns about merging this PR? Only the new dependency is of potential concern to me, but it seems to me to be a pretty reasonable tradeoff

We could probably also just inline the part of the multiversion crate we need (I bet it has more features than we are using), but it isn't clear to me that avoiding the dependency is worth it in this case.

@paddyhoran (Contributor) left a comment

LGTM

> Only the new dependency is of potential concern to me, but it seems to me to be a pretty reasonable tradeoff

multiversion is pretty small (not many dependencies) and looks like it could be quite useful in general.

From their README:

> Function multiversioning is the practice of compiling multiple versions of a function with various features enabled and safely detecting which version to use at runtime.

My only concern would be on binary size if we really go crazy using it but we can monitor that over time.

@returnString (Contributor) commented:

The multiversion crate seems really handy! Are the tradeoffs just the expected bump in binary size to support the multiple implementations and presumably a branch when entering a multi-target code path? Is there any facility to remove the runtime checks, e.g. if I know I'm producing builds for say aarch64?

@nevi-me (Contributor) left a comment

I'm happy with the change. Out of curiosity I tried this on aarch64, but because support for that arch isn't stable, it can't work on stable :(

@alamb (Contributor) commented Apr 17, 2021

It sounds to me like there is consensus on this approach. I'll plan to merge it in as soon as the 4.0 release is out the door (ETA early next week), so as not to potentially introduce instability into the release process at this stage.

@Dandandan (Contributor, Author) commented:

> The multiversion crate seems really handy! Are the tradeoffs just the expected bump in binary size to support the multiple implementations and presumably a branch when entering a multi-target code path? Is there any facility to remove the runtime checks, e.g. if I know I'm producing builds for say aarch64?

AFAIK the architecture check is static/compile time, so it will not be part of the runtime checks. If an architecture doesn't match the target architecture, the branches for its features shouldn't be included in the generated code at all.
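
A hand-written sketch of that point, reusing the hypothetical `min_default`/`min_avx2` clones from the earlier sketch (again, not the crate's actual generated code): the architecture test is a `cfg`, so on an aarch64 build the whole x86_64 block below is compiled out, and only the CPU-feature check inside it happens at runtime.

```rust
fn min_dispatch(values: &[i32]) -> Option<i32> {
    // This block only exists when compiling for x86_64; on any other
    // architecture it is removed at compile time by the cfg attribute.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 availability was verified at runtime just above.
            return unsafe { min_avx2(values) };
        }
    }
    // Portable fallback for non-x86_64 targets and older x86_64 CPUs.
    min_default(values)
}
```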

> My only concern would be on binary size if we really go crazy using it but we can monitor that over time.

Agreed. Good to monitor this and try to keep size down 👍

@alamb (Contributor) commented Apr 19, 2021

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated GitHub repositories, arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository.

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

@Dandandan (Contributor, Author) commented:

Opened in new repo apache/arrow-rs#9

@Dandandan closed this on Apr 19, 2021