
ARROW-12343: [Rust] Support auto-vectorization for min/max #10002

Closed
Wants to merge 4 commits

@Dandandan (Contributor) commented Apr 12, 2021

This is a PoC to get some more SIMD instructions in Arrow without requiring the nightly compiler or producing non-portable binaries, inspired by this article shared by @jorgecarleitao:
https://www.nickwilcox.com/blog/autovec2/

This allows dispatching on CPU features, letting the compiler auto-vectorize the code without losing portability or introducing unsafe blocks. Here we use the multiversion crate to easily add more compiled versions with different CPU features.
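
As a rough illustration of the approach, here is a minimal sketch (not the code in this PR; the exact attribute syntax varies between releases of the multiversion crate): the kernel stays plain, safe, scalar Rust and the macro generates the feature-specific clones plus the runtime dispatch.

```rust
use multiversion::multiversion;

// Hypothetical example kernel; the PR applies the same idea to the existing
// Arrow min/max aggregation kernels.
#[multiversion]
#[clone(target = "x86_64+avx")]
fn min_i32(values: &[i32]) -> Option<i32> {
    // Plain scalar code: each clone is compiled with the target features
    // listed above so LLVM can auto-vectorize it, and the best supported
    // clone is picked at runtime via CPU feature detection.
    values.iter().copied().min()
}
```

Callers just call `min_i32` as an ordinary function; the generated wrapper handles the feature detection and dispatch.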

On my machine:

min 512                 time:   [957.16 ns 958.57 ns 960.23 ns]                     
                        change: [-24.964% -24.524% -24.093%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

max 512                 time:   [947.74 ns 949.47 ns 952.04 ns]                     
                        change: [-23.380% -22.955% -22.562%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

@Dandandan changed the title from "Support auto-vectorization for min/max" to "ARROW-12343: [Rust] Support auto-vectorization for min/max" on Apr 12, 2021
@jorgecarleitao (Member) commented:

holy moly, 6 lines of code. Brilliant.

How does it compare with our SIMD implementation? (evaluating whether they are needed at all :P)

@Dandandan (Contributor, Author) commented Apr 12, 2021

@jorgecarleitao

It's definitely not as good as compiling with the simd feature, which is about 4x as fast as this version (and ~5x as fast as the master branch). This is with --features simd (and the nightly compiler, but that doesn't matter for these benchmarks):

min 512                 time:   [232.68 ns 233.24 ns 233.83 ns]                    
                        change: [-75.848% -75.631% -75.432%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  5 (5.00%) high severe

max 512                 time:   [229.37 ns 229.71 ns 230.13 ns]                    
                        change: [-75.484% -75.392% -75.303%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  11 (11.00%) high mild
  7 (7.00%) high severe

@alamb (Contributor) left a comment

I think this is a great idea @Dandandan

I ran the benchmark tests on my laptop and I saw similar improvement numbers (master is 30% slower than this branch).

I vote we :shipit:

Given how effective this is, perhaps we can add a similar thing to other kernels (I could file a ticket if so)

FYI @andygrove and @nevi-me

@alamb (Contributor) commented Apr 14, 2021

> holy moly, 6 lines of code. Brilliant.

I 100% agree with this sentiment.

@Dandandan (Contributor, Author) commented Apr 14, 2021

> I think this is a great idea @Dandandan
>
> I ran the benchmark tests on my laptop and I saw similar improvement numbers (master is 30% slower than this branch).
>
> I vote we :shipit:
>
> Given how effective this is, perhaps we can add a similar thing to other kernels (I could file a ticket if so)
>
> FYI @andygrove and @nevi-me

I think it can be useful to check which kernels / functions could benefit from some more SIMD instructions. We might also need to play a bit with things like inline attributes: a non-inlined function is compiled only once and reused by both code paths, so it might not benefit from the same optimizations. Also, auto-vectorization doesn't always kick in, and as we see here it is not nearly as effective as the "manual" SIMD implementation (which is still ~4x faster).
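
To make the inlining point concrete, here is a hand-rolled sketch in stable Rust (hypothetical names, not code from this PR or generated by the crate): the shared body has to be inlinable so that each clone can be re-optimized with its own target features.

```rust
// Shared scalar body. If this were a separately compiled, non-inlined
// function, it would be built once with the baseline features (e.g. sse2)
// and the AVX2 clone below would gain nothing.
#[inline(always)]
fn min_impl(values: &[i32]) -> Option<i32> {
    values.iter().copied().min()
}

// Baseline version, compiled with the crate's default target features.
fn min_default(values: &[i32]) -> Option<i32> {
    min_impl(values)
}

// AVX2 clone: `min_impl` is inlined here and re-optimized (potentially
// auto-vectorized) with AVX2 enabled. On stable Rust a `#[target_feature]`
// function has to be `unsafe`; this is the boilerplate the multiversion
// macro hides behind safe generated wrappers.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn min_avx2(values: &[i32]) -> Option<i32> {
    min_impl(values)
}
```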

As we are using packed_simd2, which also only uses the standard instruction set from the default target features (i.e. sse2), I believe the same idea could be used there without resorting to emitting instructions unconditionally (like we currently do for the avx_512 feature).

@alamb (Contributor) commented Apr 15, 2021

@nevi-me / @jorgecarleitao any concerns about merging this PR? Only the new dependency is of potential concern to me, but it seems to me to be a pretty reasonable tradeoff

We could probably also just inline the part of the multiversion crate we need (I bet it has more features than we are using), but it isn't clear to me that avoiding the dependency is worth it in this case.

@paddyhoran (Contributor) left a comment

LGTM

> Only the new dependency is of potential concern to me, but it seems to me to be a pretty reasonable tradeoff

multiversion is pretty small (not many dependencies) and looks like it could be quite useful in general.

From their README:

> Function multiversioning is the practice of compiling multiple versions of a function with various features enabled and safely detecting which version to use at runtime.

My only concern would be on binary size if we really go crazy using it but we can monitor that over time.

@returnString (Contributor) commented:

The multiversion crate seems really handy! Are the tradeoffs just the expected bump in binary size to support the multiple implementations and presumably a branch when entering a multi-target code path? Is there any facility to remove the runtime checks, e.g. if I know I'm producing builds for say aarch64?

@nevi-me (Contributor) left a comment

I'm happy with the change. Out of curiosity I tried this on aarch64, but because support for that arch isn't stable, it can't work on stable :(

@alamb (Contributor) commented Apr 17, 2021

It sounds to me like there is consensus on this approach. I'll plan to merge it in as soon as the 4.0 release is out the door (ETA early next week), so as not to potentially introduce instability into the release process at this stage.

@Dandandan (Contributor, Author) commented:

> The multiversion crate seems really handy! Are the tradeoffs just the expected bump in binary size to support the multiple implementations and presumably a branch when entering a multi-target code path? Is there any facility to remove the runtime checks, e.g. if I know I'm producing builds for say aarch64?

AFAIK the architecture check is static/compile time, so it will not be part of the runtime checks. If an architecture doesn't match the target architecture, the branches for its features shouldn't be included in the generated code at all.
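
A hand-written sketch of that point, reusing the hypothetical `min_default`/`min_avx2` clones from the earlier sketch (again, not the crate's actual generated code): the architecture test is a `cfg`, so on an aarch64 build the whole x86_64 block below is compiled out, and only the CPU-feature check inside it happens at runtime.

```rust
fn min_dispatch(values: &[i32]) -> Option<i32> {
    // This block only exists when compiling for x86_64; on any other
    // architecture it is removed at compile time by the cfg attribute.
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 availability was verified at runtime just above.
            return unsafe { min_avx2(values) };
        }
    }
    // Portable fallback for non-x86_64 targets and older x86_64 CPUs.
    min_default(values)
}
```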

> My only concern would be on binary size if we really go crazy using it but we can monitor that over time.

Agreed. Good to monitor this and try to keep size down 👍

@alamb (Contributor) commented Apr 19, 2021

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated GitHub repositories, arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository.

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

@Dandandan (Contributor, Author) commented:

Opened in new repo apache/arrow-rs#9

@Dandandan closed this on Apr 19, 2021