-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make [u8]::reverse() 5x faster #41764
Conversation
Since LLVM doesn't vectorize the loop for us, do unaligned reads of a larger type and use LLVM's bswap intrinsic to do the reversing of the actual bytes. cfg!-restricted to x86 and x86_64, as I assume it wouldn't help on things like ARMv5. Also makes [u16]::reverse() a more modest 1.5x faster by loading/storing u32 and swapping the u16s with ROT16. Thank you ptr::*_unaligned for making this easy :)
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @brson (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
This is a neat patch! It combines several low level abstractions to do something very precise and unusual, and it's written clearly and all the abstractions disappear into a blip of assembly. Good Rust. If you think the compiler should be doing this optimization (or better) itself maybe add a comment saying so, in case someday it should be removed. It looks right to me but somebody else should review. Maybe r? @bluss? |
src/libcollections/benches/slice.rs
Outdated
reverse!(reverse_u8, u8); | ||
reverse!(reverse_u16, u16); | ||
reverse!(reverse_u32, u32); | ||
reverse!(reverse_u64, u64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should u128
also be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to have all the primitives. Also added [u8;3]
and Simd<[f64;4]>
while I was at it, to show more of the perf range.
Results, from fastest to slowest:
test slice::reverse_simd_f64x4 ... bench: 36,818 ns/iter (+/- 924) = 28479 MB/s
test slice::reverse_u128 ... bench: 41,797 ns/iter (+/- 3,127) = 25087 MB/s
test slice::reverse_u64 ... bench: 47,062 ns/iter (+/- 898) = 22280 MB/s
test slice::reverse_u8 ... bench: 51,678 ns/iter (+/- 3,819) = 20290 MB/s
test slice::reverse_u32 ... bench: 74,404 ns/iter (+/- 387) = 14092 MB/s
test slice::reverse_u16 ... bench: 92,952 ns/iter (+/- 2,385) = 11280 MB/s
test slice::reverse_u8x3 ... bench: 181,223 ns/iter (+/- 6,541) = 5786 MB/s
None of these are affected by e8fad32.
@bluss are you planning on reviewing this PR? |
@bors r+ |
📌 Commit da91361 has been approved by |
Make [u8]::reverse() 5x faster Since LLVM doesn't vectorize the loop for us, do unaligned reads of a larger type and use LLVM's bswap intrinsic to do the reversing of the actual bytes. cfg!-restricted to x86 and x86_64, as I assume it wouldn't help on things like ARMv5. Also makes [u16]::reverse() a more modest 1.5x faster by loading/storing u32 and swapping the u16s with ROT16. Thank you ptr::*_unaligned for making this easy :) Benchmark results (from my i5-2500K): ```text test slice::reverse_u8 ... bench: 273,836 ns/iter (+/- 15,592) = 3829 MB/s test slice::reverse_u16 ... bench: 139,793 ns/iter (+/- 17,748) = 7500 MB/s test slice::reverse_u32 ... bench: 74,997 ns/iter (+/- 5,130) = 13981 MB/s test slice::reverse_u64 ... bench: 47,452 ns/iter (+/- 2,213) = 22097 MB/s test slice::reverse_u8 ... bench: 52,170 ns/iter (+/- 3,962) = 20099 MB/s test slice::reverse_u16 ... bench: 93,330 ns/iter (+/- 4,412) = 11235 MB/s test slice::reverse_u32 ... bench: 74,731 ns/iter (+/- 1,425) = 14031 MB/s test slice::reverse_u64 ... bench: 47,556 ns/iter (+/- 3,025) = 22049 MB/s ``` If you're curious about the assembly, instead of doing this ``` movzx eax, byte ptr [rdi] movzx ecx, byte ptr [rsi] mov byte ptr [rdi], cl mov byte ptr [rsi], al ``` it does this ``` mov rax, qword ptr [rdx] mov rbx, qword ptr [r11 + rcx - 8] bswap rbx mov qword ptr [rdx], rbx bswap rax mov qword ptr [r11 + rcx - 8], rax ```
Make [u8]::reverse() 5x faster Since LLVM doesn't vectorize the loop for us, do unaligned reads of a larger type and use LLVM's bswap intrinsic to do the reversing of the actual bytes. cfg!-restricted to x86 and x86_64, as I assume it wouldn't help on things like ARMv5. Also makes [u16]::reverse() a more modest 1.5x faster by loading/storing u32 and swapping the u16s with ROT16. Thank you ptr::*_unaligned for making this easy :) Benchmark results (from my i5-2500K): ```text # Before test slice::reverse_u8 ... bench: 273,836 ns/iter (+/- 15,592) = 3829 MB/s test slice::reverse_u16 ... bench: 139,793 ns/iter (+/- 17,748) = 7500 MB/s test slice::reverse_u32 ... bench: 74,997 ns/iter (+/- 5,130) = 13981 MB/s test slice::reverse_u64 ... bench: 47,452 ns/iter (+/- 2,213) = 22097 MB/s # After test slice::reverse_u8 ... bench: 52,170 ns/iter (+/- 3,962) = 20099 MB/s test slice::reverse_u16 ... bench: 93,330 ns/iter (+/- 4,412) = 11235 MB/s test slice::reverse_u32 ... bench: 74,731 ns/iter (+/- 1,425) = 14031 MB/s test slice::reverse_u64 ... bench: 47,556 ns/iter (+/- 3,025) = 22049 MB/s ``` If you're curious about the assembly, instead of doing this ``` movzx eax, byte ptr [rdi] movzx ecx, byte ptr [rsi] mov byte ptr [rdi], cl mov byte ptr [rsi], al ``` it does this ``` mov rax, qword ptr [rdx] mov rbx, qword ptr [r11 + rcx - 8] bswap rbx mov qword ptr [rdx], rbx bswap rax mov qword ptr [r11 + rcx - 8], rax ```
☀️ Test successful - status-appveyor, status-travis |
llvm byte swap intrinsic works fine on ARM as well :/ |
@gnzlbg Right, but older ARM (like v4 & v5) doesn't have fast unaligned access, so I doubted that |
Oh, did not knew that.
This make sense, maybe we should fill an issue and mark it with the easy tag + ARM or something. One would just need to try this on ARMv6/v7/v8 and choose an appropriate set of features. |
Since LLVM doesn't vectorize the loop for us, do unaligned reads of a larger type and use LLVM's bswap intrinsic to do the reversing of the actual bytes. cfg!-restricted to x86 and x86_64, as I assume it wouldn't help on things like ARMv5.
Also makes [u16]::reverse() a more modest 1.5x faster by loading/storing u32 and swapping the u16s with ROT16.
Thank you ptr::*_unaligned for making this easy :)
Benchmark results (from my i5-2500K):
If you're curious about the assembly, instead of doing this
it does this