Add support for AArch64 CRC32 instructions #6

Merged
merged 1 commit into from
Jan 19, 2019

valpackett
Contributor

@valpackett valpackett commented Dec 8, 2018

This should eventually be done using intrinsics, but

Ideally, CPU capabilities should be checked too, but stdsimd doesn't do that on FreeBSD on non-x86 CPUs (elf_aux_info) yet. (And all my machines run FreeBSD :D) CRC is mandatory in ARMv8.1 anyway, and there are very few v8.0 chips without it.
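
A minimal sketch of the capability check described above, assuming a toolchain and platform where the standard library's `is_aarch64_feature_detected!` macro works (which, as noted, was not yet the case on FreeBSD when this was written). The helper name `has_hw_crc32` is hypothetical and not taken from the PR.

```rust
/// Reports whether the AArch64 CRC32 instructions are usable on this CPU.
/// `has_hw_crc32` is an illustrative name, not part of the PR.
#[cfg(target_arch = "aarch64")]
fn has_hw_crc32() -> bool {
    // Run-time detection; "crc" is the feature name for the CRC32 extension.
    std::arch::is_aarch64_feature_detected!("crc")
}

/// On every other architecture this sketch never selects the AArch64 path.
#[cfg(not(target_arch = "aarch64"))]
fn has_hw_crc32() -> bool {
    false
}
```

A caller would branch on `has_hw_crc32()` once and choose between the specialized and the baseline implementation.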

see comments


Some fun bench runs!

tfw a humble ARM Cortex-A72 @ 2.18GHz (Rockchip RK3399, cpuset -l4-5):

test bench_kilobyte_baseline    ... bench:       1,167 ns/iter (+/- 41) = 877 MB/s
test bench_kilobyte_specialized ... bench:          80 ns/iter (+/- 0) = 12800 MB/s
test bench_megabyte_baseline    ... bench:   1,201,396 ns/iter (+/- 10,942) = 872 MB/s
test bench_megabyte_specialized ... bench:     185,709 ns/iter (+/- 197,317) = 5646 MB/s

matches a Ryzen 7 1700 @ 3.85GHz (well, in one test)

test bench_kilobyte_baseline    ... bench:         301 ns/iter (+/- 1) = 3401 MB/s
test bench_kilobyte_specialized ... bench:          80 ns/iter (+/- 0) = 12800 MB/s
test bench_megabyte_baseline    ... bench:     302,681 ns/iter (+/- 710) = 3464 MB/s
test bench_megabyte_specialized ... bench:      73,559 ns/iter (+/- 168) = 14254 MB/s

while the Cortex-A53 (@ 1.6GHz, Rockchip RK3399, cpuset -l0-3) is that much worse than the A72:

test bench_kilobyte_baseline    ... bench:       1,922 ns/iter (+/- 52) = 532 MB/s
test bench_kilobyte_specialized ... bench:         229 ns/iter (+/- 0) = 4471 MB/s
test bench_megabyte_baseline    ... bench:   1,904,466 ns/iter (+/- 14,574) = 550 MB/s
test bench_megabyte_specialized ... bench:     563,171 ns/iter (+/- 10,137) = 1861 MB/s

and Cavium ThunderX (Scaleway's KVM VPS) has terrible CRC32 units in particular:

test bench_kilobyte_baseline    ... bench:       1,436 ns/iter (+/- 22) = 713 MB/s
test bench_kilobyte_specialized ... bench:         375 ns/iter (+/- 19) = 2730 MB/s
test bench_megabyte_baseline    ... bench:   2,332,481 ns/iter (+/- 618,310) = 449 MB/s
test bench_megabyte_specialized ... bench:     595,290 ns/iter (+/- 65,599) = 1761 MB/s

upd: my phone: Qualcomm Snapdragon 660 (Kryo V2, 2.2GHz, weird big.little management?):

test bench_kilobyte_baseline    ... bench:         916 ns/iter (+/- 19) = 1117 MB/s
test bench_kilobyte_specialized ... bench:         129 ns/iter (+/- 1) = 7937 MB/s
test bench_megabyte_baseline    ... bench:     951,838 ns/iter (+/- 28,575) = 1101 MB/s
test bench_megabyte_specialized ... bench:     124,165 ns/iter (+/- 4,731) = 8445 MB/s

upd: Amazon EC2 a1 instance (Graviton, also A72) — looks like more cache than RK3399

test bench_kilobyte_baseline    ... bench:       1,193 ns/iter (+/- 38) = 858 MB/s
test bench_kilobyte_specialized ... bench:          69 ns/iter (+/- 0) = 14840 MB/s
test bench_megabyte_baseline    ... bench:   1,114,682 ns/iter (+/- 3,522) = 940 MB/s
test bench_megabyte_specialized ... bench:      72,397 ns/iter (+/- 242) = 14483 MB/s

upd: Packet c2.large.arm (Ampere eMAG)

test bench_kilobyte_baseline    ... bench:         815 ns/iter (+/- 12) = 1256 MB/s
test bench_kilobyte_specialized ... bench:         103 ns/iter (+/- 0) = 9941 MB/s
test bench_megabyte_baseline    ... bench:     875,233 ns/iter (+/- 1,133) = 1198 MB/s
test bench_megabyte_specialized ... bench:      92,762 ns/iter (+/- 77) = 11303 MB/s

upd: Marvell MACCHIATObin (A72 @ 1.6GHz)

test bench_kilobyte_baseline    ... bench:       1,471 ns/iter (+/- 0) = 696 MB/s
test bench_kilobyte_specialized ... bench:          99 ns/iter (+/- 0) = 10343 MB/s
test bench_megabyte_baseline    ... bench:   1,541,176 ns/iter (+/- 18,950) = 680 MB/s
test bench_megabyte_specialized ... bench:     109,078 ns/iter (+/- 78,118) = 9613 MB/s

upd: Marvell MACCHIATObin (A72 @ 2.0GHz)

test bench_kilobyte_baseline    ... bench:       1,176 ns/iter (+/- 0) = 870 MB/s
test bench_kilobyte_specialized ... bench:          79 ns/iter (+/- 0) = 12962 MB/s
test bench_megabyte_baseline    ... bench:   1,212,772 ns/iter (+/- 13,830) = 864 MB/s
test bench_megabyte_specialized ... bench:      86,684 ns/iter (+/- 70,792) = 12096 MB/s

upd: Amazon EC2 m6g (Graviton2, Neoverse N1)

test bench_kilobyte_baseline    ... bench:         622 ns/iter (+/- 12) = 1646 MB/s
test bench_kilobyte_specialized ... bench:          60 ns/iter (+/- 0) = 17066 MB/s
test bench_megabyte_baseline    ... bench:     710,155 ns/iter (+/- 3,955) = 1476 MB/s
test bench_megabyte_specialized ... bench:      53,606 ns/iter (+/- 210) = 19560 MB/s

upd: Apple M1 Max (MacBook Pro) thanks weatherlight — impressive baseline, but unimpressive HW crc32 units

test bench_kilobyte_baseline    ... bench:         223 ns/iter (+/- 8) = 4591 MB/s
test bench_kilobyte_specialized ... bench:         100 ns/iter (+/- 0) = 10240 MB/s
test bench_megabyte_baseline    ... bench:     231,689 ns/iter (+/- 1,885) = 4525 MB/s
test bench_megabyte_specialized ... bench:     122,382 ns/iter (+/- 4,651) = 8568 MB/s

upd: Ryzen 9 5950X @ PBO for comparison

test bench_kilobyte_baseline    ... bench:         211 ns/iter (+/- 3) = 4853 MB/s
test bench_kilobyte_specialized ... bench:          62 ns/iter (+/- 2) = 16516 MB/s
test bench_megabyte_baseline    ... bench:     206,821 ns/iter (+/- 2,061) = 5069 MB/s
test bench_megabyte_specialized ... bench:      58,008 ns/iter (+/- 1,243) = 18076 MB/s
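
For reading the figures above: the MB/s column appears consistent with bytes × 1000 / ns, truncated to an integer, assuming the kilobyte and megabyte benches hash 1024 and 1024 × 1024 bytes respectively (the buffer sizes are inferred from the bench names and the numbers, not stated in the thread). A small sketch with hypothetical helper names:

```rust
/// Throughput in MB/s (10^6 bytes per second) for `bytes` processed in `ns` nanoseconds.
/// bytes/ns equals GB/s, so multiplying by 1000 gives MB/s; integer division truncates.
fn throughput_mb_s(bytes: u64, ns: u64) -> u64 {
    bytes * 1000 / ns
}

/// Speedup of the specialized path over the baseline, from two ns/iter values.
fn speedup(baseline_ns: u64, specialized_ns: u64) -> f64 {
    baseline_ns as f64 / specialized_ns as f64
}
```

With the Cortex-A72 kilobyte numbers, `throughput_mb_s(1024, 80)` gives 12800 and `speedup(1167, 80)` is roughly 14.6.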

Owner

@srijs srijs left a comment

Hi, thanks for this! The benchmarks certainly look promising.

I've left comments in-line to be addressed.

src/lib.rs (outdated, resolved)
src/specialized/aarch64.rs (outdated, resolved)
@srijs
Owner

srijs commented Dec 8, 2018

Forgot to say this, but if you want to use LLVM intrinsics instead of inline assembly, you may be able to use the link_llvm_intrinsics feature and then do something like this instead:

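extern {
    // Functions declared in an `extern` block are implicitly unsafe to call,
    // so no `unsafe` qualifier is written on the declaration.
    #[link_name = "llvm.aarch64.crc32x"]
    pub fn crc32x(a: i32, b: i64) -> i32;
}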
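
As an alternative to linking the LLVM intrinsic by name, the same instructions are exposed through `std::arch::aarch64` (`__crc32b`, `__crc32d`). They were nightly-only at the time of this PR, so the following is a sketch that assumes a toolchain where they are available. The function names `crc32`, `crc32_update_hw` and `crc32_update_soft` are hypothetical, and the bitwise fallback is a plain reference implementation, not the crate's baseline.

```rust
/// Bitwise software CRC-32 update (reflected polynomial 0xEDB88320), the portable fallback.
fn crc32_update_soft(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // mask is all ones when the low bit is set, otherwise zero
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// Hardware update: 8 bytes per `__crc32d`, leftover bytes via `__crc32b`.
/// Must only be called when the `crc` feature is present.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "crc")]
unsafe fn crc32_update_hw(mut crc: u32, data: &[u8]) -> u32 {
    use std::arch::aarch64::{__crc32b, __crc32d};
    use std::convert::TryInto;

    let mut chunks = data.chunks_exact(8);
    for chunk in &mut chunks {
        // The instruction consumes the lowest byte of the word first.
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        crc = __crc32d(crc, word);
    }
    for &byte in chunks.remainder() {
        crc = __crc32b(crc, byte);
    }
    crc
}

/// CRC-32 of `data`, using the hardware path when it is detected at run time.
pub fn crc32(data: &[u8]) -> u32 {
    let state = !0u32;
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("crc") {
            // SAFETY: the `crc` feature was just detected on this CPU.
            return !unsafe { crc32_update_hw(state, data) };
        }
    }
    !crc32_update_soft(state, data)
}
```

Both paths apply the usual initial and final inversion around the raw update, so they should agree on any input.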

@valpackett valpackett changed the title from "Add support for AArch64 CRC32 instructions using inline asm" to "Add support for AArch64 CRC32 instructions" on Dec 8, 2018
@valpackett valpackett force-pushed the aarch64 branch 2 times, most recently from 188eab0 to 0fa7925 Compare December 8, 2018 15:07
@valpackett
Contributor Author

valpackett commented Dec 8, 2018

Now using intrinsics and detection via stdsimd, which should be added by rust-lang/stdarch#612 :) So waiting on that.

let mut ptr4;
let mut ptr8;

if len != 0 && ((ptr as usize) & 1) != 0 {
Contributor

Could this perhaps use the recently stabilized align_to method on slices to do the workhorse of the logic around alignment here?

Contributor Author

ooh, this is a very nice method! (and the chunks_exact iterator too)

A quick attempt at using it here though made performance significantly worse. I'll investigate that later.

Contributor Author

oh, it just wasn't inlining the intrinsics' wrappers because I removed the target_feature attr. lol.
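
A rough sketch of the `align_to` approach discussed in this thread, not the code that was merged: the slice is split into an unaligned head, a run of aligned 8-byte words, and a tail. `step_word` is a software stand-in for the `crc32x` instruction so the example runs on any target; in a hardware version the wrapper around the intrinsic would keep its `#[target_feature(enable = "crc")]` attribute so that it can be inlined, as noted above. All function names here are hypothetical.

```rust
/// One software CRC-32 step over a single byte (reflected polynomial 0xEDB88320).
fn step_byte(mut crc: u32, byte: u8) -> u32 {
    crc ^= u32::from(byte);
    for _ in 0..8 {
        crc = (crc >> 1) ^ (0xEDB8_8320 & (crc & 1).wrapping_neg());
    }
    crc
}

/// Software stand-in for the 64-bit CRC instruction: folds the 8 bytes of
/// `word` into `crc`, lowest byte first.
fn step_word(crc: u32, word: u64) -> u32 {
    word.to_le_bytes().iter().fold(crc, |c, &b| step_byte(c, b))
}

/// CRC-32 that lets `align_to` do the alignment bookkeeping.
fn crc32_aligned(data: &[u8]) -> u32 {
    // SAFETY: every bit pattern is a valid u64, so viewing u8 as u64 is sound.
    let (head, words, tail) = unsafe { data.align_to::<u64>() };
    let mut crc = !0u32;
    for &b in head {
        crc = step_byte(crc, b);
    }
    for &w in words {
        // `w` was loaded in native byte order; convert so that the first byte
        // in memory ends up as the lowest byte of the value.
        crc = step_word(crc, u64::from_le(w));
    }
    for &b in tail {
        crc = step_byte(crc, b);
    }
    !crc
}
```

`align_to` is allowed to return an empty middle slice, in which case everything is handled by the bytewise loops and the result is unchanged.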

@srijs
Owner

srijs commented Dec 12, 2018

@alexcrichton What would the timeline look like to get the crc intrinsics change shipped to nightly?

@alexcrichton
Contributor

Hopefully soon!

@srijs
Owner

srijs commented Dec 18, 2018

@myfreeweb It looks like the crc* functions have now landed in nightly. Is that all that's needed, or do we need to wait for your changes from rust-lang/stdarch#611 to hit nightly as well?

@valpackett
Contributor Author

That's all for this project of course.

@valpackett
Contributor Author

Rebased and updated for the new intrinsic names (rust-lang/stdarch#626). Let's wait for them to land in nightly.

@srijs
Owner

srijs commented Jan 17, 2019

It looks like the change to the intrinsic names has landed in nightly 🎉

Let me know if you want any help pushing this over the finish line!

@valpackett
Contributor Author

Cool. Removed the temporary stdsimd usage. Should be good to go now I think.

@srijs
Owner

srijs commented Jan 19, 2019

Excellent, thanks for all your effort!

@srijs srijs merged commit c49bba0 into srijs:master Jan 19, 2019