Add support for AArch64 CRC32 instructions #6

valpackett · 2018-12-08T01:06:04Z

~~This should eventually be done using intrinsics, but~~

~~core::arch::aarch64 doesn't have any crc32 intrinsics right now~~
the intrinsic access is only stable for x86 anyway I think??

Ideally, CPU capabilities should be checked at runtime too, but stdsimd doesn't do that yet on FreeBSD for non-x86 CPUs (that would go through elf_aux_info). (And all my machines run FreeBSD :D) CRC is mandatory from ARMv8.1 onward anyway, and very few v8.0 chips lack it.
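For illustration, a runtime capability check could look roughly like the sketch below. It assumes a toolchain where `std::arch::is_aarch64_feature_detected!` is available; whether detection actually works on a given OS depends on the standard library, not on this code. The names `has_hw_crc32` and `backend_name` are hypothetical and not taken from this PR.

```rust
/// Hypothetical helper: true when the running CPU reports the CRC32 extension.
#[cfg(target_arch = "aarch64")]
pub fn has_hw_crc32() -> bool {
    // Runtime query; the result depends on what std can detect on this OS.
    std::arch::is_aarch64_feature_detected!("crc")
}

/// On other architectures this sketch simply reports no AArch64 CRC32 support.
#[cfg(not(target_arch = "aarch64"))]
pub fn has_hw_crc32() -> bool {
    false
}

/// Hypothetical helper: names the code path a caller would pick.
/// The labels mirror the bench names used in this thread.
pub fn backend_name() -> &'static str {
    if has_hw_crc32() {
        "specialized"
    } else {
        "baseline"
    }
}
```

A caller would check `has_hw_crc32()` once and fall back to the table-based implementation when it returns `false`.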

see comments

Some fun bench runs!

tfw a humble ARM Cortex-A72 @ 2.18GHz (Rockchip RK3399, cpuset -l4-5):

test bench_kilobyte_baseline    ... bench:       1,167 ns/iter (+/- 41) = 877 MB/s
test bench_kilobyte_specialized ... bench:          80 ns/iter (+/- 0) = 12800 MB/s
test bench_megabyte_baseline    ... bench:   1,201,396 ns/iter (+/- 10,942) = 872 MB/s
test bench_megabyte_specialized ... bench:     185,709 ns/iter (+/- 197,317) = 5646 MB/s

matches a Ryzen 7 1700 @ 3.85GHz (well, in one test)

test bench_kilobyte_baseline    ... bench:         301 ns/iter (+/- 1) = 3401 MB/s
test bench_kilobyte_specialized ... bench:          80 ns/iter (+/- 0) = 12800 MB/s
test bench_megabyte_baseline    ... bench:     302,681 ns/iter (+/- 710) = 3464 MB/s
test bench_megabyte_specialized ... bench:      73,559 ns/iter (+/- 168) = 14254 MB/s

while the Cortex-A53 (@ 1.6GHz, Rockchip RK3399, cpuset -l0-3) is much slower than the A72:

test bench_kilobyte_baseline    ... bench:       1,922 ns/iter (+/- 52) = 532 MB/s
test bench_kilobyte_specialized ... bench:         229 ns/iter (+/- 0) = 4471 MB/s
test bench_megabyte_baseline    ... bench:   1,904,466 ns/iter (+/- 14,574) = 550 MB/s
test bench_megabyte_specialized ... bench:     563,171 ns/iter (+/- 10,137) = 1861 MB/s

and Cavium ThunderX (Scaleway's KVM VPS) has terrible CRC32 units in particular:

test bench_kilobyte_baseline    ... bench:       1,436 ns/iter (+/- 22) = 713 MB/s
test bench_kilobyte_specialized ... bench:         375 ns/iter (+/- 19) = 2730 MB/s
test bench_megabyte_baseline    ... bench:   2,332,481 ns/iter (+/- 618,310) = 449 MB/s
test bench_megabyte_specialized ... bench:     595,290 ns/iter (+/- 65,599) = 1761 MB/s

upd: my phone: Qualcomm Snapdragon 660 (Kryo V2, 2.2GHz, weird big.little management?):

test bench_kilobyte_baseline    ... bench:         916 ns/iter (+/- 19) = 1117 MB/s
test bench_kilobyte_specialized ... bench:         129 ns/iter (+/- 1) = 7937 MB/s
test bench_megabyte_baseline    ... bench:     951,838 ns/iter (+/- 28,575) = 1101 MB/s
test bench_megabyte_specialized ... bench:     124,165 ns/iter (+/- 4,731) = 8445 MB/s

upd: Amazon EC2 a1 instance (Graviton, also A72) — looks like more cache than RK3399

test bench_kilobyte_baseline    ... bench:       1,193 ns/iter (+/- 38) = 858 MB/s
test bench_kilobyte_specialized ... bench:          69 ns/iter (+/- 0) = 14840 MB/s
test bench_megabyte_baseline    ... bench:   1,114,682 ns/iter (+/- 3,522) = 940 MB/s
test bench_megabyte_specialized ... bench:      72,397 ns/iter (+/- 242) = 14483 MB/s

upd: Packet c2.large.arm (Ampere eMAG)

test bench_kilobyte_baseline    ... bench:         815 ns/iter (+/- 12) = 1256 MB/s
test bench_kilobyte_specialized ... bench:         103 ns/iter (+/- 0) = 9941 MB/s
test bench_megabyte_baseline    ... bench:     875,233 ns/iter (+/- 1,133) = 1198 MB/s
test bench_megabyte_specialized ... bench:      92,762 ns/iter (+/- 77) = 11303 MB/s

upd: Marvell MACCHIATObin (A72 @ 1.6GHz)

test bench_kilobyte_baseline    ... bench:       1,471 ns/iter (+/- 0) = 696 MB/s
test bench_kilobyte_specialized ... bench:          99 ns/iter (+/- 0) = 10343 MB/s
test bench_megabyte_baseline    ... bench:   1,541,176 ns/iter (+/- 18,950) = 680 MB/s
test bench_megabyte_specialized ... bench:     109,078 ns/iter (+/- 78,118) = 9613 MB/s

upd: Marvell MACCHIATObin (A72 @ 2.0GHz)

test bench_kilobyte_baseline    ... bench:       1,176 ns/iter (+/- 0) = 870 MB/s
test bench_kilobyte_specialized ... bench:          79 ns/iter (+/- 0) = 12962 MB/s
test bench_megabyte_baseline    ... bench:   1,212,772 ns/iter (+/- 13,830) = 864 MB/s
test bench_megabyte_specialized ... bench:      86,684 ns/iter (+/- 70,792) = 12096 MB/s

upd: Amazon EC2 m6g (Graviton2, Neoverse N1)

test bench_kilobyte_baseline    ... bench:         622 ns/iter (+/- 12) = 1646 MB/s
test bench_kilobyte_specialized ... bench:          60 ns/iter (+/- 0) = 17066 MB/s
test bench_megabyte_baseline    ... bench:     710,155 ns/iter (+/- 3,955) = 1476 MB/s
test bench_megabyte_specialized ... bench:      53,606 ns/iter (+/- 210) = 19560 MB/s

upd: Apple M1 Max (MacBook Pro) thanks weatherlight — impressive baseline, but unimpressive HW crc32 units

test bench_kilobyte_baseline    ... bench:         223 ns/iter (+/- 8) = 4591 MB/s
test bench_kilobyte_specialized ... bench:         100 ns/iter (+/- 0) = 10240 MB/s
test bench_megabyte_baseline    ... bench:     231,689 ns/iter (+/- 1,885) = 4525 MB/s
test bench_megabyte_specialized ... bench:     122,382 ns/iter (+/- 4,651) = 8568 MB/s

upd: Ryzen 9 5950X @ PBO for comparison

test bench_kilobyte_baseline    ... bench:         211 ns/iter (+/- 3) = 4853 MB/s
test bench_kilobyte_specialized ... bench:          62 ns/iter (+/- 2) = 16516 MB/s
test bench_megabyte_baseline    ... bench:     206,821 ns/iter (+/- 2,061) = 5069 MB/s
test bench_megabyte_specialized ... bench:      58,008 ns/iter (+/- 1,243) = 18076 MB/s
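As a reading aid for the bench output above, the MB/s column appears consistent with `bytes * 1000 / ns_per_iter` using integer division. The sketch below reproduces a few of the reported figures under that assumption. The input sizes (1024 bytes for the kilobyte benches, 1 MiB for the megabyte ones) are inferred from the bench names and the numbers, not stated in the thread, and `throughput_mb_s` is a hypothetical helper.

```rust
/// Hypothetical helper: throughput in MB/s from the bytes processed per
/// iteration and the measured nanoseconds per iteration.
/// Assumes integer division, which matches the figures quoted in the thread.
pub fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    // bytes/ns equals GB/s, so multiplying by 1000 gives MB/s.
    bytes * 1000 / ns_per_iter
}
```

For example, `throughput_mb_s(1024, 80)` gives 12800, matching the specialized kilobyte result on the RK3399's A72, and `throughput_mb_s(1024, 1167)` gives 877 for its baseline.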

srijs

Hi, thanks for this! The benchmarks certainly look promising.

I've left comments in-line to be addressed.

src/lib.rs

src/specialized/aarch64.rs

srijs · 2018-12-08T02:26:39Z

Forgot to say this, but if you wanted to use LLVM intrinsics instead of inline assembly, you may be able to use the link_llvm_intrinsics feature and then do something like this instead:

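extern {
    // Functions declared in an extern block are implicitly unsafe to call,
    // so no `unsafe` qualifier is written here.
    #[link_name = "llvm.aarch64.crc32x"]
    pub fn crc32x(a: i32, b: i64) -> i32;
}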

valpackett · 2018-12-08T15:12:07Z

Now using intrinsics and detection via stdsimd, which should be added by rust-lang/stdarch#612 :) So waiting on that.
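A minimal sketch of what "intrinsics plus detection" can look like is below. It is not the code from this PR. It assumes a toolchain where `core::arch::aarch64::__crc32b` and `is_aarch64_feature_detected!` are usable, processes one byte per instruction for clarity rather than speed, and uses hypothetical function names. On non-AArch64 targets only the bitwise fallback is compiled.

```rust
/// Bitwise software fallback (reflected polynomial 0xEDB88320).
fn crc32_sw(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    crc
}

/// Hardware path: one CRC32B instruction per byte (illustrative, not tuned).
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "crc")]
unsafe fn crc32_hw(mut crc: u32, data: &[u8]) -> u32 {
    use core::arch::aarch64::__crc32b;
    for &byte in data {
        crc = __crc32b(crc, byte);
    }
    crc
}

/// Hypothetical entry point: uses the hardware path only when detected.
pub fn crc32(data: &[u8]) -> u32 {
    let init = !0u32;
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("crc") {
            // SAFETY: the `crc` feature was detected at runtime just above.
            return !unsafe { crc32_hw(init, data) };
        }
    }
    !crc32_sw(init, data)
}
```

Both paths should agree on every input, so the standard check string `123456789` makes a convenient smoke test for either one.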

alexcrichton · 2018-12-08T16:46:47Z

src/specialized/aarch64.rs

+    let mut ptr4;
+    let mut ptr8;
+
+    if len != 0 && ((ptr as usize) & 1) != 0 {


Could this perhaps use the recently stabilized align_to method on slices to do the heavy lifting for the alignment logic here?

valpackett · 2018-12-08

Ooh, this is a very nice method! (And the chunks_exact iterator too.)

A quick attempt at using it here made performance significantly worse, though. I'll investigate that later.

valpackett · 2018-12-09

Oh, it just wasn't inlining the intrinsics' wrappers because I had removed the target_feature attr. lol.
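To make the align_to suggestion concrete, here is a rough sketch of the head/body/tail split it enables. It is not the PR's code: the step functions are portable software stand-ins for the byte and 64-bit CRC32 intrinsics, and all names are hypothetical. In the real implementation the steps would be intrinsic wrappers, and, as noted above, they would need to keep the `target_feature` attribute to inline well.

```rust
/// Software stand-in for a one-byte CRC-32 step (reflected polynomial 0xEDB88320).
fn crc32_step_u8(mut crc: u32, byte: u8) -> u32 {
    crc ^= u32::from(byte);
    for _ in 0..8 {
        // All ones when the low bit is set, zero otherwise.
        let mask = (crc & 1).wrapping_neg();
        crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
    }
    crc
}

/// Stand-in for an eight-byte step; feeds the word's bytes in memory order.
fn crc32_step_u64(crc: u32, word: u64) -> u32 {
    word.to_ne_bytes().iter().fold(crc, |c, &b| crc32_step_u8(c, b))
}

/// Hypothetical helper: CRC-32 over `data`, letting `align_to` find the
/// unaligned head, the u64-aligned body, and the leftover tail.
pub fn crc32_aligned(data: &[u8]) -> u32 {
    // SAFETY: every bit pattern is a valid u64, so viewing aligned bytes
    // as u64 words is sound.
    let (head, body, tail) = unsafe { data.align_to::<u64>() };
    let mut crc = !0u32;
    crc = head.iter().fold(crc, |c, &b| crc32_step_u8(c, b));
    crc = body.iter().fold(crc, |c, &w| crc32_step_u64(c, w));
    crc = tail.iter().fold(crc, |c, &b| crc32_step_u8(c, b));
    !crc
}

/// Reference implementation that never splits the input.
pub fn crc32_bytewise(data: &[u8]) -> u32 {
    !data.iter().fold(!0u32, |c, &b| crc32_step_u8(c, b))
}
```

Comparing `crc32_aligned` against `crc32_bytewise` on slices that start at different offsets exercises the head and tail handling.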

srijs · 2018-12-12T07:01:30Z

@alexcrichton What would the timeline look like to get the crc intrinsics change shipped to nightly?

alexcrichton · 2018-12-12T15:12:10Z

Hopefully soon!

srijs · 2018-12-18T10:29:13Z

@myfreeweb It looks like the crc* functions have now landed in nightly. Is that all that's needed, or do we need to wait for your changes from rust-lang/stdarch#611 to hit nightly as well?

valpackett · 2018-12-18T12:44:24Z

That's all for this project of course.

valpackett · 2018-12-20T21:12:11Z

Rebased and updated for the new intrinsic names (rust-lang/stdarch#626). Let's wait for them to land in nightly.

srijs · 2019-01-17T08:51:04Z

It looks like the change to the intrinsic names has landed in nightly 🎉

Let me know if you want any help pushing this over the finish line!

valpackett · 2019-01-17T16:40:00Z

Cool. Removed the temporary stdsimd usage. Should be good to go now I think.

srijs · 2019-01-19T04:01:07Z

Excellent, thanks for all your effort!

srijs requested changes Dec 8, 2018

valpackett mentioned this pull request Dec 8, 2018

Add AArch64 CRC32 intrinsics rust-lang/stdarch#612

Merged

valpackett changed the title ~~Add support for AArch64 CRC32 instructions using inline asm~~ Add support for AArch64 CRC32 instructions Dec 8, 2018

valpackett force-pushed the aarch64 branch 2 times, most recently from 188eab0 to 0fa7925 Compare December 8, 2018 15:07

alexcrichton reviewed Dec 8, 2018

valpackett force-pushed the aarch64 branch from 0fa7925 to 0d4aaa1 Compare December 9, 2018 16:39

valpackett force-pushed the aarch64 branch from 0d4aaa1 to 509bd30 Compare December 20, 2018 21:10

Add support for AArch64 CRC32 instructions

0c44594

valpackett force-pushed the aarch64 branch from 509bd30 to 0c44594 Compare January 17, 2019 16:27

srijs merged commit c49bba0 into srijs:master Jan 19, 2019

valpackett mentioned this pull request Jul 5, 2020

ARM NEON (AdvSIMD) support nickbabcock/highway-rs#29

Closed
