aarch64: Revert "Use CASP instead of LDXP/STXP for store/RMW if available" on Apple hardware #89

taiki-e · 2023-03-25T17:52:06Z

As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much slower than LL/SC loop-based RMW.
So, revert 93e6ec5 (#70) on Apple hardware for now.

Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50.
Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Collecting 100
bench_portable_atomic_arch/u128_concurrent_swap
                        time:   [1.4365 ms 1.5230 ms 1.5948 ms]
                        change: [+378.49% +402.44% +424.27%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) low severe
  4 (4.00%) low mild
Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Warming u
Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Collectin
bench_portable_atomic_arch/u128_concurrent_store_swap
                        time:   [286.21 µs 292.15 µs 296.65 µs]
                        change: [+96.504% +102.53% +108.63%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Collecting
bench_portable_atomic_arch/u128_concurrent_fetch_add
                        time:   [1.6351 ms 1.6787 ms 1.7170 ms]
                        change: [+279.29% +294.80% +309.89%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) low severe
  4 (4.00%) low mild
  1 (1.00%) high mild

…able" on Apple hardware As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much slower than LL/SC loop-based RMW. So, revert 93e6ec5 on Apple hardware for now. ``` Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50. Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Collecting 100 bench_portable_atomic_arch/u128_concurrent_swap time: [1.4365 ms 1.5230 ms 1.5948 ms] change: [+378.49% +402.44% +424.27%] (p = 0.00 < 0.05) Performance has regressed. Found 12 outliers among 100 measurements (12.00%) 8 (8.00%) low severe 4 (4.00%) low mild Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Warming u Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Collectin bench_portable_atomic_arch/u128_concurrent_store_swap time: [286.21 µs 292.15 µs 296.65 µs] change: [+96.504% +102.53% +108.63%] (p = 0.00 < 0.05) Performance has regressed. Found 11 outliers among 100 measurements (11.00%) 6 (6.00%) low severe 2 (2.00%) low mild 1 (1.00%) high mild 2 (2.00%) high severe Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50. Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Collecting bench_portable_atomic_arch/u128_concurrent_fetch_add time: [1.6351 ms 1.6787 ms 1.7170 ms] change: [+279.29% +294.80% +309.89%] (p = 0.00 < 0.05) Performance has regressed. Found 13 outliers among 100 measurements (13.00%) 8 (8.00%) low severe 4 (4.00%) low mild 1 (1.00%) high mild ```

teohhanhui · 2023-10-22T19:15:19Z

build.rs

+
+            // As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much slower than LL/SC
+            // loop-based RMW: https://github.com/taiki-e/portable-atomic/pull/89
+            if is_macos || target_os == "ios" || target_os == "tvos" || target_os == "watchos" {


This misses Asahi Linux...

And is still the case:

portable-atomic/build.rs

Line 164 in 35ca3ed

let is_apple = is_macos || target_os == "ios" || target_os == "tvos" || target_os == "watchos";

Is there a way to know that the code is being compiled against Asahi Linux at build time?

Otherwise, it would be reasonable to set --cfg portable_atomic_ll_sc_rmw on the user's side.

There is an option to revert the whole 93e6ec5, but we want to keep the same codegen as LLVM if possible (LLVM uses CASP regardless of whether it is Apple hardware, though), and we also need to make sure that it does not lose performance on hardware that CASP is fast.

Is there a way to know that the code is being compiled against Asahi Linux at build time?

I can think of is:

target_vendor. If custom targets are used.

-C target-cpu=apple-*. If the user uses it.

Both cases can be detected at build time, but I would like to know which one Asahi Linux users actually use.

I've implemented the latter approach in 5c3a43b.

$ RUSTFLAGS='-C target-cpu=apple-m1' cargo build -vv --target aarch64-unknown-linux-gnu | grep portable_atomic_ll_sc_rmw [portable-atomic 1.4.3] cargo:rustc-cfg=portable_atomic_ll_sc_rmw(build)

EDIT: Published in 1.5.0.

@teohhanhui Was the above approach sufficient for you? Or was it not?

I'm not using any RUSTFLAGS, so...

To be clear:

Those specified in .cargo/config (or .cargo/config.toml) are also taken into account.

With the builtin aarch64-unknown-linux-{gnu,musl} target, if you specify neither target-cpu nor target-feature, the default one (LL/SC) will be used and you should get good performance on apple hardwere.

#89 (comment)

taiki-e added the O-arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state label Mar 25, 2023

taiki-e force-pushed the aarch64-apple branch from ceccecb to 206655e Compare March 25, 2023 17:52

taiki-e merged commit 90e2b0e into main Mar 25, 2023

taiki-e deleted the aarch64-apple branch March 25, 2023 18:56

teohhanhui reviewed Oct 22, 2023

View reviewed changes

taiki-e added a commit that referenced this pull request Oct 23, 2023

aarch64: Recognize target-cpu=apple-* in build script

5c3a43b

#89 (comment)

taiki-e added O-aarch64 Target: Armv8-A, Armv8-R, or later processors in AArch64 mode and removed O-arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state labels Sep 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

aarch64: Revert "Use CASP instead of LDXP/STXP for store/RMW if available" on Apple hardware #89

aarch64: Revert "Use CASP instead of LDXP/STXP for store/RMW if available" on Apple hardware #89

taiki-e commented Mar 25, 2023 •

edited

Loading

teohhanhui Oct 22, 2023 •

edited

Loading

taiki-e Oct 23, 2023

taiki-e Oct 23, 2023

taiki-e Oct 23, 2023 •

edited

Loading

taiki-e Oct 23, 2023 •

edited

Loading

taiki-e Dec 1, 2023

teohhanhui Dec 1, 2023

taiki-e Dec 1, 2023 •

edited

Loading

aarch64: Revert "Use CASP instead of LDXP/STXP for store/RMW if available" on Apple hardware #89

aarch64: Revert "Use CASP instead of LDXP/STXP for store/RMW if available" on Apple hardware #89

Conversation

taiki-e commented Mar 25, 2023 • edited Loading

teohhanhui Oct 22, 2023 • edited Loading

Choose a reason for hiding this comment

taiki-e Oct 23, 2023

Choose a reason for hiding this comment

taiki-e Oct 23, 2023

Choose a reason for hiding this comment

taiki-e Oct 23, 2023 • edited Loading

Choose a reason for hiding this comment

taiki-e Oct 23, 2023 • edited Loading

Choose a reason for hiding this comment

taiki-e Dec 1, 2023

Choose a reason for hiding this comment

teohhanhui Dec 1, 2023

Choose a reason for hiding this comment

taiki-e Dec 1, 2023 • edited Loading

Choose a reason for hiding this comment

taiki-e commented Mar 25, 2023 •

edited

Loading

teohhanhui Oct 22, 2023 •

edited

Loading

taiki-e Oct 23, 2023 •

edited

Loading

taiki-e Oct 23, 2023 •

edited

Loading

taiki-e Dec 1, 2023 •

edited

Loading