-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
aarch64: Revert "Use CASP instead of LDXP/STXP for store/RMW if available" on Apple hardware #89
Conversation
…able" on Apple hardware As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much slower than LL/SC loop-based RMW. So, revert 93e6ec5 on Apple hardware for now. ``` Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.9s, enable flat sampling, or reduce sample count to 50. Benchmarking bench_portable_atomic_arch/u128_concurrent_swap: Collecting 100 bench_portable_atomic_arch/u128_concurrent_swap time: [1.4365 ms 1.5230 ms 1.5948 ms] change: [+378.49% +402.44% +424.27%] (p = 0.00 < 0.05) Performance has regressed. Found 12 outliers among 100 measurements (12.00%) 8 (8.00%) low severe 4 (4.00%) low mild Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Warming u Benchmarking bench_portable_atomic_arch/u128_concurrent_store_swap: Collectin bench_portable_atomic_arch/u128_concurrent_store_swap time: [286.21 µs 292.15 µs 296.65 µs] change: [+96.504% +102.53% +108.63%] (p = 0.00 < 0.05) Performance has regressed. Found 11 outliers among 100 measurements (11.00%) 6 (6.00%) low severe 2 (2.00%) low mild 1 (1.00%) high mild 2 (2.00%) high severe Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50. Benchmarking bench_portable_atomic_arch/u128_concurrent_fetch_add: Collecting bench_portable_atomic_arch/u128_concurrent_fetch_add time: [1.6351 ms 1.6787 ms 1.7170 ms] change: [+279.29% +294.80% +309.89%] (p = 0.00 < 0.05) Performance has regressed. Found 13 outliers among 100 measurements (13.00%) 8 (8.00%) low severe 4 (4.00%) low mild 1 (1.00%) high mild ```
|
||
// As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much slower than LL/SC | ||
// loop-based RMW: https://github.com/taiki-e/portable-atomic/pull/89 | ||
if is_macos || target_os == "ios" || target_os == "tvos" || target_os == "watchos" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This misses Asahi Linux...
And is still the case:
Line 164 in 35ca3ed
let is_apple = is_macos || target_os == "ios" || target_os == "tvos" || target_os == "watchos"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to know that the code is being compiled against Asahi Linux at build time?
Otherwise, it would be reasonable to set --cfg portable_atomic_ll_sc_rmw
on the user's side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an option to revert the whole 93e6ec5, but we want to keep the same codegen as LLVM if possible (LLVM uses CASP regardless of whether it is Apple hardware, though), and we also need to make sure that it does not lose performance on hardware that CASP is fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to know that the code is being compiled against Asahi Linux at build time?
I can think of is:
- target_vendor. If custom targets are used.
-C target-cpu=apple-*
. If the user uses it.
Both cases can be detected at build time, but I would like to know which one Asahi Linux users actually use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented the latter approach in 5c3a43b.
$ RUSTFLAGS='-C target-cpu=apple-m1' cargo build -vv --target aarch64-unknown-linux-gnu | grep portable_atomic_ll_sc_rmw
[portable-atomic 1.4.3] cargo:rustc-cfg=portable_atomic_ll_sc_rmw(build)
EDIT: Published in 1.5.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@teohhanhui Was the above approach sufficient for you? Or was it not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not using any RUSTFLAGS
, so...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear:
- Those specified in .cargo/config (or .cargo/config.toml) are also taken into account.
- With the builtin aarch64-unknown-linux-{gnu,musl} target, if you specify neither target-cpu nor target-feature, the default one (LL/SC) will be used and you should get good performance on apple hardwere.
As of Apple M1/M1 Pro, on Apple hardware, CAS loop-based RMW is much slower than LL/SC loop-based RMW.
So, revert 93e6ec5 (#70) on Apple hardware for now.