Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nontemporal stores behave incorrectly in their interaction with concurrency primitives #64521

Open
RalfJung opened this issue Aug 8, 2023 · 5 comments
Labels
documentation llvm Umbrella label for LLVM issues

Comments

@RalfJung
Copy link
Contributor

RalfJung commented Aug 8, 2023

The LLVM LangRef doesn't document how !nontemporal stores are intended to interact with concurrency primitives. The current interactions are extremely surprising, basically making !nontemporal stores even less ordered than "non-atomic" stores:

Thread A:
  store i32 %v, ptr %p, !nontemporal !13
  fence release
  // set some global flag (relaxed write)

Thread B:
  // wait till global flag is set (relaxed read)
  fence acquire
  %_0 = load i32, ptr %p, !noundef !11

According to all the usual concurrency rules, that last load must see the store. However, the way LLVM compiles this program on x86, it has a data race: the fences become NOPs and the relaxed accesses become regular MOV, so we end up with MOVNT; MOV in thread A, which the CPU is allowed to reorder (see e.g. this long and detailed post on MOVNT) -- meaning that thread B might see the flag write but then fail to see the data store!

In other words, MOVNT violates TSO, but the compilation scheme LLVM (and everyone else) uses for release/acquire synchronization relies on TSO. Together this leads to rather unpredictable semantics. Are nontemporal stores meant to completely bypass normal memory model rules (in which case they are super dangerous to use anywhere), or are they meant to follow the usual rules (in which case LLVM needs to ensure there is an sfence between each nontemporal store and later release operations)?

@workingjubilee
Copy link

To be clear, this is slightly more than just a documentation bug, as

  • !nontemporal is a cross-ISA i.e. LLVM language annotation
  • atomic store ... release or fence release are lowered to nops on x86 targets, due to assuming the ISA implements TSO for all store operations, instead of a highly modified interpretation that allows selectively weakening store-store ordering in the presence of the MOVNT* family of instructions
    • these TSO-weakening instructions can be selected during lowering of store ... !temporal
    • in the presence of weakened TSO, atomic synchronization instead requires SFENCE
  • the only way of forcing a cross-ISA store-store synchronization that would get lowered correctly is the use of a much stronger ordering, like fence seq_cst!
  • the only way to resolve this as a frontend without incurring the wastefully stronger fence is to use architecture-specific intrinsics like llvm.x86.sse.sfence or assembly, neither of which are resilient when targeting multiple architectures, as other ISAs have added nontemporal stores also (fortunately, they are ones which already had weaker atomic semantics, so fence release should still work... so far)
  • that is, this has to be at least one of:
    • unsoundness
    • a missed optimization
    • ergonomics that makes simply writing machine-specific assembly look good

And Ralf's primary concern is about load and store reorderings due to optimizations not being mindful of these weaker-than-TSO orderings described by MOVNTDQ, MOVNTI, and so on, and the need for SFENCE to not be moved in some cases. It's not even clear how much we can rely on inline assembly (when emitted as separate statements or present in separate functions) not being reordered or elided by such compiler optimizations.

Thus, Ralf saying, "The LLVM LangRef doesn't document how so-and-so interacts with concurrency..." might, because he is a concurrency expert yet does not feel he knows how to resolve the concurrency problems here, be rendered in a more conversational dialect of English more like, "So, how fucked are we? Because this situation seems fucked up beyond all recognition."

Ideally, there would be a way to force store !nontemporal to come with emitting a metadata node of some kind that indicates the need for SFENCE. This could then be optimized to, upon encountering multiple store !nontemporal in close succession (say, after inlining), only emitting one actual SFENCE, until it reaches a fence release, atomic store ... release, or anything that would emit a much stronger barrier instead. It would also be eligible for usage in something analogous to the existing VZEROUPPER-insertion pass, to prevent returning to safe code without first reaching SFENCE.

@RalfJung
Copy link
Contributor Author

RalfJung commented Aug 9, 2023

Yeah, the current status seems to be that a store !nontemporal not followed by an sfence is just full UB, since release writes and fences fail to have their documented behavior. If a store !nontemporal is eventually followed by an sfence, the machine is still in a strange state until that happens (namely a state where release operations don't behave as documented), and it is unclear to me if that can cause other issues (e.g. due to the compiler inserting release operations there -- obviously LLVM optimizations can generally assume that release always works as documented).

other ISAs have added nontemporal stores also (fortunately, they are ones which already had weaker atomic semantics, so fence release should still work... so far)

I haven't looked into that at all -- how "strange" do nontemporal stores behave on those ISAs? Are they somehow weaker than regular non-atomic accesses, or is it truly just a hint to the CPU to not keep this cacheline cached?

@RalfJung RalfJung changed the title Missing documentation on how nontemporal stores interact with concurrency primitives nontemporal stores interact poorly with concurrency primitives Aug 9, 2023
@RalfJung
Copy link
Contributor Author

RalfJung commented Aug 9, 2023

I also haven't looked into nontemporal loads at all, and whether they can cause any havoc.

@workingjubilee
Copy link

workingjubilee commented Aug 10, 2023

re: nontemporal stores on AArch64:

  • no special caveats, a "true hint"
  • the architecture definitionally uses weakly ordered stores, so DMB ("Data Memory Barrier") is still needed to sort things out
  • a note that shouldn't matter: there are slightly wider "access bands" the CPU is allowed to speculatively examine, these are always aligned though so we don't need to worry about e.g. hitting the next page, and it shouldn't be user-visible

re: nontemporal loads on AArch64:

  • weaker-than-normal load ordering: code can observe reorderings if the nontemporal loads interact with address dependencies ("load address, then load from that address")
  • DMB still sorts things out as usual

@chorman0773
Copy link

Wasn't it noted that it wasn't just dependencies that don't interact properly, but a stale value of the address register could be used?

@RalfJung RalfJung changed the title nontemporal stores interact poorly with concurrency primitives nontemporal stores behave incorrectly in their interaction with concurrency primitives Sep 6, 2023
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Aug 12, 2024
…ouxu,Amanieu,Jubilee

nontemporal_store: make sure that the intrinsic is truly just a hint

The `!nontemporal` flag for stores in LLVM *sounds* like it is just a hint, but actually, it is not -- at least on x86, non-temporal stores need very special treatment by the programmer or else the Rust memory model breaks down. LLVM still treats these stores as-if they were normal stores for optimizations, which is [highly dubious](llvm/llvm-project#64521). Let's avoid all that dubiousness by making our own non-temporal stores be truly just a hint, which is possible on some targets (e.g. ARM). On all other targets, non-temporal stores become regular stores.

~~Blocked on rust-lang/stdarch#1541 propagating to the rustc repo, to make sure the `_mm_stream` intrinsics are unaffected by this change.~~

Fixes rust-lang#114582
Cc `@Amanieu` `@workingjubilee`
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Aug 12, 2024
Rollup merge of rust-lang#128149 - RalfJung:nontemporal_store, r=jieyouxu,Amanieu,Jubilee

nontemporal_store: make sure that the intrinsic is truly just a hint

The `!nontemporal` flag for stores in LLVM *sounds* like it is just a hint, but actually, it is not -- at least on x86, non-temporal stores need very special treatment by the programmer or else the Rust memory model breaks down. LLVM still treats these stores as-if they were normal stores for optimizations, which is [highly dubious](llvm/llvm-project#64521). Let's avoid all that dubiousness by making our own non-temporal stores be truly just a hint, which is possible on some targets (e.g. ARM). On all other targets, non-temporal stores become regular stores.

~~Blocked on rust-lang/stdarch#1541 propagating to the rustc repo, to make sure the `_mm_stream` intrinsics are unaffected by this change.~~

Fixes rust-lang#114582
Cc `@Amanieu` `@workingjubilee`
bjorn3 pushed a commit to rust-lang/rustc_codegen_cranelift that referenced this issue Aug 15, 2024
…ieu,Jubilee

nontemporal_store: make sure that the intrinsic is truly just a hint

The `!nontemporal` flag for stores in LLVM *sounds* like it is just a hint, but actually, it is not -- at least on x86, non-temporal stores need very special treatment by the programmer or else the Rust memory model breaks down. LLVM still treats these stores as-if they were normal stores for optimizations, which is [highly dubious](llvm/llvm-project#64521). Let's avoid all that dubiousness by making our own non-temporal stores be truly just a hint, which is possible on some targets (e.g. ARM). On all other targets, non-temporal stores become regular stores.

~~Blocked on rust-lang/stdarch#1541 propagating to the rustc repo, to make sure the `_mm_stream` intrinsics are unaffected by this change.~~

Fixes rust-lang/rust#114582
Cc `@Amanieu` `@workingjubilee`
ackxolotl added a commit to ackxolotl/memcpy that referenced this issue Aug 20, 2024
We definitely need fences there, see:
https://doc.rust-lang.org/core/arch/x86/fn._mm_sfence.html

Even more interesting, the discussion on NT stores in Rust:
rust-lang/rust#114582

And on broken NT stores in LLVM:
llvm/llvm-project#64521

Oh boy ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation llvm Umbrella label for LLVM issues
Projects
None yet
Development

No branches or pull requests

4 participants