-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add unaligned volatile intrinsics #52391
Conversation
Huh? Volatile accesses don't synchronize, so how can these intrinsics not be a source of data races in this context? |
@rkruppe I have a large array of Each individual byte is independent and there is no memory ordering requirement between them. I use an The load needs to be volatile because the underlying bytes may be concurrently modified by other threads as they are being read. |
I'm fairly certain that's a data race, I don't see how the later fence can help. The "right" way would be an unaligned atomic load, but that's rarely supported and especially not for anything larger than word size. Because of that, I am actually skeptical whether
So it would appear that even |
@rkruppe I don't need the full 16 byte load to be atomic: it's fine if each of the 16 bytes is loaded individually in an arbitrary order. In other words, I want to perform 16 individual relaxed |
I see, thanks for clarifying. But at the same time I still don't see how the Intel language could be read to guarantee that In any case, even if this turns out to be guaranteed by Intel, I still don't think a volatile access + acq fence is sufficient to rule out language/IR level UB. I would recommend side-stepping the issue by using |
16 separate byte loads with relaxed ordering is literally the weakest restriction that can possibly put on such a load: you cannot split it up any further, and there is no restriction on the order that the individual bytes are loaded. If the hardware loads multiple bytes at once then that only makes the access stronger, not weaker. In fact, this is exactly what the Intel documentation states:
The reason I can't just use a normal unaligned load is that the underlying bytes can be changed by another thread at any time, even in the middle of a load. By default, the compiler assumes that normal memory accesses are data-race free and two consecutive loads will always return the same value, which is not the case here. The use of the @alexcrichton So yes, I still need this for my use case. We aren't stabilizing anything here, I would just like to have this available so that I can experiment on nightly. |
This assumes byte-sized accesses are the smallest possible access at the relevant hardware levels, which seems natural but doesn't appear to be explicitly stated anywhere in the memory model document. There may also be other implicit assumptions I don't see right now. This all sounds a bit remote, admittedly, but as I said, this is an extremely dark and obscure corner of memory models to begin with. I wouldn't be surprised if none of the people who wrote that document have ever actually thought about this specific scenario. In any case, let's leave the ISA details behind us, I'll go on assuming you are right. Even then I object to this intrinsic.
This is incorrect. Volatile accesses don't do anything like that, they have "no cross-thread synchronization behavior" (LLVM LangRef) and there is no exception carved out for volatile accesses in the Rust-level statement "data races are UB". In LLVM there is a vague bullet point that the byte value read by a volatile read "is target dependent", but this is justified by referencing signal handlers and "addresses that do not behave like normal memory", with no mention of cross-thread communication through ordinary memory. While volatile is always murky and partially implementation-defined and compilers tend to err on the side of caution, it is factually incorrect that volatile side-steps memory ordering concerns.
Yes, a volatile access is (if possible) lowered to a single load or store, but that has nothing to do with data race freedom. Furthermore, I don't care much for precedent from the Linux kernel when it comes to matters of standard/rules lawyering. The kernel is very happy to rely on GCC implementation-defined behavior, and even altering GCC's default behavior when they don't like it (e.g., strict aliasing rules, or null dereferences being UB).
Again, you can use inline assembly (with the And as a bonus, we won't have this huge footgun lying around. Concurrency aside, on quite a few ISAs unaligned loads and stores don't even exist, so anyone (correctly) using volatile to e.g. do MMIO (where the "single instruction" rule would be relevant) would be screwed if they tried to use the unaligned variant for that. |
From your link, you can see that if R is not atomic and R_byte may see more than one write, then it falls through to the last bullet point and R_byte returns However this does not apply when the access is volatile, instead the result is target-dependent. In practice, this means that the result is whatever would be returned by a load instruction on that target, which is never This is of course separate from any memory orderings on the load itself (e.g.
I want to use this on more than just x86, I only used
For architectures without unaligned memory access, LLVM will lower this to a sequence of single byte load. Also I disagree that this is a huge footgun, we are only adding an unstable intrinsic in the PR. We are not stabilizing anything yet. |
That's not the Rust definition of data races though, so not relevant unless Rust defers to LLVM here. But it's not even clear whether Rust is using LLVM's definition of happens-before etc. or C11's or C++11's or something subtly different, much less whether we're treating volatile accesses the same way. I mostly referenced the LLVM document to point out that even there, doing cross-thread synchronization with target ISA semantics is explicitly not the intended use of volatile accesses.
Again, I don't see how the later fence can establish any happens-before for the earlier load. Furthermore, your argument up to this point seemed to be that happens-before doesn't matter because your load is going straight to the target hardware semantics, so why are we back to caring about happens-before? (For that matter, wouldn't you also need the writes to be volatile to ensure they work by target machine semantics?)
There, too, you will have to reason about the target ISA's memory model to verify whether your algorithm is correct. So it seems like a feature, not a bug, that you'd need a separate bit of asm for every architecture.
Yes, that's precisely why it would be completely wrong to use for anyone who wants to produce a single machine level memory access.
You'd be using it, others can too. I'm particularly worried since in other contexts unaligned accesses are a safer default (since they assume strictly less), but here it's different depending on target ISA details. |
Ping from triage! What's the status of this PR? |
The PR is fine and the added functions do exactly what the name says: it's a volatile load/store with an alignment of 1.
I don't see how this is any different from |
📌 Commit 303306c has been approved by |
@bors rollup |
…richton Add unaligned volatile intrinsics Surprisingly enough, it turns out that unaligned volatile loads are actually useful for certain (very niche) types of lock-free code. I included unaligned volatile stores for completeness, but I currently do not know of any use cases for them. These are only exposed as intrinsics for now. If they turn out to be useful in practice, we can work towards stabilizing them. r? @alexcrichton
Rollup of 7 pull requests Successful merges: - #52391 (Add unaligned volatile intrinsics) - #52402 (impl PartialEq+Eq for BuildHasherDefault) - #52645 (Allow declaring existential types inside blocks) - #52656 (Stablize Redox Unix Sockets) - #52658 (Prefer `Option::map`/etc over `match` wherever it improves clarity) - #52668 (clarify pointer offset function safety concerns) - #52677 (Release notes: add some missing 1.28 libs stabilization) Failed merges: r? @ghost
That's just wrong. The Linux kernel relies on GCC-specific behavior. volatile has nothing to do with concurrency. It is never correct to use volatile to get rid of data races. The only effect of volatile is to prevent the compiler from duplicating or removing the accesses. It has no bearing on the relative ordering of other, non-volatile accesses (including atomic accesses). If that is the only motivation for this PR, I think it should get reverted. I mean, I can totally see that unaligned volatile accesses are useful, e.g. for memory-mapped IO registers. They are just never useful for lock-free code. |
Looks like there's been some confusion. Perhaps we should look at a concrete example in So this part of the
So the crucial question here is: How should the value in first step be read? I'm going to offer three options:
The first one is dubious - maybe it's OK but I wouldn't bet on it. What do you think - is the second option okay? |
Ah, IIRC that one is famous for causing trouble. ;) Sounds somewhat similar to a seq lock to me, which ha a similar problem (doing a read that is racy but checking later).
If it happens through an atomic load (option 3), your reasoning is fine, I think. Can you not use some If this happens through a non-atomic access, unfortunately data races are insta-UB no matter what you do with the value. The docs say we are following the C11 memory model, and this is how that model works. There are optimizations which are incorrect if you weaken this to "read-write races make the read return So, just from the reference/standard, option 1 and 2 are equally UB. Whether they lead to actual problems is a different question, but I think we should not play games of the sort "but the compiler will never be smart enough to exploit this" -- in particular not with concurrency where exhaustive testing is pretty much impossible. |
That said, the only optimization I am aware of that would break this is a redundant read elimination, which is not valid when using volatile loads. Still, that's not a theorem. The only ways to actually follow the standard are:
unsafe fn read_atomic_relaxed(ptr: *const u8) -> u8 {
use std::sync::atomic;
let aptr = ptr as *const atomic::AtomicU8;
(&*aptr).load(atomic::Ordering::Relaxed)
} I could also imagine having |
I opened a PR to clarify the documentation of the stable volatile pointer methods: #53022 |
Thank you for clarifying, that's very helpful!
Right. Incidentally, we're also facing the same problem in a seq lock here. :)
So in Chase-Lev and seq lock we have read-write races. We could just wrap data in The current blockers for this are:
Unfortunately, What would you suggest to do at this moment? |
Heh. ;) You might be interested in http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf
Precisely. And then only access the inner value after checking that there was no race. This solves undef/poison, but not the data race issue.
I'm working on that. ;)
The same goes for the intrinsic you added here, though?
Hm, fair point. Coming from a formal/theory background, I tend to sometimes see things very black -and-white: If the theorem doesn't hold, it's false and all hope is lost. So please stand by while I reconfigure to a more applied researcher... ;) I suppose using Other than that, you could... push stabilization of Stuff (TM)? ;) |
The problem with using My intent with unaligned volatile loads was to perform a 16-byte load in a single instruction, but without any requirement that the load be atomic (which is the case if the address is unaligned). Where the memory model allows a read to "see" more than one write, each byte of the load is allowed to return the corresponding byte from any one of those writes (different bytes may return data from different writes). This requires slightly stronger guarantees than just returning |
I understand. However, according to C11 concurrency, and hence in Rust, a non-atomic load racing with an atomic write is undefined behavior. Not "unspecified value" or so, but full UB.
No, it does not:
So certainly for non-volatile, the result is undef. For volatile, it doesn't specify, but note the part about not providing cross-thread synchronization. |
Not disagreeing with you, but: Relaxed atomic accesses do not provide cross-thread synchronization either (i.e. they don't establish happens-before relationships), so I believe that sentence is about a completely separate issue. I just took a peek into the C11 standard and it seems you're right about data races (more specifically, "torn" reads) being instant UB, even if the data is never read. So we really need that |
volatile operations docs: clarify that this does not help wrt. concurrency Triggered by rust-lang#52391. Cc @stjepang @Amanieu Should the intrinsics themselves also get more documentation? They generally do not seem to have much of that.
volatile operations docs: clarify that this does not help wrt. concurrency Triggered by rust-lang#52391. Cc @stjepang @Amanieu Should the intrinsics themselves also get more documentation? They generally do not seem to have much of that.
Surprisingly enough, it turns out that unaligned volatile loads are actually useful for certain (very niche) types of lock-free code. I included unaligned volatile stores for completeness, but I currently do not know of any use cases for them.
These are only exposed as intrinsics for now. If they turn out to be useful in practice, we can work towards stabilizing them.
r? @alexcrichton