-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: AVX512 register state clobbered by signal on macOS #49233
Comments
Hmm, this may be the @vsivsi Can you try patching out |
The runtime doesn't async preempt assembly functions. asyncPreempt (the function in runtime/preempt_amd64.s) should not be executed if it is running in an assembly function. The runtime sends a signal, but if the signal lands in an assembly function the signal handler will return immediately without actually preempt. I suspect that the darwin kernel doesn't preserve AVX512 state when handling signals. Then the problem would not be preemption, but any asynchronous signal could cause it to fail, including profiling signals or user signals. What version of macOS are you running on? I have a vague memory that Apple suggests that before using AVX512 one must check its availability with some sysctl (but I'm not really sure). Do we do that? |
@randall77 I can try that, but the initial issue I encountered (as reflected in the repro case) was clobbering of the K0-K7 opmask registers, which are distinct from the vector registers affected by @cherrymui Yes, Darwin requires special checking to identify AVX512 support. I submitted a fix for this last year. See here for details on that saga: #43089 It is quite possible that this issue I'm seeing is new with golang 1.17. Is it possible that the new assembly adapter funcs required for inter-operation with the new function parameter passing spec could be interfering with the assembly code detection during preemption?
|
How we detect asm routines during preempt changed in Go 1.18. (c2483a5#diff-47fd68949147d260b91998c0d6eabffcdb74c58991fc386b40e14a6ed710b17d) Does this reproduce at tip? (I'm AFK.) |
It does, with:
|
Preemptively disable AVX512 until golang/go#49233 has been resolved. This potentially affects reedsolomon, simdjson, sha256-simd, md5-simd packages. Init order requires a separate package since main itself is initialized last, but imports are initialized in the order they are imported from main (confirmed).
Preemptively disable AVX512 until golang/go#49233 has been resolved. This potentially affects reedsolomon, simdjson, sha256-simd, md5-simd packages. Init order requires a separate package since main itself is initialized last, but imports are initialized in the order they are imported from main (confirmed).
It seems unlikely. You can try building with Go 1.17 with Also, you can try running the program with preemption disabled but CPU profiling enabled and see if profiling signal changes anything. You can also try running the program with preemption disabled while having a separate process sending SIGURG signals to it. Thanks. |
@cherrymui Good suggestions. Running the repro test case as:
This leads to the test failing, and with different looking stats (each iteration fails earlier in the loop on average, probably due to different signaling rates between profiling and preemption). Two other data points:
|
@vsivsi thanks for the repro! So it seems it is not related preemption but signals in general. The observation about K0-K7 also suggests VZEROUPPER is not the cause. Would it be possible to write a similar test in C and assembly (GNU syntax) and see if it is failing? Maybe we need a different CPU feature probe to tell the kernel we're using K0-K7? (I haven't looked at the code carefully. Just a guess.) |
@cherrymui The mechanism for using AVX-512 in Darwin doesn't involve telling the kernel which features to enable. The kernel simply advertises the available AVX-512 features in the process commpage data area. Once you've verified that a given required set of AVX-512 features is available, you simply use them. Darwin catches the UD interrupt for any AVX-512 instruction on first use, and promotes that thread to use the full AVX512 thread state, enabling every available feature. Specifically, the three XCR0[5:7] bits (from XGETBV) indicate whether XSAVE will preserve:
Upon thread promotion following the UD interrupt from a supported AVX512 instruction, Darwin sets XCR0[5:7] = 111b, for the thread, enabling the full "AVX512 state" in XSAVE. |
The kernel definitions of these XCR0 bits are here: The full AVX512 enabled mask is composed here: And this is the function that runs when a UD for an AVX512 instruction is trapped, triggering the XSAVE state promotion: |
@cherrymui Here is an attempted reproduction of this issue in GNU C and Asm. In my testing it does not fail. https://gist.github.com/vsivsi/8511aca1bac528f49fbb45a636afa4b5
|
Thanks for trying that. In your C code, could you install a signal handler for SIGURG (which could simply return immediately)? Thanks. |
@cherrymui Gist updated with empty SIGURG handler. Now it fails as with go runtime:
|
This reminds me of #37174 (comment) |
I'm sure you all are happy that this doesn't appear to be a problem in golang, but it still seems like a problem for golang, no? |
Thanks @vsivsi ! That indicates the kernel's signal handling code doesn't seem to do the right thing. |
So what's the next step here? Throwing a process into an indeterminate state via external signal feels like a CVE-level issue. |
If there is a simple workaround that we could do it in the runtime, we'll do it, like the VZEROUPPER for #37174 . I don't know how to work around this one, though. One probably has to either not use K0-K7 registers or block all signals. I guess the next step would be reporting to Apple and have them fix the kernel. |
While they're at it, they should fix this one too: #42649 Also, not using the K0-K7 opmasks == not using AVX-512 for all practical purposes. |
We need to report a bug to Apple. I can do that - I've done it before, and supposedly a report from Google has a bit more visibility than one from random internet user. |
Seemed like a long-shot, but I just re-tested after updating MacOS with this morning's security updates.
|
I'm nearly certain I've found the kernel bug responsible for this issue by inspection. It looks like it was introduced in Catalina 10.15.6, and has always existed in Big Sur 11.x. I'm presently installing Monterey on one of my MacBook Pros to see what the status there is, since Apple hasn't yet released the Darwin source for MacOS 12.x. Edit to confirm: The issue also exists in Monterey 12.0.1 / Darwin 21.1.0 (21A559). It seems not to exist in Mojave 10.14.x or before, although that would only impact the iMac Pro, since that was the only pre-10.15 machine with AVX-512. The issue is a set of changes introduced in 10.15.6 that are aimed at improving performance when AVX-512 is on, but the thread isn't using the ZMM extended state:
This is analogous to your VZEROUPPER optimization for eliminating phantom SSE4 dependencies. But, when the code detects that ZMM_Hi256 state and Hi16_ZMM state are all zeros, it disables XSAVE restore of the "opmask state" as well. This is not a safe assumption, as it is perfectly valid to have non-zero values in K0-K7 while having all zeroed vector registers (as our various reproductions likely illustrate). With AVX-512VL, it is also valid to use EVEX encoded AVX-512 instructions, with opmasks, while only touching the AVX YMM vector state. My hypothesis above is confirmed by adding a single instruction to the reproduction assembly code sent to Apple: testmask.s .globl _done
.globl _masktest
_masktest:
// Storing a non-zero value in ZMM31 suppresses the kernel bug.
vpbroadcastd %edi, %zmm31
// Put a value in the k1 avx512 register.
kmovd %edi, %k1
// Wait until we're told to return.
loop:
cmpb $0, _done(%rip)
je loop
// Return the value in the k1 register.
kmovd %k1, %eax
ret So, here's the offending code in 10.15.6: if (fpu_allzeroes((uint64_t *)(void *)iavx->x_Hi16_ZMM, 16 * sizeof(_STRUCT_ZMM_REG)) == TRUE &&
fpu_allzeroes((uint64_t *)(void *)iavx->x_ZMM_Hi256, 16 * sizeof(_STRUCT_YMM_REG)) == TRUE) {
iavx->_xh.xstate_bv &= ~XFEM_ZMM;
} From: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/fpu.c#L954 Where: #define XFEM_ZMM (XFEM_ZMM_HI256 | XFEM_HI16_ZMM | XFEM_OPMASK) From: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/proc_reg.h#L187 This change was introduced in Darwin xnu-6153.141.1, which corresponds with MacOS10.15.6. Here's the diff: So, the fix is to probably replace a single line of code in two locations (for the 32 and 64 bit cases): // iavx->_xh.xstate_bv &= ~XFEM_ZMM;
iavx->_xh.xstate_bv &= ~(XFEM_ZMM_HI256 | XFEM_HI16_ZMM); Here: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/fpu.c#L928 Now we just need to get Apple to do it. In the meantime there's a weak-ish workaround... Ensure that there's always non-zero values in some part of the AVX-512 extended ZMM state. |
Weird. When debugging I thought a use of a zmm register might help, so I added:
and it didn't help. Apparently I needed to set zmm1 to a nonzero value to really "use" that register :( I'll update my Apple feedback submission. |
@randall77 Yeah, I went back and checked the original rarely failing unit test case that kicked this all off... It was Avo generated code that only used the In thinking about the kernel bug presented above overnight, it clarified three things:
On reflection, I'd also like to add that the "one line fix" I presented above may not simultaneously resolve this issue while still satisfying the aims of the Darwin kernel engineers in making the revisions that appeared in Catalina 10.15.6. Another possible fix would be to leave the masking code the same, but add an additional check to ensure that the opmask state also has all zeroed bits. Or perhaps the opmask save/restore should be broken out as a separate case (combining the two approaches). In any case, I hope Apple carefully audits all uses of |
Just a quick update on point 3) in the post above. I've run the reproduction C/Asm code here: As:
When run on a Mac Pro with 48 hyperthreads, each of the 250 processes created above is preemptively multitasked onto ~20% of a hyperthread with a large number of kernel scheduler preemptions per second. In over an hour of CPU time, I've seen zero opmask corruptions. So I think it's fair to conclude that Darwin's preemptive multitasking does not use the faulty thread state resetting code spotted above. 😅 |
Change https://golang.org/cl/368994 mentions this issue: |
The implement of darwinSupportsAVX512 has been removed in CL 361255. But the command "go install -buildmode=shared std" failed because the declaration of it is remained in cpu/cpu_gc_x86.go. It should be removed. Update golang/go#49233 For golang/go#49942 Change-Id: I8fa7c61c20457e49414930029b9f026c335aa421 Reviewed-on: https://go-review.googlesource.com/c/sys/+/368994 Reviewed-by: Tobias Klauser <tobias.klauser@gmail.com> Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Gopher Robot <gobot@golang.org> Trust: Ian Lance Taylor <iant@golang.org>
Preemptively disable AVX512 until golang/go#49233 has been resolved. This potentially affects reedsolomon, simdjson, sha256-simd, md5-simd packages. Init order requires a separate package since main itself is initialized last, but imports are initialized in the order they are imported from main (confirmed).
Yesterday's security update to MacOS Catalina appears not to have updated the kernel, and my testing indicates this bug remains.
Likewise, the upgrade to MacOS Monterey 12.1 also does not fix this issue.
I haven't tested the security update for Big Sur, but it would be very surprising if this had been fixed there and only there. |
Apple appears to have fixed this kernel bug in Monterey 12.2, but frustratingly did not backport the fix into the corresponding security fixes for Catalina and Big Sur. So per my testing this morning: Fixed:
Still Broken:
Can someone please verify the above results? Pending confirmation of the Monterey 12.2 fix, we should put together a patch for |
In preparation for re-enabling AVX-512 for fully patched MacOS >= Monterey 12.2, I've done a little research on the best way to detect the kernel version in Golang on Darwin. The Darwin kernel version is not present in the process commpage where the AVX-512 capabilities are advertised. So it appears that the path of least resistance is going to be a call to This function will return func darwinKernelVersionCheck(major, minor, patch int) bool {
var un unix.Utsname
err := unix.Uname(&un)
if err != nil {
return false
}
ts := bytes.Split(bytes.TrimRight(un.Release[:], "\x00"), []byte{'.'})
if len(ts) != 3 {
return false
}
var mmp [3]int
for i := 0; i < 3; i++ {
tmp, err := strconv.Atoi(string(ts[i]))
if err != nil {
return false
}
mmp[i] = tmp
}
return mmp[0] > major || mmp[0] == major && (mmp[1] > minor || mmp[1] == minor && mmp[2] >= patch)
} Re-enabling AVX-512 support for fully patched MacOS Monterey machines should simply involve: Revert all changes in these two revisions: Add the above function if runtime.GOOS == "darwin" {
// Check for minimum darwin version and commpage for AVX512 support. Necessary because:
// https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db01281fc08ae024145e84df927/osfmk/i386/fpu.c#L175-L201
// See also issue: 49233
osSupportsAVX512 = osSupportsAVX && darwinSupportsAVX512() && darwinKernelVersionCheck(21, 3, 0)
} else {
// ... non-darwin case Edited to move the call to |
BTW, if we prefer a version of func darwinKernelVersionCheck(major, minor, patch int) bool {
var un unix.Utsname
err := unix.Uname(&un)
if err != nil {
return false
}
var mmp [3]int
c := 0
Loop:
for _, b := range un.Release {
switch {
case b >= '0' && b <= '9':
mmp[c] = 10*mmp[c] + int(b-'0')
case b == '.':
c++
if c > 2 {
return false
}
case b == 0:
break Loop
default:
return false
}
}
if c != 2 {
return false
}
return mmp[0] > major || mmp[0] == major && (mmp[1] > minor || mmp[1] == minor && mmp[2] >= patch)
} |
I've submitted a fix for this issue here: https://go-review.googlesource.com/c/sys/+/382314 |
Change https://go.dev/cl/560955 mentions this issue: |
Updates #37174 Updates #49233 Fixes #41152 Change-Id: I35b148c8bc132f02dd6a5a6bb48b711fb5c5df9e Reviewed-on: https://go-review.googlesource.com/c/go/+/560955 Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
Change https://go.dev/cl/620256 mentions this issue: |
Would very much like to finally resolve and close out this very old issue. To that end, please see the newest changes to Many thanks for all of your suggestions and attention towards getting this one wrapped up. |
Darwin opmask clobbering bug was fixed in kernel version 21.3.0 as released in MacOS 12.2.0. This commit resolves issue by checking for Darwin AVX512 support via a sysctl call with the addition of a kernel minimum version check. The kernel version check is completed without adding new dependencies to x/sys/cpu. A sysctl call is accomplished by copying a minimal amount of code from x/sys/unix, to retrieve only the needed KERN_OSRELEASE value. This code is structured in the same manner as an existing analogous AIX/PPC64 syscall. The resulting dotted version string value is then parsed for numeric comparison with a dependency free function. All code in this contribution is structured to ease removal of the special darwin/amd64 codepaths when that OS/arch combination is eventually no longer supported by golang. Resolves issue: golang/go#49233, reinstates fix for issue: golang/go#43089 Change-Id: I4755fc8b3865eb6562b0959ecc910e2c46ac6cb4 Reviewed-on: https://go-review.googlesource.com/c/sys/+/620256 Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Carlos Amedee <carlos@golang.org> Reviewed-by: vsivsi@yahoo.com <vsivsi@yahoo.com>
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
In writing AVX-512 assembly functions using the Go assembler (via Avo), I began to notice unexplained intermittent test failures. A lot of painful investigation revealed that golang async preemption does not seem to be properly save/restoring AVX512 state when assembly functions are interrupted. From visual inspection of the
runtime/preempt_amd64.s
source, it also seems likely that AVX/AVX2 state (YMM upper 128 bits) may not be properly save/restored either, but I have not encountered/tested that case in my own code.A complete minimal reproduction illustrating the clobbering of the AVX512 K1 opmask register is here:
https://gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32
Note, in my testing running with
GODEBUG=asyncpreemptoff=1
rescues every failure case I have identified.What did you expect to see?
I expected all relevant processor state to be properly restored in an assembly function following an async preemption.
What did you see instead?
Critical state is being clobbered, leading to intermittent undefined behavior.
The text was updated successfully, but these errors were encountered: