-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net,runtime: apparent deadlock in (*net.conn).Close and runtime.netpollblock on arm64 platforms #48925
Comments
Given that this appears to affect all arm64 platforms, it smells like a runtime or compiler bug to me. Notably, all of these stack traces seem to feature goroutines blocked in @jeremyfaller, could you suggest someone to route it to? |
Hmm. I feel like @cherrymui, @prattmic, or @dr2chase might be able to offer insights. |
If I recall, arm64 needs a few more memory fences than amd64. |
The That leaves:
|
Some notes:
In https://build.golang.org/log/93f6f9bde154a60a2dcad7c9a3e7b66f4bf2f41a, we have 2 goroutines using the same stuck FD:
The fact that the second goroutine is still blocked in |
I reproduced this on linux-arm64-aws quite quickly, so I'll dig in some more:
|
More logging:
My read of this is that The synchronization between |
One theory: Inside However, when consuming a ready event, In Thus I posit that an earlier I haven't validated this theory, but my log does show that this |
On Wed, Oct 13, 2021 at 2:31 PM Michael Pratt ***@***.***> wrote:
One theory:
Inside netpollblock and netpollunblock, gpp is read using a non-atomic
read. In the normal case, writes use atomics, creating a barrier.
However, when consuming a ready event, netpollblock resets gpp using a
non-atomic store
<https://cs.opensource.google/go/go/+/master:src/runtime/netpoll.go;l=428-430;drc=master>,
thus skipping a barrier.
In netpollunblock, we can return early if gpp == pdReady
<https://cs.opensource.google/go/go/+/master:src/runtime/netpoll.go;l=462-464;drc=master>,
not actually performing an unblock.
Thus I posit that an earlier netpollblock sees pdReady and resets to 0.
Later, on another core we read gpp and see the stale pdReady and
incorrectly return early without triggering a new unblock.
I haven't validated this theory, but my log does show that this pollDesc
has earlier calls to netpollblock, so it is possible. If this is the
case, I believe the fix is simply to use a atomic store to reset gpp to 0.
Well spotted. When netpollblock and netpollunblock were introduced, in
https://golang.org/cl/7569043, they were always called with the pd locked.
That changed in https://golang.org/cl/45700043. The early returns for
pdReady in netpollblock and netpollunblock suggests that both functions
need to use atomic loads and stores. At least, I don't see why they can
get away without using them.
Ian
|
Does this only affect arm64? |
It's a fair question why this would affect arm64 and not arm32. |
Using all atomic loads and stores does make this issue disappear, so I think that's it. Though I also can't explain why we haven't seen this on other architectures, especially arm32. |
Are aligned 32-bit stores always atomic (FSVO “atomic”) on arm32? Maybe the failure mode happens from a torn 64-bit |
fwiw, I think bash all.bash on my arm64 Chromebook fails on the same tests... I could probably test a possible fix... |
Change https://golang.org/cl/355952 mentions this issue: |
trying just this code change on top of a tip checkout doesn't fix tests failing on my Chromebook arm64. ofc might be unrelated issue(s)... I didn't give it enough time yet to fully isolate if this is fully the bug that makes the tests fail, but seeing the kind of failing it seems likely to me (mostly timeouts of tests, which seem like possible deadlocks to me, and the place where it failed now suspiciously looks like code touched by that change: FAIL runtime 541.176s, the stacktrace touching poll) Also I have a weak suspicion failing tests are different at times? I will try to keep a closer eye on it from now on xD if necessary I can probably paste some stack traces etc... |
@LaPingvino give the new version a try, though it sounds like you may be having a different issue. Logs would definitely help. @gopherbot please backport to 1.17 and 1.16. This is a pre-existing issue with no workaround, and causing flakes on trybots. |
Backport issue(s) opened: #49009 (for 1.16), #49010 (for 1.17). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
@prattmic it does seem to have fixed the tests failing on my debian arm64 in crostini (chromebook). runtime took a long time to test (431.701s) but it didn't fail this time. Another test fails atm but I don't think that one is related? now it's a cmd/go test that fails. |
Change https://golang.org/cl/356369 mentions this issue: |
Change https://golang.org/cl/356370 mentions this issue: |
…with atomics Both netpollblock and netpollunblock read gpp using a non-atomic load. When consuming a ready event, netpollblock clears gpp using a non-atomic store, thus skipping a barrier. Thus on systems with weak memory ordering, a sequence like so this is possible: T1 T2 1. netpollblock: read gpp -> pdReady 2. netpollblock: store gpp -> 0 3. netpollunblock: read gpp -> pdReady 4. netpollunblock: return i.e., without a happens-before edge between (2) and (3), netpollunblock may read the stale value of gpp. Switch these access to use atomic loads and stores in order to create these edges. For ease of future maintainance, I've simply changed rg and wg to always be accessed atomically, though I don't believe pollOpen or pollClose require atomics today. For #48925 Fixes #49009 Change-Id: I903ea667eea320277610b4f969129935731520c3 Reviewed-on: https://go-review.googlesource.com/c/go/+/355952 Trust: Michael Pratt <mpratt@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: David Chase <drchase@google.com> (cherry picked from commit 1b072b3) Reviewed-on: https://go-review.googlesource.com/c/go/+/356370
…with atomics Both netpollblock and netpollunblock read gpp using a non-atomic load. When consuming a ready event, netpollblock clears gpp using a non-atomic store, thus skipping a barrier. Thus on systems with weak memory ordering, a sequence like so this is possible: T1 T2 1. netpollblock: read gpp -> pdReady 2. netpollblock: store gpp -> 0 3. netpollunblock: read gpp -> pdReady 4. netpollunblock: return i.e., without a happens-before edge between (2) and (3), netpollunblock may read the stale value of gpp. Switch these access to use atomic loads and stores in order to create these edges. For ease of future maintainance, I've simply changed rg and wg to always be accessed atomically, though I don't believe pollOpen or pollClose require atomics today. For #48925 Fixes #49010 Change-Id: I903ea667eea320277610b4f969129935731520c3 Reviewed-on: https://go-review.googlesource.com/c/go/+/355952 Trust: Michael Pratt <mpratt@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: David Chase <drchase@google.com> (cherry picked from commit 1b072b3) Reviewed-on: https://go-review.googlesource.com/c/go/+/356369
greplogs --dashboard -md -l -e '(?m).*\[semacquire, \d+ minutes\]:\n(.+\n\t.+\n)*net\.\(\*conn\)\.Close'
2021-10-08T16:26:20-99c1b24/windows-arm64-10
2021-10-05T15:58:05-f1f626d/linux-arm64-aws
2021-09-30T18:10:18-e81a3d9-205640e/linux-arm64-packet
2021-09-29T18:36:18-e48cf0d/android-arm64-corellium
2021-09-28T18:58:50-7d9f5e0-9c43872/windows-arm64-10
2021-09-28T17:19:08-1c6e50a/android-arm64-corellium
2021-09-27T22:22:04-52b23a5/android-arm64-corellium
2021-09-27T19:42:34-4e4d966-315cec2/linux-arm64-packet
2021-09-24T18:21:14-f6b5ffb/android-arm64-corellium
2021-09-22T20:53:48-41bb744/linux-arm64-packet
2021-09-18T06:04:41-c084706-c894b44/windows-arm64-10
2021-09-17T18:33:15-70493b3/ios-arm64-corellium
2021-09-16T23:13:58-e09dcc2/android-arm64-corellium
2021-09-09T21:28:56-c163c31-c981874/linux-arm64-aws
2021-09-07T23:27:08-3fff213/windows-arm64-10
2021-09-03T01:38:54-9f69a44/ios-arm64-corellium
2021-09-02T19:36:22-a8aa6cf/windows-arm64-10
2021-09-02T19:16:19-9633195/ios-arm64-corellium
2021-09-02T19:16:19-9633195/linux-arm64-packet
2021-09-02T16:59:57-2a463a2/android-arm64-corellium
2021-09-02T14:02:42-b8420ba/ios-arm64-corellium
2021-09-01T22:28:21-0bfd6fc/android-arm64-corellium
2021-08-28T04:46:13-0108177/windows-arm64-10
2021-08-24T15:55:38-60bc85c-d70c69d/windows-arm64-10
2021-08-23T12:26:30-c7e354d/linux-arm64-packet
2021-08-21T00:55:22-e17439e/windows-arm64-10
2021-08-17T15:00:04-3001b0a/windows-arm64-10
2021-08-05T20:22:31-fd45e26/android-arm64-corellium
2021-07-31T23:59:40-c6fcb2d-b8ca6e5/windows-arm64-10
2021-07-26T22:15:24-bfbb288/ios-arm64-corellium
2021-07-20T19:54:36-d568e6e/ios-arm64-corellium
2021-07-16T02:43:48-aa4e0f5/linux-arm64-aws
2021-06-30T01:29:49-c45e800/windows-arm64-10
2021-05-04T14:38:36-5e4f9b0/linux-arm64-aws
2021-05-03T16:23:09-7b768d4/linux-arm64-aws
2021-04-29T17:08:05-e03cca6/linux-arm64-aws
2021-04-29T11:37:05-c4c68fb/linux-arm64-aws
2021-04-22T17:47:59-537cde0/linux-arm64-aws
2021-04-16T16:36:57-2f0e5bf/linux-arm64-packet
2021-04-14T03:15:00-2d4ba26/linux-arm64-packet
2021-04-13T21:13:09-e512bc2/ios-arm64-corellium
2021-04-09T18:49:05-c3faff7/linux-arm64-aws
2021-04-09T18:49:05-c3faff7/linux-arm64-packet
2021-04-09T14:33:44-a690a5d/android-arm64-corellium
2021-04-08T18:47:17-96a6745/linux-arm64-packet
2021-03-29T21:50:16-4e1bf8e/linux-arm64-packet
2021-03-15T23:47:56-661f3f1/linux-arm64-aws
2021-03-12T21:52:25-5ea612d-f39c4de/linux-arm64-aws
2021-03-05T22:14:48-a829114/linux-arm64-aws
2021-01-25T22:54:28-e6b6d10/android-arm64-corellium
2020-12-16T16:40:57-5abda26/android-arm64-corellium
2020-12-10T23:11:25-1fe891a/ios-arm64-corellium
2020-10-15T18:07:26-3c9488e/linux-arm64-aws
2020-09-10T14:44:25-8098dbb/linux-arm64-aws
The text was updated successfully, but these errors were encountered: