-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: sometimes 100% CPU spin during init phase in Go 1.14 with preemptive scheduler #37741
Comments
How often does this happen? Is it possible to reproduce by trying it some number of times? Does it happen with All imported packages get initialized before In the screenshot you posted, the process name is |
That's a good point about You're probably right... although I haven't seen it happen with |
To clarify, package initialization is done in a deterministic order, as defined in the Go specification. I agree it might be hard to know what |
Yes, but it's pretty opaque in practice :) i.e. it's not trivial to know in a large project the exact order of all instances of |
@dmitshur Do you know of a way I can run a profile that starts before the init phase? Based on your feedback, I'm 99% sure this is a bad init() function or bad |
Starting the profiler early would be hard. Can you run under a debugger, then just ctrl-C when it hangs and see where the code is executing? |
Nevermind! Gopher Slack is awesome. Someone gave me a tip: |
@dmitshur @randall77 The plot thickens, as I can only reproduce the bug in Go 1.14 and not Go 1.13: census-instrumentation/opencensus-go#1200 I suspect this might be a bug in the Go scheduler? One or more goroutines are blocking on Any advice on where to go from here? I wish I could make the reproduction more minimal, but I just haven't figured it out as it seems dependent on importing (and using) these packages (even though the "using" code doesn't even get reached). |
go run
(sometimes)
Also /cc @aclements per owners. |
As per census-instrumentation/opencensus-go#1200 (comment) it seems pretty clear that this is likely a bug in the Go 1.14 preemptive scheduler. When I disable it, the bug does not manifest. |
I cannot reproduce:
Runs 1000 times without fail. |
@randall77 Thanks for trying! That's a pretty old commit, that doesn't add the dependencies which trigger the bug. Please try the linked branch, currently at commit |
Ok, now I can reproduce. Thanks. |
Also reproduces at tip. Seems to be stuck in an endless attempt-to-preempt loop.
It will occasionally fail (and without the The stack trace it prints is simple:
The line number fluctuates from run to run. It seems to always be in a big map initializer, that looks like this:
With over 2000+ lines. Each entry is inserted separately, so there are ~2000 chunks of assembly like this:
@cherryyz, how do I tell whether the code here is marked as preemptable? |
If I look at the places where preempt is attempted (change the
Then lots of that same address repeating, with an occasional one of these:
The common line is a call to the write barrier code, for a check in that big map constructor of the gc enabled bit. That is indeed a non-preemptable location.
I don't see how, if we resumed at 0x1bd0ff8, we'd ever get back to that same address again. As long as we made at least one instruction of progress.
My hunch is now that the thread is somehow not making any progress from |
(haven't read the whole thread, just answer this first)
Compiler's |
This patch seems to fix the bug:
I suspect that we're issuing We should probably rate limit This definitely seems to be a fundamental problem with async preempt + give up and retry... It would be really nice to ask the OS to "interrupt the target thread only after it has been given at least a bit of time slice". |
Attached is a single-file repro. It hangs pretty reliably when run 100 times. |
Here's a pretty simple fix.
Now I just need a decent test. The repro above is fine for one-offs, but it's too expensive for a checked-in test. |
Edit: time -p ./make.bash w./o the fix: w. the fix: ============================================================= // command that hangs
|
Change https://golang.org/cl/223240 mentions this issue: |
@shawn-xdji , could you try the CL above on your example? |
@gopherbot please open a backport issue to 1.14.1. |
Backport issue(s) opened: #37833 (for 1.14). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
Hi @randall77 thanks, build time backs to normal and is even better now, 'sys' is below 80s consistently, compared with 100s or so without the fix. |
Change https://golang.org/cl/223737 mentions this issue: |
Change https://golang.org/cl/223939 mentions this issue: |
…e is a signal pending If multiple threads call preemptone to preempt the same M, it may send many signals to the same M such that it hardly make progress, causing live-lock problem. Only send a signal if there isn't already one pending. Updates #37741. Fixes #37833. Change-Id: Id94adb0b95acbd18b23abe637a8dcd81ab41b452 Reviewed-on: https://go-review.googlesource.com/c/go/+/223737 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> (cherry picked from commit 0c0e8f2) Reviewed-on: https://go-review.googlesource.com/c/go/+/223939 Reviewed-by: Austin Clements <austin@google.com>
…14 in git env As seen in trouble shooting go-gitea#11032 the new feature of Go 1.14 is causing several second delays in startup in certain situations. Debugging shows it spending several seconds handling SIGURG commands during init: ``` 6922:04:51.984234 trace init() ./modules/queue/unique_queue_wrapped.go remote: ) = 69 <0.000012> remote: [pid 15984] 22:04:51 write(1, "\ttime taken: 236.761\302\265s\n\n", 25 time taken: 236.761µs remote: remote: ) = 25 <0.000011> remote: [pid 15984] 22:04:51 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- ``` This causes up to 20 seconds added to a push in some cases as it happens for each call of the gitea hook command. This is likely the cause of go-gitea#10661 as well and would start to effect users once we release 1.12 which would be the first release compiled with Go 1.14. I suspect this is just a slight issue with the upstream implementatation as there have been a few very similar bugs fixed and reported: golang/go#37741 golang/go#37942 We should revisit this in the future and see if a newer version of Go has solved it, but for now disable this option in the environment that gitea hook runs in to avoid it.
…14 in git env (#11237) As seen in trouble shooting #11032 the new feature of Go 1.14 is causing several second delays in startup in certain situations. Debugging shows it spending several seconds handling SIGURG commands during init: ``` 6922:04:51.984234 trace init() ./modules/queue/unique_queue_wrapped.go remote: ) = 69 <0.000012> remote: [pid 15984] 22:04:51 write(1, "\ttime taken: 236.761\302\265s\n\n", 25 time taken: 236.761µs remote: remote: ) = 25 <0.000011> remote: [pid 15984] 22:04:51 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- ``` This causes up to 20 seconds added to a push in some cases as it happens for each call of the gitea hook command. This is likely the cause of #10661 as well and would start to effect users once we release 1.12 which would be the first release compiled with Go 1.14. I suspect this is just a slight issue with the upstream implementatation as there have been a few very similar bugs fixed and reported: golang/go#37741 golang/go#37942 We should revisit this in the future and see if a newer version of Go has solved it, but for now disable this option in the environment that gitea hook runs in to avoid it.
…14 in git env (go-gitea#11237) As seen in trouble shooting go-gitea#11032 the new feature of Go 1.14 is causing several second delays in startup in certain situations. Debugging shows it spending several seconds handling SIGURG commands during init: ``` 6922:04:51.984234 trace init() ./modules/queue/unique_queue_wrapped.go remote: ) = 69 <0.000012> remote: [pid 15984] 22:04:51 write(1, "\ttime taken: 236.761\302\265s\n\n", 25 time taken: 236.761µs remote: remote: ) = 25 <0.000011> remote: [pid 15984] 22:04:51 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- remote: [pid 15984] 22:04:52 --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=15984, si_uid=0} --- ``` This causes up to 20 seconds added to a push in some cases as it happens for each call of the gitea hook command. This is likely the cause of go-gitea#10661 as well and would start to effect users once we release 1.12 which would be the first release compiled with Go 1.14. I suspect this is just a slight issue with the upstream implementatation as there have been a few very similar bugs fixed and reported: golang/go#37741 golang/go#37942 We should revisit this in the future and see if a newer version of Go has solved it, but for now disable this option in the environment that gitea hook runs in to avoid it.
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Ran
go run main.go
, in my case, specifically in this folder.What did you expect to see?
The program should run.
What did you see instead?
Sometimes, the CPU spins at 100% and the entry point of the program is
never reachedreached after anywhere from 5-50s (but there are probably no hard bounds on this, it's just what I've observed).This is part of my typical development cycle where I write some code, then run the program, write more code, run the program, etc. Sometimes it happens, sometimes it doesn't. Even without changes, re-running the command will work. I can't pin it down, unfortunately, nor make a simpler test case.
I verified that the entry point
main()
is never reached, by changingmain()
so it is now:and can verify that "Got here" is never printed. Instead, the CPU spins:
In the past, @whitestrake has reported a similar phenomenon independently, but I only started noticing it recently myself. So, I know it's not isolated to just my machine.
Let me know what more would be useful to debug this. Thanks!
The text was updated successfully, but these errors were encountered: