runtime: hangs in TestGdbBacktrace on linux #37405
Comments
It shows that runtime test failed due to timeout.
Maybe we should dump |
Change https://golang.org/cl/227811 mentions this issue: |
This turns out not to be specific to the linux-mips64le-mengzhuo builder:
2021-01-23T19:46:06-9897655/linux-amd64-sid
2021-07-19T13:27:46-49402be/linux-amd64-staticlockranking |
2021-10-13T15:11:16-0454d73/linux-s390x-ibm |
2021-12-08T04:14:00-a19e72c/linux-amd64-buster |
Marking as release-blocker for Go 1.18 due to the regularity of failures, and especially the variety of builders on which failures have been observed. The large number of affected platforms suggests either a problem in the test (which we should fix) or a deep-rooted bug in Note that many of the failures are on (CC @jeremyfaller) |
We could implement our own timeout in TestGdbBacktrace so it can fail cleanly and print the output it has so far from GDB. |
Change https://golang.org/cl/370703 mentions this issue: |
Change https://golang.org/cl/370701 mentions this issue: |
Change https://golang.org/cl/370702 mentions this issue: |
Change https://golang.org/cl/370665 mentions this issue: |
This lifts the logic to run a subcommand with a timeout in a test from the runtime's runTestProg into testenv. The implementation is unchanged in this CL. We'll improve it in a future CL. Currently, tests that run subcommands usually just timeout with no useful output if the subcommand runs for too long. This is a step toward improving this. For #37405. Change-Id: I2298770db516e216379c4c438e05d23cbbdda51d Reviewed-on: https://go-review.googlesource.com/c/go/+/370701 Trust: Austin Clements <austin@google.com> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Bryan Mills <bcmills@google.com>
This makes testenv.RunWithTimeout first attempt to SIGQUIT the subprocess to get a useful Go traceback, but if that doesn't work, it sends a SIGKILL instead to make sure we tear down the subprocess. This is potentially important for non-Go subprocesses. For #37405. Change-Id: I9e7e118dc5769ec3f45288a71658733bff30c9cd Reviewed-on: https://go-review.googlesource.com/c/go/+/370702 Trust: Austin Clements <austin@google.com> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>
This sometimes times out and we don't have any useful output for debugging it. Hopefully this will help. For #37405. Change-Id: I79074e6fbb9bd16a864c651109a0acbfc8aa6cef Reviewed-on: https://go-review.googlesource.com/c/go/+/370703 Trust: Austin Clements <austin@google.com> Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Hmm. TestGdbBacktrace hasn't failed since I landed the changes to add a clean timeout. |
Still no more failures. I'm not sure what to make of this. I'm pretty sure the timeout code I added works because I did an experiment where I changed the test to run gdb on "sleep 10" and changed the timeout to 1 second and it did in fact kill gdb. |
Since we've at least made tangible progress on diagnosing the problem during the 1.18 cycle, I think it would be ok to move this back to the Backlog milestone and/or mark it WaitingForInfo while we wait for another repro. It's unfortunate but not terribly surprising for flaky tests not to reproduce as often during the code freeze, because the rate of test runs (especially for fast and/or scalable builders) tends to be much higher during the active development window. |
It does indeed work! We have a hit today:
2022-02-03T17:24:54-7f9494c/linux-386-longtest:
|
2022-03-16T05:32:52-d34287a/linux-riscv64-unmatched
(Note that the Feb. 3 failure was on |
Given the "[Inferior 1 (process 1173835) exited normally]" at the end of the GDB output, this is either a bug in GDB where it doesn't properly exit, or the test is somehow missing the fact that GDB is exiting. I think the "gdb exited with error: signal: killed" indicates that the GDB process was still around to be killed, but I'm not entirely sure what happens if you send a signal to a zombie. If this is a GDB bug, that's unfortunate. We could work around it by looking at the GDB output as it's running and killing it if it looks complete enough, or by just using a short timeout and accepting correct output even if it timed out. |
Digging around a bit, it looks like signaling a zombie process will not change its exit status, so this is GDB failing to exit when its inferior exits. |
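One way to picture the second workaround suggested above (accept correct output even if GDB had to be killed) is the small check below. It is only a sketch: `checkBacktrace`, the frame names, and the sample output are hypothetical, and this is not necessarily what the runtime test ended up doing. The "exited normally]" marker is taken from the GDB output quoted in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// checkBacktrace is a hypothetical helper. It accepts GDB output that
// contains every expected frame, even when GDB itself had to be killed,
// provided GDB reported that its inferior exited normally.
func checkBacktrace(output string, timedOut bool, wantFrames []string) (ok bool, note string) {
	for _, frame := range wantFrames {
		if !strings.Contains(output, frame) {
			return false, "missing frame " + frame
		}
	}
	if !timedOut {
		return true, ""
	}
	if strings.Contains(output, "exited normally]") {
		return true, "gdb did not exit after its inferior exited"
	}
	return false, "timed out before the inferior exited"
}

func main() {
	// Hypothetical GDB output for illustration only.
	output := "#0  main.bbb\n#1  main.aaa\n" +
		"[Inferior 1 (process 1173835) exited normally]\n"
	ok, note := checkBacktrace(output, true, []string{"main.aaa", "main.bbb"})
	fmt.Println(ok, note)
}
```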
Another one with a very similar failure mode: GDB logs
|
Change https://go.dev/cl/411117 mentions this issue: |
Change https://go.dev/cl/445596 mentions this issue: |
Change https://go.dev/cl/445597 mentions this issue: |
For most tests, the test's deadline itself is more appropriate than an arbitrary timeout layered atop of it (especially once #48157 is implemented), and testenv.Command already adds cleaner timeout behavior when a command would run too close to the test's deadline. That makes RunWithTimeout something of an attractive nuisance. For now, migrate the two existing uses of it to testenv.CommandContext, with a shorter timeout implemented using context.WithTimeout. As a followup, we may want to drop the extra timeouts from these invocations entirely. Updates #50436. Updates #37405. Change-Id: I16840fd36c0137b6da87ec54012b3e44661f0d08 Reviewed-on: https://go-review.googlesource.com/c/go/+/445597 Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Austin Clements <austin@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Change https://go.dev/cl/447495 mentions this issue: |
…Backtrace This may fix the TestEINTR failures that have been frequent on the riscv64 builders since CL 445597. Updates #37405. Updates #39043. Change-Id: Iaf1403ff5ce2ff0203d5d0059908097d32d0b217 Reviewed-on: https://go-review.googlesource.com/c/go/+/447495 Auto-Submit: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Austin Clements <austin@google.com> Run-TryBot: Bryan Mills <bcmills@google.com>
2020-02-22T04:31:41-059a5ac/linux-mips64le-mengzhuo
CC @dr2chase @aclements @mengzhuo