Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: hangs in TestGdbBacktrace on linux #37405

Closed
bcmills opened this issue Feb 24, 2020 · 24 comments
Closed

runtime: hangs in TestGdbBacktrace on linux #37405

bcmills opened this issue Feb 24, 2020 · 24 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-Linux release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Feb 24, 2020

2020-02-22T04:31:41-059a5ac/linux-mips64le-mengzhuo

goroutine 23401 [syscall, 11 minutes]:
syscall.Syscall6(0x1475, 0x1, 0x3b6f, 0xc000cad968, 0x1000004, 0x0, 0x0, 0x120153878, 0xc000cad960, 0x120080418)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/syscall/asm_linux_mips64x.s:40 +0x10 fp=0xc000cad910 sp=0xc000cad908 pc=0x1200cc678
os.(*Process).blockUntilWaitable(0xc00001ab70, 0x0, 0x1200103ec, 0x0)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/wait_waitid.go:31 +0x88 fp=0xc000cad9f8 sp=0xc000cad910 pc=0x1200e49e8
os.(*Process).wait(0xc00001ab70, 0x1202d17d0, 0x1202d17d8, 0x1202d17c8)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec_unix.go:22 +0x4c fp=0xc000cada68 sp=0xc000cad9f8 pc=0x1200df43c
os.(*Process).Wait(...)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec.go:125
os/exec.(*Cmd).Wait(0xc000ad8f20, 0x0, 0x0)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:502 +0x68 fp=0xc000cadad8 sp=0xc000cada68 pc=0x120153e18
os/exec.(*Cmd).Run(0xc000ad8f20, 0xc00009aff0, 0xc000ad8f20)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:340 +0x74 fp=0xc000cadaf8 sp=0xc000cadad8 pc=0x12015340c
os/exec.(*Cmd).CombinedOutput(0xc000ad8f20, 0x3, 0xc000cade78, 0xf, 0xf, 0xc000ad8f20)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:562 +0xbc fp=0xc000cadb20 sp=0xc000cadaf8 pc=0x1201541dc
runtime_test.TestGdbBacktrace(0xc00022a5a0)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/runtime/runtime-gdb_test.go:388 +0x6c4 fp=0xc000cadf80 sp=0xc000cadb20 pc=0x120202e64
testing.tRunner(0xc00022a5a0, 0x1202d2e00)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/testing/testing.go:992 +0xf8 fp=0xc000cadfc8 sp=0xc000cadf80 pc=0x12010a978
runtime.goexit()
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/runtime/asm_mips64x.s:646 +0x4 fp=0xc000cadfc8 sp=0xc000cadfc8 pc=0x120084354
created by testing.(*T).Run
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/testing/testing.go:1043 +0x378

CC @dr2chase @aclements @mengzhuo

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 24, 2020
@bcmills bcmills added this to the Backlog milestone Feb 24, 2020
@mengzhuo
Copy link
Contributor

It shows that runtime test failed due to timeout.

FAIL	runtime	720.011s
FAIL

Maybe we should dump ps ef and system load after bots failed?

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/227811 mentions this issue: runtime: only shrink stack while preempting

@bcmills
Copy link
Contributor Author

bcmills commented Jan 25, 2021

This turns out not to be specific to the mips64le builder. See also #39228 (occasional failures instead of hangs).

2021-01-23T19:46:06-9897655/linux-amd64-sid
2020-11-02T03:03:16-0387bed/linux-386-softfloat
2020-06-25T12:02:38-334752d/linux-amd64-staticlockranking

@bcmills bcmills changed the title runtime: hang in TestGdbBacktrace on linux-mips64le-mengzhuo builder runtime: hangs in TestGdbBacktrace on linux Jan 25, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Oct 14, 2021

greplogs --dashboard -md -l -e '(?m)panic: test timed out(?:.|\n)*os/exec\...Cmd.*\n.*\nruntime_test\.TestGdbBacktrace' --since=2021-07-20

2021-10-13T15:11:16-0454d73/linux-s390x-ibm
2021-10-12T18:52:28-26b6833/linux-386-softfloat
2021-09-27T18:57:20-3d795ea/linux-amd64
2021-08-26T17:19:04-166b691/linux-386-softfloat
2021-08-12T19:06:51-46fd547/linux-amd64-noregabi

@bcmills
Copy link
Contributor Author

bcmills commented Dec 8, 2021

greplogs --dashboard -md -l -e '(?m)panic: test timed out(?:.|\n)*os/exec\...Cmd.*\n.*\nruntime_test\.TestGdbBacktrace' --since=2021-10-15

2021-12-08T04:14:00-a19e72c/linux-amd64-buster
2021-11-16T17:13:33-29ec902/linux-386-softfloat
2021-11-13T00:49:51-530e320/solaris-amd64-oraclerel
2021-10-25T20:08:25-252324e/linux-386-stretch

@bcmills
Copy link
Contributor Author

bcmills commented Dec 8, 2021

Marking as release-blocker for Go 1.18 due to the regularity of failures, and especially the variety of builders on which failures have been observed. The large number of affected platforms suggests either a problem in the test (which we should fix) or a deep-rooted bug in gdb itself (for which we should either add a workaround or remove the infeasible test).

Note that many of the failures are on linux/386 and linux/amd64, which are first-class ports.

(CC @jeremyfaller)

@bcmills bcmills added okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 release-blocker labels Dec 8, 2021
@bcmills bcmills modified the milestones: Backlog, Go1.18 Dec 8, 2021
@aclements
Copy link
Member

We could implement our own timeout in TestGdbBacktrace so it can fail cleanly and print the output it has so far from GDB.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/370703 mentions this issue: runtime: run gdb with a timeout for TestGdbBacktrace

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/370701 mentions this issue: testenv: abstract run-with-timeout into testenv

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/370702 mentions this issue: testenv: kill subprocess if SIGQUIT doesn't do it

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/370665 mentions this issue: runtime: use testenv.RunWithTimeout liberally

gopherbot pushed a commit that referenced this issue Dec 12, 2021
This lifts the logic to run a subcommand with a timeout in a test from
the runtime's runTestProg into testenv. The implementation is
unchanged in this CL. We'll improve it in a future CL.

Currently, tests that run subcommands usually just timeout with no
useful output if the subcommand runs for too long. This is a step
toward improving this.

For #37405.

Change-Id: I2298770db516e216379c4c438e05d23cbbdda51d
Reviewed-on: https://go-review.googlesource.com/c/go/+/370701
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
gopherbot pushed a commit that referenced this issue Dec 12, 2021
This makes testenv.RunWithTimeout first attempt to SIGQUIT the
subprocess to get a useful Go traceback, but if that doesn't work, it
sends a SIGKILL instead to make sure we tear down the subprocess. This
is potentially important for non-Go subprocesses.

For #37405.

Change-Id: I9e7e118dc5769ec3f45288a71658733bff30c9cd
Reviewed-on: https://go-review.googlesource.com/c/go/+/370702
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Dec 12, 2021
This sometimes times out and we don't have any useful output for
debugging it. Hopefully this will help.

For #37405.

Change-Id: I79074e6fbb9bd16a864c651109a0acbfc8aa6cef
Reviewed-on: https://go-review.googlesource.com/c/go/+/370703
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@cherrymui cherrymui removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 14, 2021
@aclements
Copy link
Member

Hmm. TestGdbBacktrace hasn't failed since I landed the changes to add a clean timeout. greplogs -dashboard -md -l -e "TestGdbBacktrace" -since=2021-12-09 returns nothing.

@aclements
Copy link
Member

Still no more failures. I'm not sure what to make of this. I'm pretty sure the timeout code I added works because I did an experiment where I changed the test to run gdb on "sleep 10" and changed the timeout to 1 second and it did in fact kill gdb.

@bcmills
Copy link
Contributor Author

bcmills commented Jan 11, 2022

Since we've at least made tangible progress on diagnosing the problem during the 1.18 cycle, I think it would be ok to move this back to the Backlog milestone and/or mark it WaitingForInfo while we wait for another repro.

It's unfortunate but not terribly surprising for flaky tests not to reproduce as often during the code freeze, because the rate of test runs (especially for fast and/or scalable builders) tends to be much higher during the active development window.

@bcmills
Copy link
Contributor Author

bcmills commented Feb 3, 2022

Still no more failures. I'm not sure what to make of this. I'm pretty sure the timeout code I added works

It does indeed work! We have a hit today:

greplogs --dashboard -md -e 'FAIL: TestGdbBacktrace (?:.*\n .*)* signal: killed' --since=2021-12-10

2022-02-03T17:24:54-7f9494c/linux-386-longtest:

--- FAIL: TestGdbBacktrace (450.59s)
    runtime-gdb_test.go:76: gdb version 7.12
    testenv.go:361: [gdb -nx -batch -iex add-auto-load-safe-path /workdir/go/src/runtime -ex set startup-with-shell off -ex break main.eee -ex run -ex backtrace -ex continue /workdir/tmp/TestGdbBacktrace1861977496/001/a.exe] exit status: signal: killed
    runtime-gdb_test.go:428: gdb output:
        Loading Go Runtime support.
        Breakpoint 1 at 0x809bac0: file /workdir/tmp/TestGdbBacktrace1861977496/001/main.go, line 17.
        [New LWP 74304]
        [New LWP 74309]
        [New LWP 74310]
        
        Thread 1 "a.exe" hit Breakpoint 1, main.eee (~r0=<optimized out>) at /workdir/tmp/TestGdbBacktrace1861977496/001/main.go:17
        17	func eee() bool { return true }
        #0  main.eee (~r0=<optimized out>) at /workdir/tmp/TestGdbBacktrace1861977496/001/main.go:17
        #1  0x0809ba9f in main.ddd (~r0=<optimized out>) at /workdir/tmp/TestGdbBacktrace1861977496/001/main.go:14
        #2  0x0809ba6a in main.ccc (~r0=<optimized out>) at /workdir/tmp/TestGdbBacktrace1861977496/001/main.go:11
        #3  0x0809ba3a in main.bbb (~r0=<optimized out>) at /workdir/tmp/TestGdbBacktrace1861977496/001/main.go:8
        #4  0x0809ba0a in main.aaa (~r0=<optimized out>) at /workdir/tmp/TestGdbBacktrace1861977496/001/main.go:5
        #5  0x0809bafa in main.main () at /workdir/tmp/TestGdbBacktrace1861977496/001/main.go:22
        [LWP 74276 exited]
        [LWP 74310 exited]
        [LWP 74309 exited]
        [Inferior 1 (process 74276) exited normally]
    runtime-gdb_test.go:430: gdb exited with error: signal: killed
FAIL
FAIL	runtime

@bcmills
Copy link
Contributor Author

bcmills commented Mar 17, 2022

greplogs --dashboard -md -l -e 'FAIL: TestGdbBacktrace (?:.*\n .*)* signal: killed' --since=2022-02-04

2022-03-16T05:32:52-d34287a/linux-riscv64-unmatched

--- FAIL: TestGdbBacktrace (363.09s)
    runtime-gdb_test.go:76: gdb version 10.1
    testenv.go:361: [gdb -nx -batch -iex add-auto-load-safe-path /tmp/workdir-host-linux-riscv64-unmatched/go/src/runtime -ex set startup-with-shell off -ex break main.eee -ex run -ex backtrace -ex continue /tmp/workdir-host-linux-riscv64-unmatched/tmp/TestGdbBacktrace506085885/001/a.exe] exit status: signal: killed
    runtime-gdb_test.go:428: gdb output:
        Loading Go Runtime support.
        Breakpoint 1 at 0x65ba8: file /tmp/workdir-host-linux-riscv64-unmatched/tmp/TestGdbBacktrace506085885/001/main.go, line 17.
        [New LWP 1173854]
        [New LWP 1173855]
        [New LWP 1173858]
        
        Thread 1 "a.exe" hit Breakpoint 1, 0x0000000000065ba8 in main.eee (~r0=<optimized out>) at /tmp/workdir-host-linux-riscv64-unmatched/tmp/TestGdbBacktrace506085885/001/main.go:17
        17	func eee() bool { return true }
        #0  0x0000000000065ba8 in main.eee (~r0=<optimized out>) at /tmp/workdir-host-linux-riscv64-unmatched/tmp/TestGdbBacktrace506085885/001/main.go:17
        #1  0x0000000000065b88 in main.ddd (~r0=<optimized out>) at /tmp/workdir-host-linux-riscv64-unmatched/tmp/TestGdbBacktrace506085885/001/main.go:14
        #2  0x0000000000065b3c in main.ccc (~r0=<optimized out>)
        #3  0x0000000000065afc in main.bbb (~r0=<optimized out>)
        #4  0x0000000000065abc in main.aaa (~r0=<optimized out>)
        #5  0x0000000000065bcc in main.main () at /tmp/workdir-host-linux-riscv64-unmatched/tmp/TestGdbBacktrace506085885/001/main.go:22
        [LWP 1173855 exited]
        [LWP 1173854 exited]
        [LWP 1173835 exited]
        [Inferior 1 (process 1173835) exited normally]
    runtime-gdb_test.go:438: gdb exited with error: signal: killed
FAIL
FAIL	runtime	446.758s

(Note that the Feb. 3 failure was on linux/386, which is a first-class port.)

@bcmills bcmills modified the milestones: Backlog, Go1.19 Mar 17, 2022
@heschi heschi added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label May 11, 2022
@aclements
Copy link
Member

Given the "[Inferior 1 (process 1173835) exited normally]" at the end of the GDB output, this is either a bug in GDB where it doesn't properly exit, or the test is somehow missing the fact that GDB is exiting. I think the "gdb exited with error: signal: killed" indicates that the GDB process was still around to be killed, but I'm not entirely sure what happens if you send a signal to a zombie.

If this is a GDB bug, that's unfortunate. We could work around it by looking at the GDB output as its running and killing it if it looks complete enough, or by just using a short timeout and accepting correct output even if it timed out.

@aclements
Copy link
Member

Digging around a bit, it looks like signaling a zombie process will not change its exit status, so this is GDB failing to exit when its inferior exits.

@bcmills
Copy link
Contributor Author

bcmills commented May 16, 2022

Another one with a very similar failure mode: GDB logs Inferior 1 (…) exited normally and then hangs.

greplogs --dashboard -md -l -e 'FAIL: TestGdbBacktrace (?:.*\n .*)* signal: killed' --since=2022-03-17
2022-05-13T16:02:27-0f715f1/linux-386-sid

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/411117 mentions this issue: runtime: skip TestGdbBacktrace on gdb bug

Repository owner moved this from Todo to Done in Go Compiler / Runtime Jun 8, 2022
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/445596 mentions this issue: internal/testenv: add CommandContext in place of RunWithTimeout

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/445597 mentions this issue: internal/testenv: remove RunWithTimout

gopherbot pushed a commit that referenced this issue Oct 31, 2022
For most tests, the test's deadline itself is more appropriate than an
arbitrary timeout layered atop of it (especially once #48157 is
implemented), and testenv.Command already adds cleaner timeout
behavior when a command would run too close to the test's deadline.

That makes RunWithTimeout something of an attractive nuisance. For
now, migrate the two existing uses of it to testenv.CommandContext,
with a shorter timeout implemented using context.WithTimeout.

As a followup, we may want to drop the extra timeouts from these
invocations entirely.

Updates #50436.
Updates #37405.

Change-Id: I16840fd36c0137b6da87ec54012b3e44661f0d08
Reviewed-on: https://go-review.googlesource.com/c/go/+/445597
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/447495 mentions this issue: runtime: eliminate arbitrary timeouts in runBuiltTestProg and TestGdbBacktrace

romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
For most tests, the test's deadline itself is more appropriate than an
arbitrary timeout layered atop of it (especially once golang#48157 is
implemented), and testenv.Command already adds cleaner timeout
behavior when a command would run too close to the test's deadline.

That makes RunWithTimeout something of an attractive nuisance. For
now, migrate the two existing uses of it to testenv.CommandContext,
with a shorter timeout implemented using context.WithTimeout.

As a followup, we may want to drop the extra timeouts from these
invocations entirely.

Updates golang#50436.
Updates golang#37405.

Change-Id: I16840fd36c0137b6da87ec54012b3e44661f0d08
Reviewed-on: https://go-review.googlesource.com/c/go/+/445597
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Nov 4, 2022
…Backtrace

This may fix the TestEINTR failures that have been frequent on the
riscv64 builders since CL 445597.

Updates #37405.
Updates #39043.

Change-Id: Iaf1403ff5ce2ff0203d5d0059908097d32d0b217
Reviewed-on: https://go-review.googlesource.com/c/go/+/447495
Auto-Submit: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
@golang golang locked and limited conversation to collaborators Nov 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-Linux release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants