-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: TestGdbPython flaky on linux #24616
Comments
@hyangah, is this similar to the bug you fixed recently about getting the state of goroutines? |
@aclements do you mean https://go-review.googlesource.com/c/go/+/49691? If when the test passes the gdb output should look like the following --- PASS: TestGdbPython (0.44s) runtime-gdb_test.go:59: gdb version 7.7 runtime-gdb_test.go:193: gdb output: Loading Go Runtime support. Loaded Script Yes /tmp/workdir/go/src/runtime/runtime-gdb.py Breakpoint 1 at 0x47c1c0: file /tmp/workdir/go/src/fmt/print.go, line 263. Breakpoint 1, fmt.Println (a=..., err=..., n=) at /tmp/workdir/go/src/fmt/print.go:263 263 func Println(a ...interface{}) (n int, err error) { BEGIN info goroutines * 1 running runtime.systemstack_switch 2 waiting runtime.gopark 17 waiting runtime.gopark 33 runnable runtime.runfinq END #1 0x00000000004828d0 in main.main () at /tmp/go-build668958384/main.go:14 14 fmt.Println("hi") BEGIN print mapvar $1 = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"} END BEGIN print strvar $2 = "abc" END BEGIN info locals mapvar = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"} slicevar = []string = {"def"} strvar = "abc" END #0 fmt.Println (a=..., err=..., n=) at /tmp/workdir/go/src/fmt/print.go:263 263 func Println(a ...interface{}) (n int, err error) { BEGIN goroutine 1 bt #0 fmt.Println (a=..., err=..., n=) at /tmp/workdir/go/src/fmt/print.go:263 #1 0x00000000004828d0 in main.main () at /tmp/go-build668958384/main.go:14 END BEGIN goroutine 2 bt #0 runtime.gopark (lock=0x528af0 , reason="force gc (idle)", traceEv=20 '\024', traceskip=1, unlockf=) at /tmp/workdir/go/src/runtime/proc.go:292 #1 0x0000000000428c5e in runtime.goparkunlock (lock=, reason=..., traceEv=, traceskip=) at /tmp/workdir/go/src/runtime/proc.go:297 #2 0x00000000004289ea in runtime.forcegchelper () at /tmp/workdir/go/src/runtime/proc.go:248 #3 0x000000000044ee01 in runtime.goexit () at /tmp/workdir/go/src/runtime/asm_amd64.s:1385 #4 0x0000000000000000 in ?? () END Breakpoint 2 at 0x4828fd: file /tmp/go-build668958384/main.go, line 18. hi Breakpoint 2, main.main () at /tmp/go-build668958384/main.go:19 19 } // END_OF_PROGRAM BEGIN goroutine 1 bt at the end #0 main.main () at /tmp/go-build668958384/main.go:19 END Note the difference in the output of 'info goroutine' (2 running goroutines vs 1 running goroutine). Is there any way to reliably reproduce the failure case? |
Here's a flake with very similar output on |
|
The failure here happens because at the time the gdb script tries to display information about goroutine 2, that goroutine no longer exists. Is that a bug because the goroutine has ended, or just normal behavior? Someone who knows more about what causes goroutines to come and go would have to answer that. It seems like the testcase should have a goroutine that won't go away soon and reference that in the gdb script. |
|
|
Time to mark as flaky and skip? |
Or, y'know, figure out why the test is failing and fix it... |
@bcmills I don't think that snarkiness is warranted. It is obviously better to fix tests than skip them. It is also better skip tests than to have random flakes, which waste peoples' time, mask real failures, and condition people not to bother investigating the dots of red on build.golang.org. This has been sitting without a fix for two years; it was originally marked Go 1.11. I think that's well over the threshold for marking as flaky, unless there's a concrete plan for who is going to look into it. |
Skipping the test ~everywhere is tantamount to deleting it, especially given that we do not currently have a way to detect unnecessary skips (#25951), and especially for tests that do not reliably fail. Skipping the test only on Linux would assume that the source of flakiness is linux-specific, but I do not know whether we have enough Skipping the test due to flakiness on only the builders would not fix the potential for flakes when folks run So even adding an appropriate If the test does not provide enough value to warrant the time from someone with enough context to either fix it or add an appropriately-tailored |
As I noted earlier in this issue, the gdb script is expecting goroutine 2 will exist throughout this test so it tries to do 'bt 2'. At the point it does that gdb thinks this is an error because it can't find goroutine 2. My guess is that goroutine 2 is gone (for whatever reason) when it tried to do the 'bt 2', which I don't think is a gdb error if that is the case. But in any case I think if you remove the line that does 'bt 2' (skip it) from the gdb script then the error should go away and you will still be testing gdb to some extent. That is my guess anyway. I looked at this test a while back because it was failing intermittently on ppc64le for a while. |
From looking at the source code that it is running, I can't tell what goroutine 2 is even supposed to be. It looks entirely single-threaded to me. Do you know? const helloSource = `
import "fmt"
import "runtime"
var gslice []string
func main() {
mapvar := make(map[string]string, 13)
mapvar["abc"] = "def"
mapvar["ghi"] = "jkl"
strvar := "abc"
ptrvar := &strvar
slicevar := make([]string, 0, 16)
slicevar = append(slicevar, mapvar["abc"])
fmt.Println("hi")
runtime.KeepAlive(ptrvar)
_ = ptrvar
gslice = slicevar
runtime.KeepAlive(mapvar)
} // END_OF_PROGRAM
` |
@josharian I got the same impression when I looked over the test. It seems that the test is making some sort of attempt to see if a different goroutine can be selected for a backtrace, but relying on runtime-specific goroutines to do that? |
Something like this might work, but I can't test particularly readily. diff --git a/src/runtime/runtime-gdb_test.go b/src/runtime/runtime-gdb_test.go
index 79b4621614..9f2ada1032 100644
--- a/src/runtime/runtime-gdb_test.go
+++ b/src/runtime/runtime-gdb_test.go
@@ -106,6 +106,7 @@ import "fmt"
import "runtime"
var gslice []string
func main() {
+ go func() { select{} }() // ensure a second goroutine is running
mapvar := make(map[string]string, 13)
mapvar["abc"] = "def"
mapvar["ghi"] = "jkl"
@@ -115,7 +116,7 @@ func main() {
slicevar = append(slicevar, mapvar["abc"])
fmt.Println("hi")
runtime.KeepAlive(ptrvar)
- _ = ptrvar
+ _ = ptrvar // set breakpoint here
gslice = slicevar
runtime.KeepAlive(mapvar)
} // END_OF_PROGRAM
@@ -167,6 +168,16 @@ func testGdbPython(t *testing.T, cgo bool) {
src := buf.Bytes()
+ // Locate breakpoint line
+ var bp int
+ lines := bytes.Split(src, []byte("\n"))
+ for i, line := range lines {
+ if bytes.Contains(line, []byte("breakpoint")) {
+ bp = i
+ break
+ }
+ }
+
err = ioutil.WriteFile(filepath.Join(dir, "main.go"), src, 0644)
if err != nil {
t.Fatalf("failed to create file: %v", err)
@@ -201,7 +212,7 @@ func testGdbPython(t *testing.T, cgo bool) {
}
args = append(args,
"-ex", "set python print-stack full",
- "-ex", "br main.go:15",
+ "-ex", fmt.Sprintf("br main.go:%d", bp),
"-ex", "run",
"-ex", "echo BEGIN info goroutines\n",
"-ex", "info goroutines", |
I tested Josh's change and it seems to work. I am happy to send a CL with this change (or Josh can send one) if folks think that it would be helpful. FWIW I also spent a while trying to reproduce the original flake on my workstation... lots of testing but no luck... so if I can't produce the problem in the first place, it's hard to know whether this new version is a real improvement. |
Thanks, @thanm. I’ll send a CL soonish and claim that it fixes this issue. And if the issue occurs again, we can re-open. |
@thanm if you're still up for it, feel free to send a CL. I'm AFK for an unknown time. |
OK, I will add it to my list (may not get to it for a while, I have other pots on the stove). |
Change https://golang.org/cl/226558 mentions this issue: |
Tweak the runtime's GDB python test to try to reduce flake failures. Background: the intent of the testpoint in question is to make sure that python-supported commands like "info goroutines" or "goroutine 1 backtrace" work properly. The Go code being run under the debugger as part of the test is single-threaded, but the test is written assuming that in addition to the primary goroutine there will be other background goroutines available (owned by the runtime). The flakiness seems to crop up the most when requesting a backtrace for one of these background goroutines; the speculation is that if we catch a runtime-owned goroutine in an odd state, this could interfere with the test. The change in this patch is to explicitly start an additional goroutine from the main thread, so that when the debugger stops the main thread we can be sure that there is some other non-main goroutine in a known state. This change authored by Josh Bleecher Snyder <josharian@gmail.com>. Updates #24616. Change-Id: I45682323d5898e5187c0adada7c5d117e92f403b Reviewed-on: https://go-review.googlesource.com/c/go/+/226558 Run-TryBot: Than McIntosh <thanm@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com> Reviewed-by: Bryan C. Mills <bcmills@google.com>
This happened again: https://storage.googleapis.com/go-build-log/72c918bb/linux-amd64-race_026d95f9.log Still no goroutine 2. Looks like there actually is a goroutine 2, doing bgsweep. Does gdb prevent us from doing bt on a runtime goroutine? |
In the log:
I think each #0 is the top (bottom?) of the stack for a goroutine. Note at the top it still says No goroutine 2. The bgsweep is part of the stack for another goroutine, and based what was shown above, that was goroutine 3. It still seems that goroutine 2 has exited by the time the second bt is attempted. |
What then is it even trying to test? Perhaps we should just delete the bt 2? We cannot reliably identify goroutines just by their number. I guess the alternative is to do some python scripting to parse ‘bt all’, identify the goroutine of interest, and bt it. Or we could parse ‘bt all’ and backtrace all live goroutines. I’m mostly inclined to delete ‘bt 2’. Opinions? |
Agree on the analysis; I am find with deleting the bt 2. Parsing "bt all" or "info goroutines" to find a specific goroutine seems like overkill. |
My assumption is that the purpose of the 'bt 2' was just to test the backtrace output. If you leave in the 'bt all' and remove 'bt 2' that should test it? I honestly don't know why goroutines would come and go but if it is gone that's not an error with gdb python but an expectation of the test. |
Still flaky after CL 226558, unfortunately:
|
Yep. See the last few comments above, which include a plan for moving forward. I’m AFK now but feel free to send a CL. Should be a simple one. |
Change https://golang.org/cl/233942 mentions this issue: |
Just saw this TestGdbPython flake on linux-amd64:
https://storage.googleapis.com/go-build-log/149e81d1/linux-amd64_fd28318f.log
/cc @aclements
The text was updated successfully, but these errors were encountered: