
testing: fuzz reports spurious "process hung or terminated unexpectedly" errors #56238

Open
rsc opened this issue Oct 14, 2022 · 12 comments
Labels: fuzz (Issues related to native fuzzing support), NeedsFix (The path to resolution is known, but the work has not been done.)

rsc (Contributor) commented Oct 14, 2022

go test -v -fuzz=Decode image/gif

consistently produces output like:

...
=== RUN   TestEncodeWrappedImage
--- PASS: TestEncodeWrappedImage (0.06s)
=== FUZZ  FuzzDecode
fuzz: elapsed: 0s, gathering baseline coverage: 0/5 completed
fuzz: elapsed: 3s, gathering baseline coverage: 2/5 completed
fuzz: elapsed: 6s, gathering baseline coverage: 4/5 completed
fuzz: elapsed: 9s, gathering baseline coverage: 5/5 completed, now fuzzing with 16 workers
fuzz: elapsed: 9s, execs: 25 (7/sec), new interesting: 0 (total: 5)
fuzz: elapsed: 12s, execs: 26 (0/sec), new interesting: 0 (total: 5)
fuzz: elapsed: 15s, execs: 28 (1/sec), new interesting: 0 (total: 5)
fuzz: elapsed: 18s, execs: 28 (0/sec), new interesting: 0 (total: 5)
fuzz: elapsed: 20s, execs: 29 (1/sec), new interesting: 0 (total: 5)
--- FAIL: FuzzDecode (19.88s)
    fuzzing process hung or terminated unexpectedly: exit status 2
    Failing input written to testdata/fuzz/FuzzDecode/a601188e726161766a8b077333fe62493bc92397a4c3bf5b205ac46814029453
    To re-run:
    go test -run=FuzzDecode/a601188e726161766a8b077333fe62493bc92397a4c3bf5b205ac46814029453
FAIL

Of course the test case changes each time, but when I rerun it using 'go test', the test case passes fine. This happens on the current Go dev branch too; I tried Go 1.19 to make sure it wasn't related to changes I have made to package testing.

rsc added the NeedsFix label Oct 14, 2022
rsc added this to the Go1.20 milestone Oct 14, 2022
@ZackaryWelch

I've seen this a lot too: the exact same status, and the failed test passing on rerun. I see two places where this occurs in the source, one marked with a TODO. Other discussions mention this could be a bug as well.

@ZackaryWelch

The TODO in the source links to #48127 (cmd/go: -keepfuzzing needs renaming, does not exist).

gopherbot modified the milestones: Go1.20 → Go1.21 Feb 1, 2023
gopherbot modified the milestones: Go1.21 → Go1.22 Aug 8, 2023
mprimi commented Oct 9, 2023

@julieqiu @toothrot should this be tagged fuzz?
(The issue is still present, exactly as described, as of Go 1.20.x.)

mprimi commented Oct 11, 2023

I have a repro for this issue, as described by @rsc.

In my specific case, it usually fails within seconds, during the "gathering baseline coverage" phase.

It looks like a race condition on the coordinator/worker pipe:

fuzzing process hung or terminated unexpectedly: exit status 2

@katiehockman @rolandshoemaker @jayconrod @bcmills: I see you authored most of the fuzzing code. Any chance you want to take a look? I don't want to share my repro here, but I'm happy to share it privately.

bcmills added the fuzz label Oct 12, 2023
bcmills (Contributor) commented Oct 12, 2023

@mprimi, note that of all the folks you've tagged only @rolandshoemaker and I are still on the Go team. 😅

(@golang/fuzzing is the right entity to tag for this sort of issue.)

bcmills (Contributor) commented Oct 12, 2023

@mprimi, I'm not sure I quite follow. What triggers the suspected race condition? (Is it caused by a worker that finds a crashing input, and finishes crashing before the coordinator has read that input?)

mprimi commented Oct 24, 2023

@bcmills
The original issue (by @rsc) describes a case where the fuzzer misbehaves and terminates with "fuzzing process hung or terminated unexpectedly: exit status 2".
It leaves behind a seed corpus entry, but upon re-running it, the test passes. Something funny is going on.

This is NOT a case of:

  • A fuzzed test failing -- that is handled just fine
  • A fuzzed test panicking or crashing -- that is also handled, and produces a different error

The issue described seems like a bug in the fuzzer itself where something goes wrong at the worker level, but it's not clear what.

I posted a response here because:

  • I have a repro for this issue (in case the original repro, go test -v -fuzz=Decode image/gif, no longer reproduces it).
  • I have spent some time looking into this. From what I can tell, it looks like a race condition between the fuzz coordinator and a worker: one of the two tries to read serialized JSON from a pipe before it's available, and it crashes with an EOF (see the sketch at the end of this comment).

If anyone from @golang/fuzzing is willing to take a look, I can share a (GitHub) link to a test that reproduces the issue reliably. (Even though it's public code, I'll share that link privately; I don't want to link to my repro from this issue.)
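
For illustration, here is the sketch: a minimal standalone program showing the failure mode. This is not the actual internal/fuzz code; the struct and field names are made up. It demonstrates how the coordinator side can hit io.EOF when the worker dies before writing its JSON response.

    // eofrace.go: a coordinator-style reader gets io.EOF when the
    // "worker" end of the pipe is closed before a response is written.
    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "os"
    )

    // workerResponse is a stand-in for the message a worker would send.
    type workerResponse struct {
        Count int64 `json:"count"`
    }

    func main() {
        r, w, err := os.Pipe()
        if err != nil {
            panic(err)
        }
        // Simulate a worker crashing before it writes anything:
        // close the write end without sending a response.
        w.Close()

        var resp workerResponse
        if err := json.NewDecoder(r).Decode(&resp); err == io.EOF {
            // This is the kind of EOF the coordinator then surfaces as
            // "fuzzing process hung or terminated unexpectedly".
            fmt.Println("pipe closed before response:", err)
        }
    }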

0xalpharush commented Nov 30, 2023

Any resolution or update on this? It's blocking my use of Go fuzzing.

EDIT: Using -parallel=1 seemed to prevent the crashes, but it's significantly slower.
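
For anyone who wants to try that workaround, the flag is passed like any other go test flag (FuzzDecode here is a placeholder for your own fuzz target):

go test -fuzz=FuzzDecode -parallel=1 .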

I ran the test case through Delve and set a breakpoint where the crash seems to originate:

  1. Compile the fuzz test into an executable: go test -c -o test -fuzz MyFuzzTest -gcflags=all="-N -l" ./mytests/...
  2. Run dlv exec ./test -- -test.v -test.fuzz MyFuzzTest -test.run MyFuzzTest -test.fuzzcachedir ./fuzz/cache
  3. Enter b /opt/homebrew/Cellar/go@1.20/1.20.11/libexec/src/internal/fuzz/worker.go:186
Frame 1: /opt/homebrew/Cellar/go@1.20/1.20.11/libexec/src/internal/fuzz/worker.go:186 (PC: 102cc84c4)
   181:                                         return err
   182:                                 }
   183:                                 // Unexpected termination. Set error message and fall through.
   184:                                 // We'll restart the worker on the next iteration.
   185:                                 // Don't attempt to minimize this since it crashed the worker.
=> 186:                                 resp.Err = fmt.Sprintf("fuzzing process hung or terminated unexpectedly: %v", w.waitErr)
   187:                                 canMinimize = false
   188:                         }
   189:                         result := fuzzResult{
   190:                                 limit:         input.limit,
   191:                                 count:         resp.Count,
(dlv) print w.waitErr
error(*os/exec.ExitError) *{
        ProcessState: *os.ProcessState {
                pid: 50169,
                status: 512,
                rusage: *(*syscall.Rusage)(0x14000740000),},
        Stderr: []uint8 len: 0, cap: 0, nil,}
(dlv) print err
error(*errors.errorString) *{s: "EOF"}

EDIT: OS info: Darwin arm64

AlekSi (Contributor) commented Jan 17, 2024

I wonder if this issue is somehow architecture-specific. I hit it almost every minute on a Mac mini with an Intel processor, but couldn't replicate it with the same code on Apple silicon.

EDIT: never mind, it fails in the same way on Apple silicon; it just takes hours instead of minutes.

@maxammann

I think there are two related issues that get mixed up here. The fuzzer crash can be caused by either:

  1. An OOM kill, which results in the worker process being killed (check your dmesg output for OOM events). Likely that is the issue here.
  2. The worker process hitting a panic and exiting with exit code 2.

I was able to extract the following stack trace for case 2 by using strace:

panic: deadlocked!

goroutine 19 [running]:
internal/fuzz.RunFuzzWorker.func1.1()
	/usr/lib/go-1.21/src/internal/fuzz/worker.go:493 +0x25
created by time.goFunc
	/usr/lib/go-1.21/src/time/sleep.go:176 +0x2d

After this panic, the worker process exits with exit code 2, according to strace.
I created a workaround patch that avoids this crash; with it, at least for image/gif, the fuzzer crashes go away.

This is the commit that introduced the behavior of crashing if a test case takes longer than 10s to execute.

We still need a real fix, though, to keep the whole fuzzer from stopping when this panic is thrown. There seems to be another bug somewhere; potentially it has to do with the fact that the panic is thrown from a timer goroutine (see the sketch below).
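
Here is the sketch: a minimal program, not the real internal/fuzz watchdog (the names are made up), showing why a deadline panic raised on a timer's goroutine takes down the whole worker process. Nothing on the worker's own call stack can recover a panic that starts on another goroutine, so the process exits with status 2, matching the strace output above.

    // watchdog.go: a per-iteration deadline enforced via a timer.
    // The panic fires on the timer's own goroutine, so no recover()
    // in the caller's stack can catch it; the process prints
    // "panic: deadlocked!" and exits with status 2.
    package main

    import "time"

    func runWithWatchdog(deadline time.Duration, fn func()) {
        t := time.AfterFunc(deadline, func() {
            panic("deadlocked!") // crashes the whole process
        })
        defer t.Stop()
        fn()
    }

    func main() {
        // A "hung" iteration: sleeps past the watchdog deadline.
        runWithWatchdog(100*time.Millisecond, func() {
            time.Sleep(time.Second)
        })
    }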

@maxammann

If you want to help test this workaround:

@maxammann

To summarize: this is not a bug but expected behavior until #48157 is fixed. I would recommend closing this issue and continuing to work on a PR for that.

Maybe we should improve the error message: we could add a field to the serialized data sent between the worker and the coordinator that indicates a hang (rough sketch below). That may be difficult, though, because we would need to interrupt the current worker and then return an error.
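
A rough sketch of that idea (the package, type, and field names here are hypothetical, not the real internal/fuzz types):

    package fuzz // hypothetical sketch, not the real internal/fuzz

    // fuzzResponse is a hypothetical worker-to-coordinator response.
    type fuzzResponse struct {
        Count int64  `json:"count"`
        Err   string `json:"err,omitempty"`
        // Hung would let the coordinator distinguish a watchdog
        // timeout from a genuinely unexpected worker crash, so it
        // could report something clearer than the generic
        // "process hung or terminated unexpectedly" message.
        Hung bool `json:"hung,omitempty"`
    }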
