peering: Fix ping pong hang. #2099
Conversation
Force-pushed from c834ef4 to ecc872a
p2p/peering.go (Outdated)
			case <-gctx.Done():
				return gctx.Err()
			case <-time.After(pingInterval):
				ctx, cancel := context.WithDeadline(ctx, time.Now().Add(15*time.Second))
				// pingPong may cancel ctx which is the parent
				// of gctx and cause the next iteration to follow
				// gctx.Done().
				rp.pingPong(ctx)
				cancel()
			}
		}
	})
	err := g.Wait()
	if err != nil {
		rp.Disconnect(err)
	}
	rp.Disconnect(g.Wait())
	rp.c.Close()
}
I think it's a little cleaner. We can disconnect with the error after waiting, unless pingPong already disconnected us. The case <-gctx.Done(): lets this function return as soon as any other errgroup function returns an error, rather than waiting for ctx, which was the root cause of the bug. I don't think there is a good reason to wait on <-ctx.Done(), because I believe that would mean gctx is also canceled, since it is ctx's child, but I will test this out some more.
Confirming that canceling the parent cancels this gctx: https://play.golang.org/p/F_ZVfP7SW2L
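For reference, a self-contained sketch making the same point as that playground link (the names parent, cancel, and gctx here are illustrative, not taken from peering.go): canceling the parent context also cancels the context returned by errgroup.WithContext.

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	parent, cancel := context.WithCancel(context.Background())
	g, gctx := errgroup.WithContext(parent)

	g.Go(func() error {
		<-gctx.Done() // unblocks once the parent is canceled
		return gctx.Err()
	})

	time.AfterFunc(100*time.Millisecond, cancel) // cancel the parent
	fmt.Println(g.Wait())                        // prints "context canceled": gctx follows its parent
}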
Don't block in writeMessages when an error result cannot be sent. Ignore the errors and let garbage collection take care of the go function. Group context detecting blocks in one function of errgroup. Disconnect with that error after return. Also error with timeout if the ping portion of a ping pong cannot be sent after a time limit.
Force-pushed from ecc872a to dbba18b
p2p/peering.go (Outdated)
	timeout := func() {
		if ctx.Err() == context.DeadlineExceeded {
			err := errors.E(errors.IO, "ping timeout")
			rp.Disconnect(err)
		}
	}
	select {
	case <-ctx.Done():
		timeout()
		return
This could also stop the bug on its own. I think it also makes sense to raise an error in case the initial ping couldn't be sent. Maybe there was a reason for not disconnecting here?
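A self-contained sketch of that suggestion, assuming hypothetical stand-ins (sendPing, pong, and the disconnect callback are placeholders, not the real peering.go API): the same deadline-driven timeout closure guards both sending the ping and waiting for the pong.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pingPongSketch stands in for peering.go's pingPong; sendPing, pong, and
// disconnect are hypothetical placeholders for the real peer internals.
func pingPongSketch(ctx context.Context, sendPing chan<- struct{}, pong <-chan struct{}, disconnect func(error)) {
	timeout := func() {
		if ctx.Err() == context.DeadlineExceeded {
			disconnect(errors.New("ping timeout"))
		}
	}

	// Guard the initial ping send with the same deadline as the pong wait,
	// so a blocked send disconnects instead of hanging forever.
	select {
	case sendPing <- struct{}{}:
	case <-ctx.Done():
		timeout()
		return
	}

	select {
	case <-pong:
	case <-ctx.Done():
		timeout()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	// Nothing ever receives the ping, so the send blocks until the deadline.
	pingPongSketch(ctx, make(chan struct{}), make(chan struct{}),
		func(err error) { fmt.Println("disconnect:", err) })
}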
p2p/peering.go (Outdated)
	// e blocking would indicate that either an error
	// already happened or the context was canceled
	// and the containing function has already returned.
...or there's no receiver yet, which could happen if the goroutine executes before the select in the parent goroutine afaict
Maybe that's the reason for the capacity of one. But actually, because the capacity is one, I guess this already returns on the first error as soon as it happens and does not block until the second?
That's the reason, yeah, at least with a non-blocking send. It's subtle, but it's an anti-pattern, which go vet has recently learned to nag about in the specific case of the os.Signal channel given to signal.Notify, because that internally has a non-blocking send just like this that allows a signal to be missed.
https://play.golang.org/p/A_yQTu1K7Jf <- run it several times to see how the outcome changes; it sometimes deadlocks.
If this were a non-blocking send with an unbuffered channel, I don't think it would be guaranteed to work.
I haven't looked closely at your analysis yet, so aside from this observation on the previously proposed edit, I don't know what, if anything, was incorrect about writeMessages. jrick may need to weigh in on the intent.
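A toy illustration of the difference (the errc name and trySend helper are just for illustration): a non-blocking send into an unbuffered channel is simply dropped if no receiver is ready yet, while a channel with capacity one parks the value until the receiver gets to it.

package main

import "fmt"

// trySend is a non-blocking send, the pattern being discussed.
func trySend(errc chan error, err error) {
	select {
	case errc <- err:
	default: // no waiting receiver and no buffer space: the value is lost
	}
}

func main() {
	unbuffered := make(chan error)
	buffered := make(chan error, 1)

	trySend(unbuffered, fmt.Errorf("dropped")) // no receiver yet: lost
	trySend(buffered, fmt.Errorf("delivered")) // parked in the buffer

	select {
	case err := <-unbuffered:
		fmt.Println("unbuffered:", err)
	default:
		fmt.Println("unbuffered: nothing received")
	}
	fmt.Println("buffered:", <-buffered)
}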
I believe I understand. Thank you for the explanation. I'm still not sure what the implications are here. Because of the return right after, I don't think the behavior changes whether the channel is buffered or not, because the loop is destroyed and the select is no longer listening, whether it waits for a receiver or returns and keeps the error in the channel buffer. Actually, I guess I'll just leave this function how it was since it isn't part of the bug. I thought it was waiting for the error to be delivered at first, but because it is buffered it does not.
Force-pushed from 6ab0478 to 0d03b5e
p2p/peering.go (Outdated)
				// pingPong may cancel ctx which is the parent
				// of gctx and cause the next iteration to follow
				// gctx.Done().
Do you mean that pingPong may disconnect the remotePeer on timeout, which in turn causes the other subtasks to return with an error, which in turn cancels the group context gctx?
Together with this select now receiving from gctx.Done(), which allows this subtask to return when another subtask's error return cancels the group context, I see how the bug existed.
EDIT: OK yeah, that's what you said in #2099 (comment), but this comment is super confusing because pingPong doesn't cancel ctx.
Also it looks like it should have been context.WithDeadline(gctx, ...) rather than context.WithDeadline(ctx, ...), which would allow pingPong to return before the timeout if one of the other subtasks errors and thus cancels gctx (but not ctx).
To summarize, the minimal fix looks to me like:
diff --git a/p2p/peering.go b/p2p/peering.go
index f1d11bd2..0c908748 100644
--- a/p2p/peering.go
+++ b/p2p/peering.go
@@ -630,10 +630,10 @@ func (lp *LocalPeer) serveUntilError(ctx context.Context, rp *RemotePeer) {
 	g.Go(func() error {
 		for {
 			select {
-			case <-ctx.Done():
-				return ctx.Err()
+			case <-gctx.Done():
+				return gctx.Err()
 			case <-time.After(2 * time.Minute):
-				ctx, cancel := context.WithDeadline(ctx, time.Now().Add(15*time.Second))
+				ctx, cancel := context.WithDeadline(gctx, time.Now().Add(15*time.Second))
 				rp.pingPong(ctx)
 				cancel()
 			}
(with the extra timeout() in pingPong)
this looks like the proper fix to me
Ok, shall I just do the minimum then?
I've had this exact diff running for about 10 hrs on a testnet SPV ticketbuying+mixing wallet without trouble.
I have not tried it in the simnet harness where I first saw the hang, however, but it's certainly not breaking anything afaict.
Does the repro diff you posted in #2097 (comment) still hit it with this patch @JoeGruffins?
I believe any added Disconnects inside pingPong aren't needed because if ctx is canceled or there is any subtask error in serveUntilError, rp.Disconnect is called anyway. At least it looks that way. I'd hate to make superfluous changes that break something new in a subtle way, even though your previous pingPong edits seemed reasonable. jrick's call though ofc.
Good sleuthing @JoeGruffins.
yes I really wanted to squash the
closes #2097