
peering: Fix ping pong hang. #2099

Merged: 3 commits merged into decred:master on Oct 14, 2021
Conversation

@JoeGruffins (Member)

Don't block in writeMessages when an error result cannot be sent. Ignore
the errors and let garbage collection take care of the goroutine.

Group the context-cancellation checks into one errgroup function. Disconnect
with that error after it returns.

Also, error with a timeout if the ping portion of a ping pong cannot be
sent within a time limit.

closes #2097

@JoeGruffins force-pushed the fixpeershang branch 2 times, most recently from c834ef4 to ecc872a on October 12, 2021 08:38
p2p/peering.go Outdated
Comment on lines 637 to 651
 		case <-gctx.Done():
 			return gctx.Err()
 		case <-time.After(pingInterval):
 			ctx, cancel := context.WithDeadline(ctx, time.Now().Add(15*time.Second))
 			// pingPong may cancel ctx which is the parent
 			// of gctx and cause the next iteration to follow
 			// gctx.Done().
 			rp.pingPong(ctx)
 			cancel()
 		}
 	}
 })
-err := g.Wait()
-if err != nil {
-	rp.Disconnect(err)
-}
+rp.Disconnect(g.Wait())
 rp.c.Close()
 }
@JoeGruffins (Member Author) commented on Oct 12, 2021:

I think it's a little cleaner. We can disconnect with the error after waiting, unless pingPong already disconnected us. The case <-gctx.Done(): lets this function return as soon as any other errgroup function returns an error, rather than waiting for ctx, which was the root cause of the bug. I don't think there is a good reason to wait for <-ctx.Done(), because I believe that would indicate gctx is also canceled, being its child, but I will test this out some more.

@JoeGruffins (Member Author) replied:

Confirming that canceling the parent cancels this gctx: https://play.golang.org/p/F_ZVfP7SW2L

p2p/peering.go Outdated
Comment on lines 864 to 873
	timeout := func() {
		if ctx.Err() == context.DeadlineExceeded {
			err := errors.E(errors.IO, "ping timeout")
			rp.Disconnect(err)
		}
	}
	select {
	case <-ctx.Done():
		timeout()
		return
@JoeGruffins (Member Author) commented on Oct 12, 2021:

This could also stop the bug on its own. I think it also makes sense to throw an error in case the initial ping couldn't be sent. Maybe there was a reason for not disconnecting here?

p2p/peering.go Outdated
Comment on lines 428 to 430
// A send on e blocking would indicate that either an error
// already happened or the context was canceled
// and the containing function has already returned.
A reviewer (Member) commented:

...or there's no receiver yet, which could happen if the goroutine executes before the select in the parent goroutine afaict

@JoeGruffins (Member Author) replied:

Maybe that's the reason for the capacity of one. But actually, because the capacity is one, I guess this already returns on the first error as soon as it happens and does not block until the second?

@chappjc (Member) commented on Oct 13, 2021:

That's the reason, yeah, at least with a non-blocking send. It's subtle, but it's an anti-pattern that go vet has recently learned to nag about in the specific case of the os.Signal channel given to signal.Notify, because Notify internally does a non-blocking send just like this one, which allows a signal to be missed.

https://play.golang.org/p/A_yQTu1K7Jf <- run it several times to see the outcome change; it sometimes deadlocks.

If this were a non-blocking send on an unbuffered channel, I don't think delivery would be guaranteed. I haven't looked closely at your analysis yet, so aside from this observation on the previously proposed edit, I don't know what, if anything, was incorrect about writeMessages. jrick may need to weigh in on the intent.

@JoeGruffins (Member Author) replied:

I believe I understand; thank you for the explanation. I'm still not sure what the implications are here. Because of the return right after, I don't think the behavior changes whether the channel is buffered or not: the loop is destroyed and the select is no longer listening, whether it waits for a receiver or returns and leaves the error sitting in the channel buffer. Actually, I'll just leave this function as it was, since it isn't part of the bug. I thought at first it was waiting for the error to be delivered, but because the channel is buffered it does not.

@JoeGruffins JoeGruffins marked this pull request as ready for review October 13, 2021 02:56
p2p/peering.go Outdated
Comment on lines 635 to 637
// pingPong may cancel ctx which is the parent
// of gctx and cause the next iteration to follow
// gctx.Done().
@chappjc (Member) commented on Oct 13, 2021:

Do you mean that pingPong may disconnect the remotePeer on timeout which in turn causes the other subtasks to return with an error which in turn cancels the group context gCtx?

Together with this select now receiving from gctx.Done(), which allows this subtask to return when another subtask error return cancels the group context, I see how the bug existed.

EDIT: OK yeah that's what you said in #2099 (comment), but this comment is super confusing because pingPong doesn't cancel ctx

Also, it looks like it should have been context.WithDeadline(gctx, ...) rather than context.WithDeadline(ctx, ...), which would allow pingPong to return before the timeout if one of the other subtasks errors and thus cancels gctx (but not ctx).

To summarize, the minimal fix looks to me like:

diff --git a/p2p/peering.go b/p2p/peering.go
index f1d11bd2..0c908748 100644
--- a/p2p/peering.go
+++ b/p2p/peering.go
@@ -630,10 +630,10 @@ func (lp *LocalPeer) serveUntilError(ctx context.Context, rp *RemotePeer) {
        g.Go(func() error {
                for {
                        select {
-                       case <-ctx.Done():
-                               return ctx.Err()
+                       case <-gctx.Done():
+                               return gctx.Err()
                        case <-time.After(2 * time.Minute):
-                               ctx, cancel := context.WithDeadline(ctx, time.Now().Add(15*time.Second))
+                               ctx, cancel := context.WithDeadline(gctx, time.Now().Add(15*time.Second))
                                rp.pingPong(ctx)
                                cancel()
                        }

(with the extra timeout() in pingPong)

Another reviewer (Member) replied:

this looks like the proper fix to me

@JoeGruffins (Member Author) replied:

Ok, shall I just do the minimum then?

@chappjc (Member) left a review comment:

I've had this exact diff running for about 10 hrs on a testnet SPV ticketbuying+mixing wallet without trouble.

I have not tried it in the simnet harness where I first saw the hang, but it's certainly not breaking anything afaict.

Does the repro diff you posted in #2097 (comment) still hit it with this patch @JoeGruffins?

I believe any added Disconnects inside pingPong aren't needed, because if ctx is canceled or any subtask in serveUntilError errors, rp.Disconnect is called anyway. At least it looks that way. I'd hate to make superfluous changes that break something new in a subtle way, even though your previous pingPong edits seemed reasonable. jrick's call though, of course.

Good sleuthing @JoeGruffins.

@jrick jrick merged commit fca8d1b into decred:master Oct 14, 2021
@JoeGruffins (Member Author) commented on Oct 15, 2021:

Does the repro diff you posted in #2097 (comment) still hit it with this patch @JoeGruffins?

yes

I really wanted to squash the *squash* commit...

Development

Successfully merging this pull request may close these issues:

- spv wallet sync lost

3 participants