
peering: Fix ping pong hang. #2099

Merged: 3 commits merged into decred:master on Oct 14, 2021
Conversation

@JoeGruffins (Member)

Don't block in writeMessages when an error result cannot be sent. Ignore
the errors and let garbage collection take care of the goroutine.

Group the context-cancellation checks into one errgroup function. Disconnect
with that error after it returns.

Also, error with a timeout if the ping portion of a ping pong cannot be
sent within a time limit.

closes #2097

@JoeGruffins force-pushed the fixpeershang branch 2 times, most recently from c834ef4 to ecc872a on October 12, 2021 08:38
p2p/peering.go Outdated
Comment on lines 637 to 651
 		case <-gctx.Done():
 			return gctx.Err()
 		case <-time.After(pingInterval):
 			ctx, cancel := context.WithDeadline(ctx, time.Now().Add(15*time.Second))
 			// pingPong may cancel ctx which is the parent
 			// of gctx and cause the next iteration to follow
 			// gctx.Done().
 			rp.pingPong(ctx)
 			cancel()
 		}
 	}
 })
-err := g.Wait()
-if err != nil {
-	rp.Disconnect(err)
-}
+rp.Disconnect(g.Wait())
 rp.c.Close()
 }
@JoeGruffins (Member Author) commented on Oct 12, 2021:

I think it's a little cleaner. We can disconnect with the error after waiting, unless pingPong already disconnected us. The case <-gctx.Done(): lets this function return as soon as any other errgroup function returns an error, rather than waiting for ctx, which was the root cause of the bug. I don't think there is a good reason to wait for <-ctx.Done(), because I believe that would indicate gctx is also canceled, being its child, but I will test this out some more.

@JoeGruffins (Member Author) replied:

Confirming that canceling the parent cancels this gctx: https://play.golang.org/p/F_ZVfP7SW2L

p2p/peering.go Outdated
Comment on lines 864 to 873
	timeout := func() {
		if ctx.Err() == context.DeadlineExceeded {
			err := errors.E(errors.IO, "ping timeout")
			rp.Disconnect(err)
		}
	}
	select {
	case <-ctx.Done():
		timeout()
		return
@JoeGruffins (Member Author) commented on Oct 12, 2021:

This could also stop the bug on its own. I think it also makes sense to throw an error in case the initial ping couldn't be sent. Maybe there was a reason for not disconnecting here?

p2p/peering.go Outdated
Comment on lines 428 to 430
// A send on e blocking would indicate that either an error
// already happened or the context was canceled
// and the containing function has already returned.
A reviewer (Member) commented:

...or there's no receiver yet, which could happen if the goroutine executes before the select in the parent goroutine afaict

@JoeGruffins (Member Author) replied:

Maybe that's the reason for the capacity of one. But actually, because the capacity is one, I guess this already returns on the first error as soon as it happens and does not block until the second?

@chappjc (Member) commented on Oct 13, 2021:

That's the reason, yeah, at least with a non-blocking send. It's subtle, but it's an anti-pattern that go vet has recently learned to nag about in the specific case of the os.Signal channel given to signal.Notify, because Notify internally does a non-blocking send just like this one, which allows a signal to be missed.

https://play.golang.org/p/A_yQTu1K7Jf <- run it several times to see the outcome change; it sometimes deadlocks.

If this were a non-blocking send on an unbuffered channel, I don't think delivery would be guaranteed. I haven't looked closely at your analysis yet, so aside from this observation on the previously proposed edit, I don't know what, if anything, was incorrect about writeMessages. jrick may need to weigh in on the intent.

@JoeGruffins (Member Author) replied:

I believe I understand; thank you for the explanation. I'm still not sure what the implications are here. Because of the return right after, I don't think the behavior changes whether the channel is buffered or not: the loop is destroyed and the select is no longer listening, whether it waits for a receiver or returns and leaves the error sitting in the channel buffer. Actually, I'll just leave this function as it was, since it isn't part of the bug. I thought at first it was waiting for the error to be delivered, but because the channel is buffered it does not.

@JoeGruffins JoeGruffins marked this pull request as ready for review October 13, 2021 02:56
p2p/peering.go Outdated
Comment on lines 635 to 637
// pingPong may cancel ctx which is the parent
// of gctx and cause the next iteration to follow
// gctx.Done().
@chappjc (Member) commented on Oct 13, 2021:

Do you mean that pingPong may disconnect the remotePeer on timeout which in turn causes the other subtasks to return with an error which in turn cancels the group context gCtx?

Together with this select now receiving from gctx.Done(), which allows this subtask to return when another subtask error return cancels the group context, I see how the bug existed.

EDIT: OK yeah that's what you said in #2099 (comment), but this comment is super confusing because pingPong doesn't cancel ctx

Also, it looks like it should have been context.WithDeadline(gctx, ...) rather than context.WithDeadline(ctx, ...), which would allow pingPong to return before the timeout if one of the other subtasks errors and thus cancels gctx (but not ctx).

To summarize, the minimal fix looks to me like:

diff --git a/p2p/peering.go b/p2p/peering.go
index f1d11bd2..0c908748 100644
--- a/p2p/peering.go
+++ b/p2p/peering.go
@@ -630,10 +630,10 @@ func (lp *LocalPeer) serveUntilError(ctx context.Context, rp *RemotePeer) {
        g.Go(func() error {
                for {
                        select {
-                       case <-ctx.Done():
-                               return ctx.Err()
+                       case <-gctx.Done():
+                               return gctx.Err()
                        case <-time.After(2 * time.Minute):
-                               ctx, cancel := context.WithDeadline(ctx, time.Now().Add(15*time.Second))
+                               ctx, cancel := context.WithDeadline(gctx, time.Now().Add(15*time.Second))
                                rp.pingPong(ctx)
                                cancel()
                        }

(with the extra timeout() in pingPong)

Another reviewer (Member) replied:

this looks like the proper fix to me

@JoeGruffins (Member Author) replied:

Ok, shall I just do the minimum then?

@chappjc (Member) left a review comment:

I've had this exact diff running for about 10 hrs on a testnet SPV ticketbuying+mixing wallet without trouble.

I have not tried it in the simnet harness where I first saw the hang, but it's certainly not breaking anything afaict.

Does the repro diff you posted in #2097 (comment) still hit it with this patch @JoeGruffins?

I believe any added Disconnects inside pingPong aren't needed, because if ctx is canceled or any subtask in serveUntilError errors, rp.Disconnect is called anyway. At least it looks that way. I'd hate to make superfluous changes that break something new in a subtle way, even though your previous pingPong edits seemed reasonable. jrick's call though, of course.

Good sleuthing @JoeGruffins.

@jrick jrick merged commit fca8d1b into decred:master Oct 14, 2021
@JoeGruffins (Member Author) commented on Oct 15, 2021:

Does the repro diff you posted in #2097 (comment) still hit it with this patch @JoeGruffins?

yes

I really wanted to squash the *squash* commit...

Development

Successfully merging this pull request may close these issues:

- spv wallet sync lost

3 participants