nsqd: fix ephemeral channel sub live lock #1314
Conversation
force-pushed from 1a3a6f6 to adc3e0d
force-pushed from 539d64b to ec2c44f
@ploxiln review this if you're reviewing things 😏
}

if (channel.ephemeral && channel.Exiting()) || (topic.ephemeral && topic.Exiting()) {
	channel.RemoveClient(client.ID)
	time.Sleep(1 * time.Millisecond)
	continue
	if i < 2 {
only two attempts ... could do a couple more to be "safe" ... but ephemeral channel clients can retry too, and already give up guarantees. ok, makes sense
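For context, the bounded retry this comment refers to presumably reads roughly like the sketch below (inside the nsqd package's SUB handler). This is a reconstruction, not the verbatim patch: the loop counter `i`, the 100 ms back-off, and the final error text are assumptions inferred from the visible hunk and the PR description ("limited the retry loop and increased the timeout").

```go
// Sketch of the SUB handler's bounded retry (a reconstruction; the 100ms
// back-off and error message are assumptions, not the literal diff).
var channel *Channel
for i := 1; ; i++ {
	topic := p.nsqd.GetTopic(topicName)
	channel = topic.GetChannel(channelName)
	if err := channel.AddClient(client.ID, client); err != nil {
		return nil, protocol.NewFatalClientErr(err, "E_SUB_FAILED", "SUB failed "+err.Error())
	}

	// The ephemeral topic/channel may have started exiting between lookup and
	// AddClient; back off once and retry instead of spinning forever.
	if (channel.ephemeral && channel.Exiting()) || (topic.ephemeral && topic.Exiting()) {
		channel.RemoveClient(client.ID)
		if i < 2 {
			time.Sleep(100 * time.Millisecond) // assumed increased back-off
			continue
		}
		return nil, protocol.NewFatalClientErr(nil, "E_SUB_FAILED",
			"SUB failed to subscribe to exiting ephemeral topic/channel")
	}
	break
}
```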
return nil, protocol.NewFatalClientErr(nil, "E_TOO_MANY_CHANNEL_CONSUMERS",
	fmt.Sprintf("channel consumers for %s:%s exceeds limit of %d",
		topicName, channelName, p.nsqd.getOpts().MaxChannelConsumers))
return nil, protocol.NewFatalClientErr(err, "E_SUB_FAILED", "SUB failed "+err.Error())
hmm, now this is where the exiting will be checked, and we won't retry for ephemeral channels exiting as intended
I think that's fine; if we looked up a channel that is indeed exiting, we want this to fail here.
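Reading the two comments together: the exiting check now effectively lives behind Channel.AddClient, so subscribing to an exiting channel surfaces as an AddClient error that the SUB handler wraps as E_SUB_FAILED instead of retrying. A rough in-package sketch of that shape, assuming field, lock, and message details that are not confirmed by the visible hunk:

```go
// Sketch of Channel.AddClient with the exiting and consumer-limit checks
// folded in (a reconstruction; exact fields, locks, and messages are assumptions).
func (c *Channel) AddClient(clientID int64, client Consumer) error {
	c.exitMutex.RLock()
	defer c.exitMutex.RUnlock()

	if c.Exiting() {
		return errors.New("exiting") // the SUB handler wraps this as E_SUB_FAILED
	}

	c.Lock()
	defer c.Unlock()

	if _, ok := c.clients[clientID]; ok {
		return nil // already subscribed
	}

	max := c.nsqd.getOpts().MaxChannelConsumers
	if max != 0 && len(c.clients) >= max {
		return fmt.Errorf("channel consumers for %s:%s exceeds limit of %d",
			c.topicName, c.name, max)
	}

	c.clients[clientID] = client
	return nil
}
```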
force-pushed from ec2c44f to 62d202f
@jehiah last chance before I land this 😏
👍 I've not reviewed exhaustively, but this LGTM
Fixes #1251
I noticed a few things while debugging this...
The reproducible steps from #1251 caused file descriptor exhaustion in nsqd on my macOS laptop (default configuration). The stacktrace from nsqd confirmed what had previously been reported, a ton of contention on nsqd.RWMutex, but I noticed this in particular:

This was the goroutine ultimately holding the lock. We don't need to be persisting metadata for ephemeral topics/channels, so after changing that it "fixed" the issue in my testing. It would still be possible to construct a pathological test to cause this, so I additionally limited the retry loop and increased the timeout (which, IMO, is the "correct" behavior here anyway).
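The metadata change described above amounts to skipping PersistMetadata when the thing being notified is ephemeral. A minimal in-package sketch of that shape, assuming a `persist` parameter on NSQD.Notify and the call sites shown in the comments (none of which are confirmed verbatim by this page):

```go
// Sketch: only persist metadata for non-ephemeral topics/channels
// (the persist parameter and call sites are assumptions, not the literal diff).
func (n *NSQD) Notify(v interface{}, persist bool) {
	// in-memory metadata is incomplete while loading, so never persist then
	loading := atomic.LoadInt32(&n.isLoading) == 1
	n.waitGroup.Wrap(func() {
		select {
		case <-n.exitChan:
		case n.notifyChan <- v:
			if loading || !persist {
				return
			}
			n.Lock()
			if err := n.PersistMetadata(); err != nil {
				n.logf(LOG_ERROR, "failed to persist metadata - %s", err)
			}
			n.Unlock()
		}
	})
}

// Hypothetical call sites: ephemeral topics/channels pass persist=false, e.g.
//   t.nsqd.Notify(t, !t.ephemeral)
//   c.nsqd.Notify(c, !c.ephemeral)
```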
I fixed a few other things while debugging...