-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsqd: fix full topic sync accumulation #993
nsqd: fix full topic sync accumulation #993
Conversation
9c0bd91
to
67d70f8
Compare
That looks like a good fix. But I think the explanation isn't quite right.
The problem/fix is for when connecting seemed to succeed, but sending the So, I'm skeptical that go-routines accumulated. I think the loop between Command(), connectCallback(), and syncTopicChan, and back to Command() and back to connectCallback(), is to blame for the cpu usage. I also wonder if anything different should be done for I also wonder if we should replace the go routines which just wait to send on a channel, with a non-blocking write to a channel with enough capacity for one message for each LookupPeer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All that said, adding return
here looks correct.
Thanks @andyxning for the proposed fix and @ploxiln for the analysis. I'd like to think through this before we land it, if y'all don't mind, but I need a few days to have some time to do so. |
@mreiferson That's ok to me. Hope you merge this asap. :) |
ping @mreiferson in case you forget this. :) |
This fix looks fine, but this suggestion from @ploxiln also seems useful:
In general this particular set of code is a mess and could probably use some refactoring. After re-reading it, I can't seem to figure out how on earth a failed connection gets cleaned up. |
IIUC, you mean that according to the suggestion from @ploxiln , you can not figure out that how to stop failed connections from accumulating? |
Ping @mreiferson |
It appears that the When The This all looks like it works as desired. The problem which this PR fixes is when @andyxning I think mreiferson was waiting to see if you were interested in doing some further small cleanups of this area, highlighted in the discussion. |
After looking again, maybe I see some good cause for confusion. As described above, the Lines 93 to 120 in e6debab
|
@ploxiln That's is the truth. The Identify command fails to lookupd.
OK, but before that refactor, can we first review and merge this PR?
That's also the truth. We should check the returned error. But i can not make sure that nsqlookupd will also fail with magic string. The better solution is to both check for the errors and fail fast. Maybe we can do this in another PR. @ploxiln @mreiferson WDYT. |
Ping @mreiferson :) |
Sorry for the delay @andyxning! |
This PR will avoid the topic sync accumulation when error in trying to connect to nsqlookupd for register or unregister.
Currently, if nsqd can not do register/unregister a topic/channel, the
lp
address is sent tosyncTopicChan
. BecausesyncTopicChan
is a blocked channel, the goroutine for writinglp
tosyncTopicChan
will be accumulation. After the network to nsqlookupd is ok, the accumulated requests insyncTopicChan
will be consumed by nsqd and a full topic/channel register will be operated. This will cause a performance downgrade for a busy nsqd instance.The proper way to handle identify error is just logging and returning instead of still writing
lp
tosyncTopicChan
./cc @mreiferson @jehiah @ploxiln