
nsqd: fix full topic sync accumulation #993

Merged

Conversation

andyxning
Member

This PR avoids topic sync accumulation when an error occurs while trying to connect to nsqlookupd for register or unregister.

Currently, if nsqd cannot register/unregister a topic/channel, the lp is still sent to syncTopicChan. Because syncTopicChan is an unbuffered (blocking) channel, the goroutines spawned to write lp to syncTopicChan accumulate. Once the network to nsqlookupd recovers, the accumulated requests in syncTopicChan are consumed by nsqd and a full topic/channel registration is performed for each one. This causes a performance degradation on a busy nsqd instance.

The proper way to handle an IDENTIFY error is to just log it and return, instead of still writing lp to syncTopicChan.
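For illustration, here is a minimal, self-contained sketch of that pattern. The types and the identify() helper below are hypothetical stand-ins for nsqd's lookupPeer and the IDENTIFY round trip in nsqd/lookup.go, not the actual diff:

```go
package lookupsketch

import "log"

// lookupPeer is a hypothetical stand-in for nsqd's lookupPeer type.
type lookupPeer struct{ addr string }

// identify is a hypothetical stand-in for sending the IDENTIFY command.
func (lp *lookupPeer) identify() error { return nil }

// connectCallback fires whenever a peer (re)connects. The key change: on an
// IDENTIFY failure we log and return, rather than falling through and
// spawning a goroutine that blocks on syncTopicChan.
func connectCallback(syncTopicChan chan *lookupPeer) func(*lookupPeer) {
	return func(lp *lookupPeer) {
		if err := lp.identify(); err != nil {
			log.Printf("LOOKUPD(%s): IDENTIFY failed - %s", lp.addr, err)
			return // do not request a full topic sync for a broken peer
		}
		// Only a successfully identified peer asks lookupLoop to
		// re-register every topic/channel.
		go func() {
			syncTopicChan <- lp
		}()
	}
}
```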

/cc @mreiferson @jehiah @ploxiln

@andyxning andyxning force-pushed the fix_full_topic_sync_accumulation branch from 9c0bd91 to 67d70f8 on February 6, 2018 07:08
@ploxiln
Member

ploxiln commented Feb 6, 2018

That looks like a good fix. But I think the explanation isn't quite right.

lookupLoop() initially creates all the LookupPeers and, for each, calls Command(nil) just once. That starts the connection: Command() automatically connects if not currently connected. Whenever a connection succeeds, connectCallback() is called. That callback sends an Identify command, then writes to syncTopicChan, which prompts lookupLoop() to send Register commands.

The problem/fix is for when connecting seemed to succeed, but sending the Identify command failed - in that case, the LookupPeer would automatically mark itself as disconnected. But the callback would continue and write to syncTopicChan, so the lookupLoop() would try to send Register commands, and by calling lookupPeer.Command() would cause it to re-connect. That would trigger the connectCallback() again. There's the busy-loop.

So I'm skeptical that goroutines accumulated. I think the loop from Command() to connectCallback() to syncTopicChan, and back to Command() and connectCallback() again, is to blame for the CPU usage.

I also wonder if anything different should be done for E_INVALID response to Identify.

I also wonder if we should replace the goroutines which just wait to send on a channel with a non-blocking write to a channel that has enough capacity for one message per LookupPeer.
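Roughly, that alternative could look like the sketch below. It assumes nsqd's existing lookupPeer type and a syncTopicChan created with capacity len(lookupPeers); the trySync name is just for illustration:

```go
// trySync replaces the "go func() { syncTopicChan <- lp }()" pattern with a
// non-blocking send; syncTopicChan is assumed to be buffered with one slot
// per lookupd peer.
func trySync(syncTopicChan chan *lookupPeer, lp *lookupPeer) {
	select {
	case syncTopicChan <- lp:
		// queued; lookupLoop will pick this up and re-register all
		// topics/channels with the peer
	default:
		// buffer is full, so a full sync is already queued; dropping
		// the notification here is the point of the non-blocking write
	}
}
```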

@ploxiln
Member

ploxiln left a comment

All that said, adding return here looks correct.

@mreiferson
Member

Thanks @andyxning for the proposed fix and @ploxiln for the analysis.

I'd like to think through this before we land it, if y'all don't mind, but I need a few days to have some time to do so.

@mreiferson mreiferson added the bug label Feb 6, 2018
@mreiferson mreiferson changed the title fix full topic sync accumulation nsqd: fix full topic sync accumulation Feb 6, 2018
@andyxning
Member Author

@mreiferson That's OK with me. Hope you can merge this ASAP. :)

@andyxning
Member Author

Ping @mreiferson in case you forgot about this. :)

@mreiferson
Member

mreiferson commented Feb 19, 2018

This fix looks fine, but this suggestion from @ploxiln also seems useful:

I also wonder if we should replace the goroutines which just wait to send on a channel with a non-blocking write to a channel that has enough capacity for one message per LookupPeer.

In general this particular set of code is a mess and could probably use some refactoring. After re-reading it, I can't seem to figure out how on earth a failed connection gets cleaned up.

@andyxning
Member Author

@mreiferson

I can't seem to figure out how on earth a failed connection gets cleaned up.

IIUC, you mean that, with the suggestion from @ploxiln, you can't figure out how to stop failed connections from accumulating?

@andyxning
Member Author

Ping @mreiferson

@ploxiln
Member

ploxiln commented Feb 28, 2018

After re-reading it, I can't seem to figure out how on earth a failed connection gets cleaned up.

It appears that the lookupPeer manages its connection state on every Command().

When Command() is called while not connected, it first does Connect(). If Connect() fails, Command() returns the err, and that's the end of it until the next 15-second ticker tick, when lookupLoop() attempts to send another Ping command, causing the lookupPeer to try to reconnect. If Connect() fails again, that's again the end of it for that tick.
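A hypothetical, stripped-down rendering of that ticker path (it assumes nsqd's lookupPeer type and go-nsq's Ping command; the real code lives in nsqd's lookupLoop() and logs through nsqd's own logger):

```go
func pingLoop(lookupPeers []*lookupPeer, exitChan chan struct{}) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, lp := range lookupPeers {
				// Command() reconnects on demand; a failed connect
				// simply waits for the next tick to be retried.
				if _, err := lp.Command(nsq.Ping()); err != nil {
					log.Printf("LOOKUPD(%v): PING failed - %v", lp, err)
				}
			}
		case <-exitChan:
			return
		}
	}
}
```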

The lookupPeer does not get cleaned up until nsqd is reconfigured: lookupLoop() receives a message over optsNotificationChan, then calls Close() on all existing lookupPeers which are not in the new list, and then swaps in the new list.

This all looks like it works as desired. The problem which this PR fixes is when Connect() seems to succeed but then Identify command fails (and then an additional further command is sent anyway, causing another Connect(), which if it succeeds continues the loop).

@andyxning I think mreiferson was waiting to see if you were interested in doing some further small cleanups of this area, highlighted in the discussion.

@ploxiln
Member

ploxiln commented Feb 28, 2018

After re-reading it, I can't seem to figure out how on earth a failed connection gets cleaned up.

After looking again, maybe I see some good cause for confusion. As described above, the lookupPeer keeps track of whether it is connected or not, and tries to connect before sending a Command if it is not connected. But check this out: it checks for errors from cmd.WriteTo() and readResponseBounded(), and calls Close() to mark itself as closed. But it doesn't check for err from lp.Write(nsq.MagicV1):

nsq/nsqd/lookup_peer.go, lines 93 to 120 at e6debab:

```go
func (lp *lookupPeer) Command(cmd *nsq.Command) ([]byte, error) {
	initialState := lp.state
	if lp.state != stateConnected {
		err := lp.Connect()
		if err != nil {
			return nil, err
		}
		lp.state = stateConnected
		lp.Write(nsq.MagicV1)
		if initialState == stateDisconnected {
			lp.connectCallback(lp)
		}
	}
	if cmd == nil {
		return nil, nil
	}
	_, err := cmd.WriteTo(lp)
	if err != nil {
		lp.Close()
		return nil, err
	}
	resp, err := readResponseBounded(lp, lp.maxBodySize)
	if err != nil {
		lp.Close()
		return nil, err
	}
	return resp, nil
}
```
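For comparison, checking that write might look roughly like this; just a sketch of the connect branch shown above, not a proposed patch:

```go
	if lp.state != stateConnected {
		err := lp.Connect()
		if err != nil {
			return nil, err
		}
		lp.state = stateConnected
		if _, err := lp.Write(nsq.MagicV1); err != nil {
			lp.Close()
			return nil, err
		}
		if initialState == stateDisconnected {
			lp.connectCallback(lp)
		}
	}
```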

@andyxning
Member Author

andyxning commented Feb 28, 2018

This all looks like it works as desired. The problem which this PR fixes is when Connect() seems to succeed but then Identify command fails (and then an additional further command is sent anyway, causing another Connect(), which if it succeeds continues the loop).

@ploxiln That's exactly what happens: the Identify command to nsqlookupd fails.

I think mreiferson was waiting to see if you were interested in doing some further small cleanups of this area, highlighted in the discussion.

OK, but before that refactor, can we first review and merge this PR?

it checks for errors from cmd.WriteTo() and readResponseBounded(), and calls Close() to mark itself as closed. But it doesn't check for err from lp.Write(nsq.MagicV1)

That's also true. We should check the returned error. But I can't confirm that writing the magic string to nsqlookupd would also fail in this scenario. The better solution is to both check for the error and fail fast. Maybe we can do this in another PR.

@ploxiln @mreiferson WDYT?

@andyxning
Member Author

Ping @mreiferson :)

@mreiferson mreiferson merged commit 1587161 into nsqio:master Mar 3, 2018
@mreiferson
Member

Sorry for the delay @andyxning!

@andyxning andyxning deleted the fix_full_topic_sync_accumulation branch March 4, 2018 01:31