
nsqd: fix full topic sync accumulation #993

Merged

Conversation

andyxning
Member

This PR avoids topic sync accumulation when an error occurs while trying to connect to nsqlookupd for register or unregister.

Currently, if nsqd cannot register/unregister a topic/channel, the lp is still sent to syncTopicChan. Because syncTopicChan is an unbuffered (blocking) channel, the goroutines spawned to write lp to syncTopicChan accumulate. Once the network to nsqlookupd recovers, the accumulated requests in syncTopicChan are consumed by nsqd and a full topic/channel registration is performed for each one. This causes a performance degradation on a busy nsqd instance.

The proper way to handle an IDENTIFY error is to just log it and return, instead of still writing lp to syncTopicChan.
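For illustration, here is a minimal, self-contained sketch of that pattern. The types and the identify() helper below are hypothetical stand-ins for nsqd's lookupPeer and the IDENTIFY round trip in nsqd/lookup.go, not the actual diff:

```go
package lookupsketch

import "log"

// lookupPeer is a hypothetical stand-in for nsqd's lookupPeer type.
type lookupPeer struct{ addr string }

// identify is a hypothetical stand-in for sending the IDENTIFY command.
func (lp *lookupPeer) identify() error { return nil }

// connectCallback fires whenever a peer (re)connects. The key change: on an
// IDENTIFY failure we log and return, rather than falling through and
// spawning a goroutine that blocks on syncTopicChan.
func connectCallback(syncTopicChan chan *lookupPeer) func(*lookupPeer) {
	return func(lp *lookupPeer) {
		if err := lp.identify(); err != nil {
			log.Printf("LOOKUPD(%s): IDENTIFY failed - %s", lp.addr, err)
			return // do not request a full topic sync for a broken peer
		}
		// Only a successfully identified peer asks lookupLoop to
		// re-register every topic/channel.
		go func() {
			syncTopicChan <- lp
		}()
	}
}
```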

/cc @mreiferson @jehiah @ploxiln

@andyxning andyxning force-pushed the fix_full_topic_sync_accumulation branch from 9c0bd91 to 67d70f8 on February 6, 2018 07:08
@ploxiln
Member

ploxiln commented Feb 6, 2018

That looks like a good fix. But I think the explanation isn't quite right.

lookupLoop() initially creates all the LookupPeers and, for each, calls Command(nil) just once. That starts the connection: Command() automatically connects if not currently connected. Whenever a connection succeeds, connectCallback() is called. That callback sends an Identify command, then writes to syncTopicChan, which prompts lookupLoop() to send Register commands.

The problem/fix is for when connecting seemed to succeed, but sending the Identify command failed - in that case, the LookupPeer would automatically mark itself as disconnected. But the callback would continue and write to syncTopicChan, so the lookupLoop() would try to send Register commands, and by calling lookupPeer.Command() would cause it to re-connect. That would trigger the connectCallback() again. There's the busy-loop.

So I'm skeptical that goroutines accumulated. I think the loop from Command() to connectCallback() to syncTopicChan, and back to Command() and connectCallback() again, is to blame for the CPU usage.

I also wonder if anything different should be done for E_INVALID response to Identify.

I also wonder if we should replace the goroutines which just wait to send on a channel with a non-blocking write to a channel that has enough capacity for one message per LookupPeer.
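Roughly, that alternative could look like the sketch below. It assumes nsqd's existing lookupPeer type and a syncTopicChan created with capacity len(lookupPeers); the trySync name is just for illustration:

```go
// trySync replaces the "go func() { syncTopicChan <- lp }()" pattern with a
// non-blocking send; syncTopicChan is assumed to be buffered with one slot
// per lookupd peer.
func trySync(syncTopicChan chan *lookupPeer, lp *lookupPeer) {
	select {
	case syncTopicChan <- lp:
		// queued; lookupLoop will pick this up and re-register all
		// topics/channels with the peer
	default:
		// buffer is full, so a full sync is already queued; dropping
		// the notification here is the point of the non-blocking write
	}
}
```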

@ploxiln
Member

ploxiln left a comment

All that said, adding return here looks correct.

@mreiferson
Member

Thanks @andyxning for the proposed fix and @ploxiln for the analysis.

I'd like to think through this before we land it, if y'all don't mind, but I need a few days to have some time to do so.

@mreiferson mreiferson added the bug label Feb 6, 2018
@mreiferson mreiferson changed the title fix full topic sync accumulation nsqd: fix full topic sync accumulation Feb 6, 2018
@andyxning
Member Author

@mreiferson That's OK with me. Hope you can merge this ASAP. :)

@andyxning
Member Author

Ping @mreiferson in case you forgot about this. :)

@mreiferson
Member

mreiferson commented Feb 19, 2018

This fix looks fine, but this suggestion from @ploxiln also seems useful:

I also wonder if we should replace the goroutines which just wait to send on a channel with a non-blocking write to a channel that has enough capacity for one message per LookupPeer.

In general this particular set of code is a mess and could probably use some refactoring. After re-reading it, I can't seem to figure out how on earth a failed connection gets cleaned up.

@andyxning
Member Author

@mreiferson

I can't seem to figure out how on earth a failed connection gets cleaned up.

IIUC, you mean that, with the suggestion from @ploxiln, you can't figure out how to stop failed connections from accumulating?

@andyxning
Member Author

Ping @mreiferson

@ploxiln
Member

ploxiln commented Feb 28, 2018

After re-reading it, I can't seem to figure out how on earth a failed connection gets cleaned up.

It appears that the lookupPeer manages its connection state on every Command().

When Command() is called while not connected, it first does Connect(). If Connect() fails, Command() returns the err, and that's the end of it until the next 15-second ticker tick, when lookupLoop() attempts to send another Ping command, causing the lookupPeer to try to reconnect. If Connect() fails again, that's again the end of it for that tick.
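A hypothetical, stripped-down rendering of that ticker path (it assumes nsqd's lookupPeer type and go-nsq's Ping command; the real code lives in nsqd's lookupLoop() and logs through nsqd's own logger):

```go
func pingLoop(lookupPeers []*lookupPeer, exitChan chan struct{}) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, lp := range lookupPeers {
				// Command() reconnects on demand; a failed connect
				// simply waits for the next tick to be retried.
				if _, err := lp.Command(nsq.Ping()); err != nil {
					log.Printf("LOOKUPD(%v): PING failed - %v", lp, err)
				}
			}
		case <-exitChan:
			return
		}
	}
}
```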

The lookupPeer does not get cleaned up until nsqd is reconfigured: lookupLoop() receives a message over optsNotificationChan, then calls Close() on all existing lookupPeers which are not in the new list, and then swaps in the new list.

This all looks like it works as desired. The problem which this PR fixes is when Connect() seems to succeed but then Identify command fails (and then an additional further command is sent anyway, causing another Connect(), which if it succeeds continues the loop).

@andyxning I think mreiferson was waiting to see if you were interested in doing some further small cleanups of this area, highlighted in the discussion.

@ploxiln
Member

ploxiln commented Feb 28, 2018

After re-reading it, I can't seem to figure out how on earth a failed connection gets cleaned up.

After looking again, maybe I see some good cause for confusion. As described above, the lookupPeer keeps track of whether it is connected or not, and tries to connect before sending a Command if it is not connected. But check this out: it checks for errors from cmd.WriteTo() and readResponseBounded(), and calls Close() to mark itself as closed. But it doesn't check for err from lp.Write(nsq.MagicV1):

nsq/nsqd/lookup_peer.go, lines 93 to 120 at e6debab:

```go
func (lp *lookupPeer) Command(cmd *nsq.Command) ([]byte, error) {
	initialState := lp.state
	if lp.state != stateConnected {
		err := lp.Connect()
		if err != nil {
			return nil, err
		}
		lp.state = stateConnected
		lp.Write(nsq.MagicV1)
		if initialState == stateDisconnected {
			lp.connectCallback(lp)
		}
	}
	if cmd == nil {
		return nil, nil
	}
	_, err := cmd.WriteTo(lp)
	if err != nil {
		lp.Close()
		return nil, err
	}
	resp, err := readResponseBounded(lp, lp.maxBodySize)
	if err != nil {
		lp.Close()
		return nil, err
	}
	return resp, nil
}
```
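For comparison, checking that write might look roughly like this; just a sketch of the connect branch shown above, not a proposed patch:

```go
	if lp.state != stateConnected {
		err := lp.Connect()
		if err != nil {
			return nil, err
		}
		lp.state = stateConnected
		if _, err := lp.Write(nsq.MagicV1); err != nil {
			lp.Close()
			return nil, err
		}
		if initialState == stateDisconnected {
			lp.connectCallback(lp)
		}
	}
```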

@andyxning
Member Author

andyxning commented Feb 28, 2018

This all looks like it works as desired. The problem which this PR fixes is when Connect() seems to succeed but then Identify command fails (and then an additional further command is sent anyway, causing another Connect(), which if it succeeds continues the loop).

@ploxiln That's exactly what happens: the Identify command to nsqlookupd fails.

I think mreiferson was waiting to see if you were interested in doing some further small cleanups of this area, highlighted in the discussion.

OK, but before that refactor, can we first review and merge this PR?

it checks for errors from cmd.WriteTo() and readResponseBounded(), and calls Close() to mark itself as closed. But it doesn't check for err from lp.Write(nsq.MagicV1)

That's also true. We should check the returned error. But I can't confirm that writing the magic string to nsqlookupd would also fail in this scenario. The better solution is to both check for the error and fail fast. Maybe we can do this in another PR.

@ploxiln @mreiferson WDYT?

@andyxning
Member Author

Ping @mreiferson :)

@mreiferson mreiferson merged commit 1587161 into nsqio:master Mar 3, 2018
@mreiferson
Member

Sorry for the delay @andyxning!

@andyxning andyxning deleted the fix_full_topic_sync_accumulation branch March 4, 2018 01:31