
Investigate nodes using stale list when connecting to proxies (discovery protocol). #2832

Closed

russjones opened this issue Jul 4, 2019 · 2 comments · Fixed by #2946

@russjones
Contributor

A customer reported they had two proxies, 10.10.10.1 and 10.10.10.2, behind an ELB, with 94 tunnel nodes connected to the cluster through those proxies.

The replacement procedure for the proxies was to first take down 10.10.10.1, which caused 10.10.10.3 to be spawned, and then take down 10.10.10.2, which caused 10.10.10.4 to be spawned.

When the replacement procedure was performed, 90 of the 94 nodes re-connected with no problems, but 4 were stuck in the following loop:

DEBU [PROXY:AGE] Changing state discovering -> discovering.
  target:example.com-00000000000000.elb.us-east-1.amazonaws.com:3024 
  reversetunnel/agent.go:198

INFO [PROXY:AGE] Connected to 10.10.10.3:3024
  target:example.com-00000000000000.elb.us-east-1.amazonaws.com:3024 
  reversetunnel/agent.go:413

Missed, connected to [ip-10-10-10-3-us-east-1-compute-internal] instead of
  ip-10-10-10-1-us-east-1-compute-internal,ip-10-10-10-2-us-east-1-compute-internal. 
  target:example.com-00000000000000.elb.us-east-1.amazonaws.com:3024 
  reversetunnel/agent.go:419

Note that the nodes were connecting to the correct (new) proxies, but the list they were validating those connections against was the old list of proxies.

What should have happened is the following: when 10.10.10.3 was added, 10.10.10.2 should have sent a discovery request to all nodes telling them that the list of proxies had been updated and that they should try to discover the new proxy. The same should have happened when 10.10.10.2 was replaced by 10.10.10.4. This appears to have worked for 90 of the 94 nodes, but for some reason not for the remaining 4.
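
For reference, here is a minimal sketch of the intended flow. The `discoveryRequest` and `node` types are hypothetical stand-ins, not the actual lib/reversetunnel API: a surviving proxy advertises the current proxy set to every connected node, and each node's agent pool spawns agents for any proxy it doesn't yet know about.

```go
// Minimal sketch of the intended discovery flow (hypothetical types,
// not the actual lib/reversetunnel API).
package main

import "fmt"

// discoveryRequest carries the proxies currently known to the sender.
type discoveryRequest struct {
	Proxies []string // e.g. ["10.10.10.2:3024", "10.10.10.3:3024"]
}

// node stands in for a tunnel node's agent pool.
type node struct {
	id    string
	known map[string]bool // proxies the pool already has agents for
}

// handleDiscovery merges the advertised set into the node's view and
// returns the proxies that still need an agent spawned for them.
func (n *node) handleDiscovery(req discoveryRequest) []string {
	var missing []string
	for _, p := range req.Proxies {
		if !n.known[p] {
			n.known[p] = true
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	n := &node{
		id:    "node-1",
		known: map[string]bool{"10.10.10.1:3024": true, "10.10.10.2:3024": true},
	}
	// 10.10.10.1 was replaced by 10.10.10.3; 10.10.10.2 advertises the new set.
	req := discoveryRequest{Proxies: []string{"10.10.10.2:3024", "10.10.10.3:3024"}}
	fmt.Println("spawn agents for:", n.handleDiscovery(req)) // [10.10.10.3:3024]
}
```

In the failure above, the 4 stuck nodes behaved as if this update never reached them (or was ignored), so the set they validated against remained the old one.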

We should investigate and find out why the discovery protocol is failing in this manner.

A temporary workaround for this issue is to spawn the new proxies while the old proxies are still available, wait around 15 minutes for the forced discovery message, and then tear the old proxies down.

@klizhentas
Contributor

@fspmarshall this is the distributed proxies issue we've been talking about. You can take care of it after backup/restore is done.

@fspmarshall
Contributor

Writing this mostly as a reminder to my future self...

Skimmed the lib/reversetunnel discovery protocol implementation. I still don't have a complete picture, but a few potential issues jumped out at me:

First, and probably more minor: agents simply halt if they fail to connect to their target proxies; the rationale put forth in the comments is that another agent must be handling that proxy, but this is never confirmed. An easy way to improve resilience could be to have the agent propagate the discovery of the proxy it actually reached back to the pool, in case the pool doesn't already know about it. Something along the lines of the sketch below.
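
Rough sketch of that idea (hypothetical `pool` type and `reportUnexpected` method, not the real reversetunnel code): instead of halting when it lands on a proxy outside its target set, the agent reports that proxy back to the pool, which decides whether it is actually unknown and needs an agent.

```go
// Sketch of agents propagating discovery of an unexpected proxy back to
// the pool (hypothetical names, not the existing agent pool API).
package main

import "fmt"

type pool struct {
	tracked map[string]bool // proxies the pool believes are being handled
}

// reportUnexpected is called by an agent that connected to proxyID while
// looking for a different proxy. It returns true if the proxy was unknown
// and the pool should start discovery for it instead of silently halting.
func (p *pool) reportUnexpected(proxyID string) bool {
	if p.tracked[proxyID] {
		return false // some other agent really is handling it
	}
	p.tracked[proxyID] = true
	return true
}

func main() {
	p := &pool{tracked: map[string]bool{"ip-10-10-10-1": true, "ip-10-10-10-2": true}}
	// An agent aimed at ip-10-10-10-1 landed on ip-10-10-10-3 via the ELB.
	if p.reportUnexpected("ip-10-10-10-3") {
		fmt.Println("ip-10-10-10-3 was not tracked; spawning a new agent for it")
	}
}
```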

Second, and more importantly, an incoming set of proxies (the peerRequest message) causes the agent pool to halt all agents attempting to discover proxies not in that set (here). This appears to be a cleanup mechanism, but nothing protects the system from ordering issues: if sets were ever to arrive out of order, an old state would forcibly override a newer state, preventing discovery. Whether or not this is the cause of the above issue, it's still something that should be addressed. Two potential solutions come to mind: either A) accept that messages may arrive out of order and treat all recently seen sets as potentially valid states regardless of order, or B) use the fact that the admin server's state is a strong consensus point to encode some kind of clock value. Option A is simpler; a rough sketch of it is below.
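
Sketch of option A (hypothetical `proxyTracker` helper, not the existing agent pool code): track a last-seen time per proxy and only expire proxies that haven't been advertised for a grace period, so a late or out-of-order set cannot erase knowledge of a newer proxy.

```go
// Sketch of option A: treat all recently seen proxy sets as potentially
// valid instead of letting each incoming set overwrite the previous one.
package main

import (
	"fmt"
	"time"
)

type proxyTracker struct {
	lastSeen map[string]time.Time
	ttl      time.Duration
}

func newProxyTracker(ttl time.Duration) *proxyTracker {
	return &proxyTracker{lastSeen: make(map[string]time.Time), ttl: ttl}
}

// observe records an advertised set without discarding recently seen proxies.
func (t *proxyTracker) observe(proxies []string, now time.Time) {
	for _, p := range proxies {
		t.lastSeen[p] = now
	}
}

// active returns the proxies advertised within the TTL window; agents
// should only be halted for proxies that have aged out of this set.
func (t *proxyTracker) active(now time.Time) []string {
	var out []string
	for p, seen := range t.lastSeen {
		if now.Sub(seen) <= t.ttl {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	t := newProxyTracker(5 * time.Minute)
	now := time.Now()
	t.observe([]string{"10.10.10.2", "10.10.10.3"}, now)                   // newer set arrives first
	t.observe([]string{"10.10.10.1", "10.10.10.2"}, now.Add(time.Second))  // stale set arrives late
	// 10.10.10.3 is still considered valid; the late stale set did not erase it.
	fmt.Println(t.active(now.Add(2 * time.Second)))
}
```

The tradeoff is that proxies from a genuinely stale set linger until the TTL expires, but that only delays cleanup; it never blocks discovery of a new proxy the way the current overwrite can.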
