
Investigate nodes using stale list when connecting to proxies (discovery protocol). #2832

Closed

russjones opened this issue Jul 4, 2019 · 2 comments · Fixed by #2946

@russjones
Contributor

A customer reported they had two proxies, 10.10.10.1 and 10.10.10.2, behind an ELB, with 94 tunnel nodes connected to the cluster through those proxies.

The replacement procedure for the proxies was to first take down 10.10.10.1, which caused 10.10.10.3 to be spawned, and then take down 10.10.10.2, which caused 10.10.10.4 to be spawned.

When the replacement procedure was performed, 90 of the 94 nodes re-connected with no problems, but 4 were stuck in the following loop:

DEBU [PROXY:AGE] Changing state discovering -> discovering.
  target:example.com-00000000000000.elb.us-east-1.amazonaws.com:3024 
  reversetunnel/agent.go:198

INFO [PROXY:AGE] Connected to 10.10.10.3:3024
  target:example.com-00000000000000.elb.us-east-1.amazonaws.com:3024 
  reversetunnel/agent.go:413

Missed, connected to [ip-10-10-10-3-us-east-1-compute-internal] instead of
  ip-10-10-10-1-us-east-1-compute-internal,ip-10-10-10-2-us-east-1-compute-internal. 
  target:example.com-00000000000000.elb.us-east-1.amazonaws.com:3024 
  reversetunnel/agent.go:419

Note that the nodes were connecting to the correct (new) proxies, but the list they were validating those connections against was the old list of proxies.

What should have happened is the following: when 10.10.10.3 was added, 10.10.10.2 should have sent a discovery request to all nodes telling them that the list of proxies had been updated and that they should try to discover the new proxy. The same should have happened when 10.10.10.2 was replaced by 10.10.10.4. This appears to have worked for 90 of the 94 nodes, but for some reason not for the remaining 4.
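
For reference, here is a minimal sketch of the intended flow. The `discoveryRequest` and `node` types are hypothetical stand-ins, not the actual lib/reversetunnel API: a surviving proxy advertises the current proxy set to every connected node, and each node's agent pool spawns agents for any proxy it doesn't yet know about.

```go
// Minimal sketch of the intended discovery flow (hypothetical types,
// not the actual lib/reversetunnel API).
package main

import "fmt"

// discoveryRequest carries the proxies currently known to the sender.
type discoveryRequest struct {
	Proxies []string // e.g. ["10.10.10.2:3024", "10.10.10.3:3024"]
}

// node stands in for a tunnel node's agent pool.
type node struct {
	id    string
	known map[string]bool // proxies the pool already has agents for
}

// handleDiscovery merges the advertised set into the node's view and
// returns the proxies that still need an agent spawned for them.
func (n *node) handleDiscovery(req discoveryRequest) []string {
	var missing []string
	for _, p := range req.Proxies {
		if !n.known[p] {
			n.known[p] = true
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	n := &node{
		id:    "node-1",
		known: map[string]bool{"10.10.10.1:3024": true, "10.10.10.2:3024": true},
	}
	// 10.10.10.1 was replaced by 10.10.10.3; 10.10.10.2 advertises the new set.
	req := discoveryRequest{Proxies: []string{"10.10.10.2:3024", "10.10.10.3:3024"}}
	fmt.Println("spawn agents for:", n.handleDiscovery(req)) // [10.10.10.3:3024]
}
```

In the failure above, the 4 stuck nodes behaved as if this update never reached them (or was ignored), so the set they validated against remained the old one.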

We should investigate and find out why the discovery protocol is failing in this manner.

A temporary workaround for this issue is to spawn the new proxies while the old proxies are still available, wait around 15 minutes for the forced discovery message, and then tear the old proxies down.

@klizhentas
Contributor

@fspmarshall this is the distributed proxies issue we've been talking about. You can take care of it after backup/restore is done.

@fspmarshall
Contributor

Writing this mostly as a reminder to my future self...

Skimmed the lib/reversetunnel discovery protocol implementation. I still don't have a complete picture, but a few potential issues jumped out at me:

First, and probably more minor: agents simply halt if they fail to connect to their target proxies; the rationale put forth in the comments is that another agent must be handling that proxy, but this is never confirmed. An easy way to improve resilience could be to have the agent propagate the discovery of the proxy it actually reached back to the pool, in case the pool doesn't already know about it. Something along the lines of the sketch below.
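
Rough sketch of that idea (hypothetical `pool` type and `reportUnexpected` method, not the real reversetunnel code): instead of halting when it lands on a proxy outside its target set, the agent reports that proxy back to the pool, which decides whether it is actually unknown and needs an agent.

```go
// Sketch of agents propagating discovery of an unexpected proxy back to
// the pool (hypothetical names, not the existing agent pool API).
package main

import "fmt"

type pool struct {
	tracked map[string]bool // proxies the pool believes are being handled
}

// reportUnexpected is called by an agent that connected to proxyID while
// looking for a different proxy. It returns true if the proxy was unknown
// and the pool should start discovery for it instead of silently halting.
func (p *pool) reportUnexpected(proxyID string) bool {
	if p.tracked[proxyID] {
		return false // some other agent really is handling it
	}
	p.tracked[proxyID] = true
	return true
}

func main() {
	p := &pool{tracked: map[string]bool{"ip-10-10-10-1": true, "ip-10-10-10-2": true}}
	// An agent aimed at ip-10-10-10-1 landed on ip-10-10-10-3 via the ELB.
	if p.reportUnexpected("ip-10-10-10-3") {
		fmt.Println("ip-10-10-10-3 was not tracked; spawning a new agent for it")
	}
}
```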

Second, and more importantly, an incoming set of proxies (the peerRequest message) causes the agent pool to halt all agents attempting to discover proxies not in that set (here). This appears to be a cleanup mechanism, but nothing protects the system from ordering issues: if sets were ever to arrive out of order, an old state would forcibly override a newer state, preventing discovery. Whether or not this is the cause of the above issue, it's still something that should be addressed. Two potential solutions come to mind: either A) accept that messages may arrive out of order and treat all recently seen sets as potentially valid states regardless of order, or B) use the fact that the admin server's state is a strong consensus point to encode some kind of clock value. Option A is simpler; a rough sketch of it is below.
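
Sketch of option A (hypothetical `proxyTracker` helper, not the existing agent pool code): track a last-seen time per proxy and only expire proxies that haven't been advertised for a grace period, so a late or out-of-order set cannot erase knowledge of a newer proxy.

```go
// Sketch of option A: treat all recently seen proxy sets as potentially
// valid instead of letting each incoming set overwrite the previous one.
package main

import (
	"fmt"
	"time"
)

type proxyTracker struct {
	lastSeen map[string]time.Time
	ttl      time.Duration
}

func newProxyTracker(ttl time.Duration) *proxyTracker {
	return &proxyTracker{lastSeen: make(map[string]time.Time), ttl: ttl}
}

// observe records an advertised set without discarding recently seen proxies.
func (t *proxyTracker) observe(proxies []string, now time.Time) {
	for _, p := range proxies {
		t.lastSeen[p] = now
	}
}

// active returns the proxies advertised within the TTL window; agents
// should only be halted for proxies that have aged out of this set.
func (t *proxyTracker) active(now time.Time) []string {
	var out []string
	for p, seen := range t.lastSeen {
		if now.Sub(seen) <= t.ttl {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	t := newProxyTracker(5 * time.Minute)
	now := time.Now()
	t.observe([]string{"10.10.10.2", "10.10.10.3"}, now)                   // newer set arrives first
	t.observe([]string{"10.10.10.1", "10.10.10.2"}, now.Add(time.Second))  // stale set arrives late
	// 10.10.10.3 is still considered valid; the late stale set did not erase it.
	fmt.Println(t.active(now.Add(2 * time.Second)))
}
```

The tradeoff is that proxies from a genuinely stale set linger until the TTL expires, but that only delays cleanup; it never blocks discovery of a new proxy the way the current overwrite can.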
