Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not attempt to re-establish connections when applying a cluster state update #29025

Closed
DaveCTurner opened this issue Mar 13, 2018 · 1 comment · Fixed by #39629
Closed
Assignees
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v6.4.1

Comments

@DaveCTurner
Copy link
Contributor

Today, when a new cluster state is applied, we attempt to connect to all nodes in the new cluster state:

nodeConnectionsService.connectToNodes(newClusterState.nodes());

This set of nodes may include some nodes that were previously connected but which have failed, but whose failure is to be processed in a later cluster state update. Attempts to reconnect to these nodes will also fail, but may do so very slowly if they are unresponsive. On the other hand, the NodesConnectionService takes responsibility for periodically re-establishing connections that have dropped.

This means that there's no real need to attempt to re-establish these connections while applying a cluster state update, and #28920 is an example of a situation where it's undesirable to do so. Therefore it seems sensible to skip these nodes.

NB it's important that the nodes in the cluster state align with the nodes known to the NodeConnectionsService, so they can't simply be omitted in the call to connectToNodes() above.

@DaveCTurner DaveCTurner added help wanted adoptme :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.3.0 labels Mar 13, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 24, 2018
Today, when a new cluster state is committed we attempt to connect to all of
its nodes as part of the application process. This is the right thing to do
with new nodes, and is a no-op on any already-connected nodes, but is
questionable on known nodes from which we are currently disconnected: there is
a risk that we are partitioned from these nodes so that any attempt to connect
to them will hang until it times out. This can dramatically slow down the
application of new cluster states which hinders the recovery of the cluster
during certain kinds of partition.

If nodes are disconnected from the master then it is likely that they are to be
removed as part of a subsequent cluster state update, so there's no need to try
and reconnect to them like this. Moreover there is no need to attempt to
reconnect to disconnected nodes as part of the cluster state application
process, because we periodically try and reconnect to any disconnected nodes,
and handle their disconnectedness gracefully in the meantime.

This commit alters this behaviour to avoid reconnecting to known nodes during
cluster state application.

Resolves elastic#29025.
@lcawl lcawl added v6.4.1 and removed v6.4.0 labels Aug 23, 2018
@DaveCTurner DaveCTurner added :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. and removed :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Dec 19, 2018
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Mar 4, 2019
Today, when applying new cluster state we attempt to connect to all of its
nodes as a blocking part of the application process. This is the right thing to
do with new nodes, and is a no-op on any already-connected nodes, but is
questionable on known nodes from which we are currently disconnected: there is
a risk that we are partitioned from these nodes so that any attempt to connect
to them will hang until it times out. This can dramatically slow down the
application of new cluster states which hinders the recovery of the cluster
during certain kinds of partition.

If nodes are disconnected from the master then it is likely that they are to be
removed as part of a subsequent cluster state update, so there's no need to try
and reconnect to them like this. Moreover there is no need to attempt to
reconnect to disconnected nodes as part of the cluster state application
process, because we periodically try and reconnect to any disconnected nodes,
and handle their disconnectedness reasonably gracefully in the meantime.

This commit alters this behaviour to avoid reconnecting to known nodes during
cluster state application.

Resolves elastic#29025.
Supersedes elastic#31547.
@DaveCTurner DaveCTurner self-assigned this Mar 12, 2019
@DaveCTurner DaveCTurner removed the help wanted adoptme label Mar 12, 2019
DaveCTurner added a commit that referenced this issue Mar 12, 2019
Today, when applying new cluster state we attempt to connect to all of its
nodes as a blocking part of the application process. This is the right thing to
do with new nodes, and is a no-op on any already-connected nodes, but is
questionable on known nodes from which we are currently disconnected: there is
a risk that we are partitioned from these nodes so that any attempt to connect
to them will hang until it times out. This can dramatically slow down the
application of new cluster states which hinders the recovery of the cluster
during certain kinds of partition.

If nodes are disconnected from the master then it is likely that they are to be
removed as part of a subsequent cluster state update, so there's no need to try
and reconnect to them like this. Moreover there is no need to attempt to
reconnect to disconnected nodes as part of the cluster state application
process, because we periodically try and reconnect to any disconnected nodes,
and handle their disconnectedness reasonably gracefully in the meantime.

This commit alters this behaviour to avoid reconnecting to known nodes during
cluster state application.

Resolves #29025.
DaveCTurner added a commit that referenced this issue Mar 12, 2019
Today, when applying new cluster state we attempt to connect to all of its
nodes as a blocking part of the application process. This is the right thing to
do with new nodes, and is a no-op on any already-connected nodes, but is
questionable on known nodes from which we are currently disconnected: there is
a risk that we are partitioned from these nodes so that any attempt to connect
to them will hang until it times out. This can dramatically slow down the
application of new cluster states which hinders the recovery of the cluster
during certain kinds of partition.

If nodes are disconnected from the master then it is likely that they are to be
removed as part of a subsequent cluster state update, so there's no need to try
and reconnect to them like this. Moreover there is no need to attempt to
reconnect to disconnected nodes as part of the cluster state application
process, because we periodically try and reconnect to any disconnected nodes,
and handle their disconnectedness reasonably gracefully in the meantime.

This commit alters this behaviour to avoid reconnecting to known nodes during
cluster state application.

Resolves #29025.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v6.4.1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants