From e26b53f2816953d66fc853a53829bd6491bd609c Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 28 Aug 2024 09:02:24 +0100 Subject: [PATCH 1/4] Add docs for troubleshooting network disconnects Basically the same as for nodes that leave the cluster with reason `disconnected`, except that these disconnects don't involve the master so don't cause any nodes to leave the cluster. --- .../discovery/fault-detection.asciidoc | 48 ++++++++++++++++++- 1 file changed, 46 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc index 383e4c6044c67..a82e2468ce4a4 100644 --- a/docs/reference/modules/discovery/fault-detection.asciidoc +++ b/docs/reference/modules/discovery/fault-detection.asciidoc @@ -153,8 +153,8 @@ other problem. {es} is designed to run on a fairly reliable network. It opens a number of TCP connections between nodes and expects these connections to remain open forever. If a connection is closed then {es} will try and reconnect, so the occasional -blip should have limited impact on the cluster even if the affected node -briefly leaves the cluster. In contrast, repeatedly-dropped connections will +blip may fail some in-flight operations but should otherwise have limited +impact on the cluster. In contrast, repeatedly-dropped connections will severely affect its operation. The connections from the elected master node to every other node in the cluster @@ -301,3 +301,47 @@ To reconstruct the output, base64-decode the data and decompress it using cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress ---- //end::troubleshooting[] + +[discrete] +===== Diagnosing other network disconnections + +{es} is designed to run on a fairly reliable network. It opens a number of TCP +connections between nodes and expects these connections to remain open forever. 
+If a connection is closed then {es} will try and reconnect, so the occasional +blip may fail some in-flight operations but should otherwise have limited +impact on the cluster. In contrast, repeatedly-dropped connections will +severely affect its operation. + +{es} nodes will only actively close their outbound connections to another node +if the other node leaves the cluster. See +<> for further information about +identifying and troubleshooting this situation. If an outbound connection +closes for some other reason, nodes will log a message such as the following: + +[source,text] +---- +[INFO ][o.e.t.ClusterConnectionManager] [node-1] transport connection to [{node-2}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] closed by remote +---- + +Similarly, once a connection is fully established, a node never spontaneously +close its inbound connections unless the node is shutting down. + +Therefore if you see a node report that a connection to another node closed +unexpectedly, something other than {es} likely caused the connection to close. +A common cause is a misconfigured firewall with an improper timeout or another +policy that's <>. It could also +be caused by general connectivity issues, such as packet loss due to faulty +hardware or network congestion. If you're an advanced user, configure the +following loggers to get more detailed information about network exceptions: + +[source,yaml] +---- +logger.org.elasticsearch.transport.TcpTransport: DEBUG +logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG +---- + +If these logs do not show enough information to diagnose the problem, obtain a +packet capture simultaneously from the nodes at both ends of an unstable +connection and analyse it alongside the {es} logs from those nodes to determine +if traffic between the nodes is being disrupted by another device on the +network. 
From 6a53e014721ba770ec2b63372abb056e7b43f30c Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 28 Aug 2024 09:30:29 +0100 Subject: [PATCH 2/4] Grammar fix --- docs/reference/modules/discovery/fault-detection.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc index a82e2468ce4a4..a79158867f422 100644 --- a/docs/reference/modules/discovery/fault-detection.asciidoc +++ b/docs/reference/modules/discovery/fault-detection.asciidoc @@ -160,7 +160,7 @@ severely affect its operation. The connections from the elected master node to every other node in the cluster are particularly important. The elected master never spontaneously closes its outbound connections to other nodes. Similarly, once a connection is fully -established, a node never spontaneously close its inbound connections unless +established, a node never spontaneously closes its inbound connections unless the node is shutting down. If you see a node unexpectedly leave the cluster with the `disconnected` @@ -324,7 +324,7 @@ closes for some other reason, nodes will log a message such as the following: ---- Similarly, once a connection is fully established, a node never spontaneously -close its inbound connections unless the node is shutting down. +closes its inbound connections unless the node is shutting down. Therefore if you see a node report that a connection to another node closed unexpectedly, something other than {es} likely caused the connection to close. 
From 7a2b0f426197da736ab9c2f2a8e095b2d7ba166c Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 28 Aug 2024 09:34:27 +0100 Subject: [PATCH 3/4] Reword --- .../modules/discovery/fault-detection.asciidoc | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc index a79158867f422..cc17e96aa080f 100644 --- a/docs/reference/modules/discovery/fault-detection.asciidoc +++ b/docs/reference/modules/discovery/fault-detection.asciidoc @@ -159,9 +159,9 @@ severely affect its operation. The connections from the elected master node to every other node in the cluster are particularly important. The elected master never spontaneously closes its -outbound connections to other nodes. Similarly, once a connection is fully -established, a node never spontaneously closes its inbound connections unless -the node is shutting down. +outbound connections to other nodes. Similarly, once an inbound connection is +fully established, a node never spontaneously closes it unless the node is shutting +down. If you see a node unexpectedly leave the cluster with the `disconnected` reason, something other than {es} likely caused the connection to close. A @@ -312,8 +312,8 @@ blip may fail some in-flight operations but should otherwise have limited impact on the cluster. In contrast, repeatedly-dropped connections will severely affect its operation. -{es} nodes will only actively close their outbound connections to another node -if the other node leaves the cluster. See +{es} nodes will only actively close an outbound connection to another node if +the other node leaves the cluster. See <> for further information about identifying and troubleshooting this situation. 
If an outbound connection closes for some other reason, nodes will log a message such as the following: @@ -323,8 +323,8 @@ closes for some other reason, nodes will log a message such as the following: [INFO ][o.e.t.ClusterConnectionManager] [node-1] transport connection to [{node-2}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] closed by remote ---- -Similarly, once a connection is fully established, a node never spontaneously -closes its inbound connections unless the node is shutting down. +Similarly, once an inbound connection is fully established, a node never +spontaneously closes it unless the node is shutting down. Therefore if you see a node report that a connection to another node closed unexpectedly, something other than {es} likely caused the connection to close. From 1f204043e7597faca1e3ec762ed44251f40ad987 Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 28 Aug 2024 09:35:12 +0100 Subject: [PATCH 4/4] link "forever" --- .../discovery/fault-detection.asciidoc | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc index cc17e96aa080f..89c8a78eccbc6 100644 --- a/docs/reference/modules/discovery/fault-detection.asciidoc +++ b/docs/reference/modules/discovery/fault-detection.asciidoc @@ -151,11 +151,11 @@ down, but if they rejoin the cluster without restarting then there is some other problem. {es} is designed to run on a fairly reliable network. It opens a number of TCP -connections between nodes and expects these connections to remain open forever. -If a connection is closed then {es} will try and reconnect, so the occasional -blip may fail some in-flight operations but should otherwise have limited -impact on the cluster. In contrast, repeatedly-dropped connections will -severely affect its operation. +connections between nodes and expects these connections to remain open +<>. 
If a connection is closed then {es} will +try and reconnect, so the occasional blip may fail some in-flight operations +but should otherwise have limited impact on the cluster. In contrast, +repeatedly-dropped connections will severely affect its operation. The connections from the elected master node to every other node in the cluster are particularly important. The elected master never spontaneously closes its @@ -306,11 +306,11 @@ cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress ===== Diagnosing other network disconnections {es} is designed to run on a fairly reliable network. It opens a number of TCP -connections between nodes and expects these connections to remain open forever. -If a connection is closed then {es} will try and reconnect, so the occasional -blip may fail some in-flight operations but should otherwise have limited -impact on the cluster. In contrast, repeatedly-dropped connections will -severely affect its operation. +connections between nodes and expects these connections to remain open +<>. If a connection is closed then {es} will +try and reconnect, so the occasional blip may fail some in-flight operations +but should otherwise have limited impact on the cluster. In contrast, +repeatedly-dropped connections will severely affect its operation. {es} nodes will only actively close an outbound connection to another node if the other node leaves the cluster. See