Add docs for troubleshooting network disconnects #112271

Merged
48 changes: 46 additions & 2 deletions docs/reference/modules/discovery/fault-detection.asciidoc
@@ -153,8 +153,8 @@ other problem.
{es} is designed to run on a fairly reliable network. It opens a number of TCP
connections between nodes and expects these connections to remain open forever.
If a connection is closed then {es} will try and reconnect, so the occasional
-blip should have limited impact on the cluster even if the affected node
-briefly leaves the cluster. In contrast, repeatedly-dropped connections will
+blip may fail some in-flight operations but should otherwise have limited
+impact on the cluster. In contrast, repeatedly-dropped connections will
severely affect its operation.

The connections from the elected master node to every other node in the cluster
@@ -301,3 +301,47 @@ To reconstruct the output, base64-decode the data and decompress it using
cat shardlock.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
----
//end::troubleshooting[]

[discrete]
===== Diagnosing other network disconnections

{es} is designed to run on a fairly reliable network. It opens a number of TCP
connections between nodes and expects these connections to remain open forever.
If a connection is closed then {es} will try and reconnect, so the occasional
blip may fail some in-flight operations but should otherwise have limited
impact on the cluster. In contrast, repeatedly-dropped connections will
severely affect its operation.

{es} nodes will only actively close their outbound connections to another node
if the other node leaves the cluster. See
<<cluster-fault-detection-troubleshooting>> for further information about
identifying and troubleshooting this situation. If an outbound connection
closes for some other reason, nodes will log a message such as the following:

[source,text]
----
[INFO ][o.e.t.ClusterConnectionManager] [node-1] transport connection to [{node-2}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] closed by remote
----
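
To see which connections drop most often, you can count these messages by
remote node. The following Python sketch is illustrative only. It assumes the
log lines follow the format of the example above, and the
`count_remote_closes` function, the pattern, and the sample lines are
hypothetical names and data rather than part of {es}.

[source,python]
----
import re
from collections import Counter

# Pattern derived from the single example message above (an assumption).
# It captures the remote node's name and transport address; adjust it if
# your log layout differs.
CLOSED_BY_REMOTE = re.compile(
    r"transport connection to "
    r"\[\{(?P<name>[^}]*)\}\{(?P<id>[^}]*)\}\{(?P<address>[^}]*)\}[^\]]*\]"
    r" closed by remote"
)


def count_remote_closes(lines):
    """Count 'closed by remote' messages per (node name, address)."""
    counts = Counter()
    for line in lines:
        match = CLOSED_BY_REMOTE.search(line)
        if match:
            counts[(match["name"], match["address"])] += 1
    return counts


# Hypothetical sample input; in practice, pass an open log file instead.
sample = [
    "[INFO ][o.e.t.ClusterConnectionManager] [node-1] transport connection to "
    "[{node-2}{g3cCUaMDQJmQ2ZLtjr-3dg}{10.0.0.1:9300}] closed by remote",
    "[INFO ][o.e.n.Node] [node-1] started",
]

for (name, address), n in count_remote_closes(sample).most_common():
    print(f"{name} {address}: {n}")
----

A remote node that appears far more often than the others is a reasonable
place to start investigating.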

Similarly, once a connection is fully established, a node never spontaneously
closes its inbound connections unless the node is shutting down.

Therefore if you see a node report that a connection to another node closed
unexpectedly, something other than {es} likely caused the connection to close.
A common cause is a misconfigured firewall with an improper timeout or another
policy that's <<long-lived-connections,incompatible with {es}>>. It could also
be caused by general connectivity issues, such as packet loss due to faulty
hardware or network congestion. If you're an advanced user, configure the
following loggers to get more detailed information about network exceptions:

[source,yaml]
----
logger.org.elasticsearch.transport.TcpTransport: DEBUG
logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG
----
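
Logger levels can also be changed on a running cluster through the cluster
update settings API. The following Python sketch builds such a request for the
two loggers above without sending it. The `build_logger_request` helper and the
`http://localhost:9200` address are assumptions made for illustration, and a
secured cluster would also need TLS and credentials.

[source,python]
----
import json
import urllib.request

# The two loggers named in the configuration above.
LOGGERS = [
    "org.elasticsearch.transport.TcpTransport",
    "org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport",
]


def build_logger_request(base_url, level="DEBUG"):
    """Build, but do not send, a cluster update settings request.

    Passing level=None produces JSON nulls, which reset the loggers to
    their default levels.
    """
    body = {"persistent": {f"logger.{name}": level for name in LOGGERS}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical address; replace it with one of your own nodes.
request = build_logger_request("http://localhost:9200")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To apply the change: urllib.request.urlopen(request)
----

Debug logging is verbose, so reset these loggers once you have captured the
information you need.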

If these logs do not show enough information to diagnose the problem, obtain a
packet capture simultaneously from the nodes at both ends of an unstable
connection and analyse it alongside the {es} logs from those nodes to determine
if traffic between the nodes is being disrupted by another device on the
network.