Improve handling of readonly filesystems #45286

Closed
3 tasks done
DaveCTurner opened this issue Aug 7, 2019 · 9 comments
Labels
  • :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes)
  • :Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection)
  • resiliency
  • Team:Distributed (Obsolete) (Meta label for distributed team, obsolete. Replaced by Distributed Indexing/Coordination.)

Comments

@DaveCTurner
Contributor

DaveCTurner commented Aug 7, 2019

Today we do not allow a node to start if its filesystem is readonly, but it is possible for a filesystem to become readonly while the node is running. We don't currently have any infrastructure in place to make sure that Elasticsearch behaves well if this happens. A node that cannot write to disk may be poisonous to the rest of the cluster:

  • a readonly master-eligible node repeatedly leaves and rejoins the cluster, and may trigger elections by offering a vote that it cannot give
  • a readonly data node may be assigned shards (e.g. due to rebalancing) which will then immediately fail (cf. Prevent allocating shards to broken nodes #18417). Any shards that are already assigned to the node may eventually fail too (e.g. when syncing retention leases triggers some IO).

This issue is to improve Elasticsearch's behaviour when a node becomes readonly:

  • verify that the data paths are writeable before joining a cluster
  • verify that the data paths are writeable before offering a pre-vote
  • periodically verify that data paths are writeable, and leave the cluster if they are not
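
A minimal sketch of the kind of writability check these three tasks call for, assuming a probe that creates, syncs and deletes a small temporary file in each data path. The names `DataPathWritabilityProbe`, `isWritable`, `allWritable` and `schedulePeriodicCheck` are hypothetical and are not Elasticsearch APIs; how the result is wired into joining, pre-voting and leaving the cluster is left to the caller.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class DataPathWritabilityProbe {

    // Returns true only if a small file can be created, synced to disk and
    // deleted inside dataPath. Any I/O failure (readonly filesystem, missing
    // directory, quota exceeded, ...) is reported as "not writable".
    static boolean isWritable(Path dataPath) {
        Path probe = null;
        try {
            probe = Files.createTempFile(dataPath, ".write_probe", ".tmp");
            Files.write(probe, "probe".getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.WRITE, StandardOpenOption.SYNC);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        } finally {
            if (probe != null) {
                try {
                    Files.deleteIfExists(probe);
                } catch (IOException ignored) {
                    // Best-effort cleanup; the probe result is already decided.
                }
            }
        }
    }

    // A node counts as healthy only if every one of its data paths is writable.
    static boolean allWritable(List<Path> dataPaths) {
        for (Path path : dataPaths) {
            if (!isWritable(path)) {
                return false;
            }
        }
        return true;
    }

    // Re-runs the check at a fixed delay and invokes onUnwritable (for
    // example, a callback that makes the node leave the cluster) on failure.
    static ScheduledFuture<?> schedulePeriodicCheck(ScheduledExecutorService executor,
                                                    List<Path> dataPaths,
                                                    long intervalSeconds,
                                                    Runnable onUnwritable) {
        return executor.scheduleWithFixedDelay(() -> {
            if (!allWritable(dataPaths)) {
                onUnwritable.run();
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }
}
```

Under this sketch the same `allWritable` call could guard all three points in the list: before sending a join request, before offering a pre-vote, and inside the periodic task.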
DaveCTurner added the resiliency, :Distributed Coordination/Allocation, and :Distributed Coordination/Cluster Coordination labels Aug 7, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@Bukhtawar
Contributor

@DaveCTurner thanks for taking this up. Is this fix already prioritized? Do you think I could pick up any of the tasks, since I have some context on the issue based on our discussions?

@DaveCTurner
Contributor Author

This work is not yet on our roadmap, but if you have ideas for how to proceed then it'd be good to see them. Note that I'll be unavailable for the next few weeks so don't expect prompt feedback from me.

@pugnascotia
Contributor

#25591 feels relevant here: it covers documenting a cluster's behaviour when a filesystem crashes but the node(s) remain operational.

@DaveCTurner
Contributor Author

DaveCTurner commented Sep 12, 2019

Indeed, solving this issue will also resolve #25591 since here we are aiming to exclude the cases where the filesystem is unusable but the node remains in the cluster.

@DaveCTurner
Contributor Author

For completeness, "readonly" also includes cases such as Disk quota exceeded.

@Bukhtawar
Contributor

Bukhtawar commented Feb 10, 2020

@DaveCTurner, just a heads up: I will be raising a PR for this issue, hopefully this week.
One clarification though: shouldn't master-eligible readonly nodes also be blocked from starting a pre-voting round?

@DaveCTurner
Contributor Author

> shouldn't master eligible readonly nodes also be blocked from starting a pre-voting round

Sounds reasonable, yes. Or else we block sending pre-votes (mentioned in the OP) and require a pre-vote from the local node before PreVotingRound#handlePreVoteResponse starts the election.

@DaveCTurner
Contributor Author

DaveCTurner commented Feb 10, 2020

Actually I think I prefer the latter idea: require a pre-vote from the local node. A side-effect of receiving a pre-voting request is to call Coordinator#updateMaxTermSeen which is best to call as early as possible. If we delayed that until a node stopped being read-only then it could trigger another election in an otherwise healthy cluster, which would be a bit weird.


Edit: I'm unsure again; I see advantages on both sides. I think it's a minor point and hopefully easily adjusted later, so let's not dwell on it.
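
A rough sketch of the "require a pre-vote from the local node" option discussed above, written as a toy model and not as the actual `PreVotingRound` or `Coordinator` code. The class `PreVoteRoundSketch`, its constructor parameters and the method `requestLocalPreVote` are hypothetical. It assumes that the max-term update happens on every pre-vote request regardless of writability, and that an election starts only once a quorum of pre-votes includes one from the local node.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Toy model for illustration only; not the real Elasticsearch classes.
final class PreVoteRoundSketch {
    private final String localNodeId;
    private final int quorumSize;
    private final BooleanSupplier localNodeWritable; // e.g. backed by a filesystem health check
    private final Set<String> preVotesReceived = new HashSet<>();
    private long maxTermSeen;
    private boolean electionStarted;

    PreVoteRoundSketch(String localNodeId, int quorumSize, BooleanSupplier localNodeWritable) {
        this.localNodeId = localNodeId;
        this.quorumSize = quorumSize;
        this.localNodeWritable = localNodeWritable;
    }

    long maxTermSeen() {
        return maxTermSeen;
    }

    // Handles an incoming pre-vote request. The term bookkeeping always runs,
    // even on a readonly node; only the pre-vote itself is withheld.
    boolean handlePreVoteRequest(long requesterTerm) {
        maxTermSeen = Math.max(maxTermSeen, requesterTerm);
        return localNodeWritable.getAsBoolean();
    }

    // Records a pre-vote and returns true exactly once: when a quorum has
    // been reached AND the local node's own pre-vote is part of it.
    boolean handlePreVoteResponse(String fromNodeId) {
        preVotesReceived.add(fromNodeId);
        if (electionStarted) {
            return false;
        }
        boolean haveQuorum = preVotesReceived.size() >= quorumSize;
        boolean haveLocalVote = preVotesReceived.contains(localNodeId);
        if (haveQuorum && haveLocalVote) {
            electionStarted = true;
            return true;
        }
        return false;
    }

    // The local node asks itself for a pre-vote; a readonly node never
    // grants it, so it can never start an election in this model.
    boolean requestLocalPreVote() {
        if (handlePreVoteRequest(maxTermSeen)) {
            return handlePreVoteResponse(localNodeId);
        }
        return false;
    }
}
```

In this model a readonly node still learns about higher terms promptly, but pre-votes from other nodes alone are never enough for it to start an election.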

Bukhtawar added a commit to Bukhtawar/elasticsearch that referenced this issue Feb 23, 2020
…ite to all paths and emits a stats is_writable as a part of node stats API.

 FsReadOnlyMonitor pulls up the stats and tries to remove the node if not all paths are found to be writable.
 Addresses elastic#45286.
rjernst added the Team:Distributed (Obsolete) label May 4, 2020
DaveCTurner pushed a commit to DaveCTurner/elasticsearch that referenced this issue Jul 7, 2020
Today we do not allow a node to start if its filesystem is readonly, but
it is possible for a filesystem to become readonly while the node is
running. We don't currently have any infrastructure in place to make
sure that Elasticsearch behaves well if this happens. A node that cannot
write to disk may be poisonous to the rest of the cluster.

With this commit we periodically verify that nodes' filesystems are
writable. If a node fails these writability checks then it is removed
from the cluster and prevented from re-joining until the checks start
passing again.

Closes elastic#45286
DaveCTurner added a commit that referenced this issue Jul 7, 2020
Today we do not allow a node to start if its filesystem is readonly, but
it is possible for a filesystem to become readonly while the node is
running. We don't currently have any infrastructure in place to make
sure that Elasticsearch behaves well if this happens. A node that cannot
write to disk may be poisonous to the rest of the cluster.

With this commit we periodically verify that nodes' filesystems are
writable. If a node fails these writability checks then it is removed
from the cluster and prevented from re-joining until the checks start
passing again.

Closes #45286

Co-authored-by: Bukhtawar Khan <bukhtawar7152@gmail.com>

5 participants