Better handling of IOException for missing dictionary bundles #57574
Labels
>bug
:Distributed Coordination/Allocation
All issues relating to the decision making around placing a shard (both master logic & on the nodes)
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
6.6.1
If a dictionary file (e.g., a stop words file for an analysis chain) is missing from the file system, the cluster can enter a recovery loop in which it repeatedly tries, and fails, to start the shard:
This has led to issues like the memory leak bug (#48230), which is fixed in a later version. Another undesirable effect is that the master node is kept busy processing a never-ending stream of shard-failed tasks, which have higher priority than "normal" tasks like snapshots. For example, snapshot requests keep failing with a ProcessClusterEventTimeoutException until the IOException is addressed (or until the master node is less busy).
Have we considered treating these as permanent failures, so that recovery is not retried indefinitely and does not cause other issues in the cluster?
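To make the suggestion concrete, here is a rough sketch of what classifying failures could look like. It is not Elasticsearch code: the class `ShardStartRetryPolicy`, its methods, and the cap of 5 attempts are hypothetical names and values chosen for illustration. The idea is only that a missing file anywhere in the cause chain is treated as permanent, while other failures get a bounded number of retries.

```java
import java.io.FileNotFoundException;
import java.nio.file.NoSuchFileException;

/**
 * Hypothetical sketch, not part of Elasticsearch: decides whether a failed
 * shard start is worth retrying.
 */
final class ShardStartRetryPolicy {

    /** Assumed cap on attempts for failures that might be transient. */
    static final int MAX_ATTEMPTS = 5;

    /** Guard against pathologically long or cyclic cause chains. */
    private static final int MAX_CAUSE_DEPTH = 32;

    private ShardStartRetryPolicy() {}

    /**
     * Returns true if the failure, or any exception in its cause chain,
     * indicates a missing file. Retrying cannot fix that without operator
     * action, so it is treated as permanent here.
     */
    static boolean isPermanent(Throwable failure) {
        Throwable t = failure;
        for (int depth = 0; t != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (t instanceof NoSuchFileException || t instanceof FileNotFoundException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /**
     * Retry only failures that are not permanent, and only while the number
     * of attempts made so far is below the cap.
     */
    static boolean shouldRetry(Throwable failure, int attemptsSoFar) {
        return !isPermanent(failure) && attemptsSoFar < MAX_ATTEMPTS;
    }
}
```

With something along these lines, a missing stop words file would fail the shard once and stay failed until someone intervenes, instead of generating a continuous stream of shard-failed tasks on the master. Other exception types that cannot succeed on retry could be added to `isPermanent` in the same way.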
This recovery loop could also happen with other non-IOExceptions such as when the dictionary file is not serializable: