Better handling of IOException for missing dictionary bundles #57574

ppf2 · 2020-06-03T01:53:35Z

Elasticsearch version: 6.6.1

If a dictionary file (e.g., a stop words file for an analysis chain) is missing from the file system, the cluster can enter a recovery loop, repeatedly trying and failing to start the shard:

[<master>] failing shard [failed shard, shard [<index_name>][0], node[FCTG63TLQKSKR-TP1mWnRA], relocating [AkawpiLAQrWjdscFm8wKRA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=AnHSDjEsRTmgxYlfGKVxyQ, rId=wQVYB77ES1ygVXvWFh6fLw], expected_shard_size[261], message [failed to create index], failure [IllegalArgumentException[IOException while reading stopwords_path: /app/config/<example>-stopwords.txt]; nested: NoSuchFileException[/app/config/<example>-stopwords.txt]; ], markAsStale [true]]
java.lang.IllegalArgumentException: IOException while reading stopwords_path: /app/config/<example>-stopwords.txt
	at org.elasticsearch.index.analysis.Analysis.getWordList(Analysis.java:264) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.Analysis.getWordList(Analysis.java:231) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.Analysis.parseWords(Analysis.java:170) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.Analysis.parseStopWords(Analysis.java:194) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.StopTokenFilterFactory.<init>(StopTokenFilterFactory.java:47) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:355) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenFilterFactories(AnalysisRegistry.java:178) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:159) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.IndexService.<init>(IndexService.java:164) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:397) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:519) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:473) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:156) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:462) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:232) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:486) ~[elasticsearch-6.6.1.jar:6.6.1]
	at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_144]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:483) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:470) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:421) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:165) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.6.1.jar:6.6.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: /app/config/<example>-stopwords.txt
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) ~[?:?]
	at co.elastic.cloud.quotaawarefs.QuotaAwareFileSystemProvider.newByteChannel(QuotaAwareFileSystemProvider.java:264) ~[quota-aware-fs-1.1.1-SNAPSHOT.jar:?]
	at java.nio.file.Files.newByteChannel(Files.java:361) ~[?:1.8.0_144]
	at java.nio.file.Files.newByteChannel(Files.java:407) ~[?:1.8.0_144]
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384) ~[?:1.8.0_144]
	at java.nio.file.Files.newInputStream(Files.java:152) ~[?:1.8.0_144]
	at java.nio.file.Files.newBufferedReader(Files.java:2784) ~[?:1.8.0_144]
	at org.elasticsearch.index.analysis.Analysis.getWordList(Analysis.java:255) ~[elasticsearch-6.6.1.jar:6.6.1]
	... 26 more
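
To make the shape of this failure concrete, here is a minimal sketch of how a missing word-list file can surface as an `IllegalArgumentException` with a nested `NoSuchFileException`, as in the trace above. It is not the actual `Analysis.getWordList` code: the class `WordListSketch`, the method `loadWordList`, and the file name used in `main` are assumptions made for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class WordListSketch {

    // Hypothetical stand-in for a word-list loader; not the Elasticsearch implementation.
    static List<String> loadWordList(Path path, String settingName) {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            List<String> words = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    words.add(trimmed);
                }
            }
            return words;
        } catch (IOException e) {
            // A missing file raises NoSuchFileException (an IOException), which is
            // rethrown as an unchecked exception with the original kept as the cause.
            throw new IllegalArgumentException(
                "IOException while reading " + settingName + ": " + path, e);
        }
    }

    public static void main(String[] args) {
        try {
            // Assumed file name; any path that does not exist behaves the same way.
            loadWordList(Paths.get("does-not-exist-stopwords.txt"), "stopwords_path");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
            System.out.println("cause: " + e.getCause().getClass().getName());
        }
    }
}
```

Because the wrapper is thrown while the index service is being constructed, every allocation attempt on a node without the file would fail in the same way.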

This has led to issues such as the memory leak bug (#48230), which is fixed in a later version. Another undesirable effect is that the master node is kept busy processing a never-ending loop of shard-failed tasks, which have higher priority than "normal" tasks like snapshots. For example, snapshot requests keep failing with a ProcessClusterEventTimeoutException until the IOException is addressed (or until the master node is less busy):

failed to create snapshot
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create_snapshot [scheduled-1591143244-instance-0000000013]) within 5m
	at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:129) ~[elasticsearch-6.6.1.ja....

Have we considered treating these as permanent failures, so that the cluster does not retry recovery indefinitely and cause other issues?
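
One way to picture the suggestion is a check that walks the cause chain of a shard failure and flags it as permanent when the root problem is a missing file. This is only a sketch of the idea, not existing Elasticsearch behavior: the class `FailureClassifierSketch`, the method `isPermanent`, and the policy that a `NoSuchFileException` counts as permanent are all assumptions.

```java
import java.io.IOException;
import java.nio.file.NoSuchFileException;

final class FailureClassifierSketch {

    // Hypothetical policy: a failure is "permanent" if any exception in its
    // cause chain is a NoSuchFileException, since retrying cannot create the file.
    static boolean isPermanent(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof NoSuchFileException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable missingFile = new IllegalArgumentException(
            "IOException while reading stopwords_path",
            new NoSuchFileException("/app/config/stopwords.txt")); // assumed path
        Throwable otherIo = new IOException("some other I/O problem");

        System.out.println("missing file permanent? " + isPermanent(missingFile));
        System.out.println("other I/O permanent?    " + isPermanent(otherIo));
    }
}
```

A real policy would need to cover more cases, and a file that is missing on one node may still exist on another, so "permanent" here can only mean permanent for that node.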

This recovery loop could also happen with failures other than IOExceptions, such as when a user dictionary contains an invalid entry (reported here wrapped in a NotSerializableExceptionWrapper):

[node_name] [index_name][0] received shard failed for shard id [[index_name][0]], allocation id [TQL8ZQWpS5yfJvv9Amg-Xw], primary term [0], message [failed to create index], failure [NotSerializableExceptionWrapper[runtime_exception: Illegal user dictionary entry i am - the number of segmentations (2) does not the match number of readings (1)]]
org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: runtime_exception: Illegal user dictionary entry i am - the number of segmentations (2) does not the match number of readings (1)
	at org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:112) ~[?:?]
	at org.apache.lucene.analysis.ja.dict.UserDictionary.open(UserDictionary.java:81) ~[?:?]
	at org.elasticsearch.index.analysis.KuromojiTokenizerFactory.getUserDictionary(KuromojiTokenizerFactory.java:65) ~[?:?]
	at org.elasticsearch.index.analysis.KuromojiTokenizerFactory.<init>(KuromojiTokenizerFactory.java:52) ~[?:?]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:342) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:176) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:154) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.index.IndexService.<init>(IndexService.java:145) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:363) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:448) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:413) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:147) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:444) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:202) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263) ~[elasticsearch-5.6.9.jar:5.6.9]


elasticmachine · 2020-06-03T01:55:50Z

Pinging @elastic/es-distributed (:Distributed/Recovery)

DaveCTurner · 2022-07-29T09:49:15Z

I'm closing this because I think we handle this kind of failure acceptably today. It's not a never-ending recovery loop: we stop after 5 failures.
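
The "stop after 5 failures" behavior can be pictured as a bounded retry loop. The sketch below is a simplified model, not the allocation code itself: `BoundedRetrySketch`, `ShardStarter`, `Outcome`, and `tryStart` are hypothetical names. The limit of 5 is assumed to correspond to the default of the `index.allocation.max_retries` index setting.

```java
import java.nio.file.NoSuchFileException;

final class BoundedRetrySketch {

    // Hypothetical callback standing in for "try to start the shard on a node".
    interface ShardStarter {
        void start() throws Exception;
    }

    // Result of the bounded retry: whether the shard started, and how many attempts failed.
    static final class Outcome {
        final boolean started;
        final int failures;

        Outcome(boolean started, int failures) {
            this.started = started;
            this.failures = failures;
        }
    }

    // Retries until the first success or until maxRetries attempts have failed.
    static Outcome tryStart(ShardStarter starter, int maxRetries) {
        int failures = 0;
        while (failures < maxRetries) {
            try {
                starter.start();
                return new Outcome(true, failures);
            } catch (Exception e) {
                failures++; // count the failed attempt and try again, up to the limit
            }
        }
        return new Outcome(false, failures); // give up: the shard stays unassigned
    }

    public static void main(String[] args) {
        // Assumed limit of 5, mirroring the number quoted in the comment above.
        Outcome outcome = tryStart(() -> {
            throw new NoSuchFileException("/app/config/stopwords.txt"); // assumed path
        }, 5);
        System.out.println("started=" + outcome.started + ", failures=" + outcome.failures);
    }
}
```

Once the limit is reached and the underlying problem (for example, the missing file) has been fixed, allocation can be retried manually with the cluster reroute API, `POST /_cluster/reroute?retry_failed=true`.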

ppf2 added the >enhancement, needs:triage, and :Distributed Indexing/Recovery labels Jun 3, 2020

elasticmachine added the Team:Distributed (Obsolete) label Jun 3, 2020

DaveCTurner closed this as completed Jul 29, 2022
