Better handling of IOException for missing dictionary bundles #57574

Closed
ppf2 opened this issue Jun 3, 2020 · 2 comments

Labels: >bug, :Distributed Coordination/Allocation, Team:Distributed (Obsolete)

Comments

ppf2 (Member) commented Jun 3, 2020

Elasticsearch version: 6.6.1

If a dictionary file (e.g., a stop words file referenced by an analysis chain) is missing from the file system, the cluster can go into a recovery loop, repeatedly trying and failing to start the shard:

[<master>] failing shard [failed shard, shard [<index_name>][0], node[FCTG63TLQKSKR-TP1mWnRA], relocating [AkawpiLAQrWjdscFm8wKRA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=AnHSDjEsRTmgxYlfGKVxyQ, rId=wQVYB77ES1ygVXvWFh6fLw], expected_shard_size[261], message [failed to create index], failure [IllegalArgumentException[IOException while reading stopwords_path: /app/config/<example>-stopwords.txt]; nested: NoSuchFileException[/app/config/<example>-stopwords.txt]; ], markAsStale [true]]
java.lang.IllegalArgumentException: IOException while reading stopwords_path: /app/config/<example>-stopwords.txt
	at org.elasticsearch.index.analysis.Analysis.getWordList(Analysis.java:264) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.Analysis.getWordList(Analysis.java:231) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.Analysis.parseWords(Analysis.java:170) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.Analysis.parseStopWords(Analysis.java:194) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.StopTokenFilterFactory.<init>(StopTokenFilterFactory.java:47) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:355) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenFilterFactories(AnalysisRegistry.java:178) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:159) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.IndexService.<init>(IndexService.java:164) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:397) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:519) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:473) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:156) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:462) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:232) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:486) ~[elasticsearch-6.6.1.jar:6.6.1]
	at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_144]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:483) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:470) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:421) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:165) ~[elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.6.1.jar:6.6.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.6.1.jar:6.6.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: java.nio.file.NoSuchFileException: /app/config/<example>-stopwords.txt
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
	at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) ~[?:?]
	at co.elastic.cloud.quotaawarefs.QuotaAwareFileSystemProvider.newByteChannel(QuotaAwareFileSystemProvider.java:264) ~[quota-aware-fs-1.1.1-SNAPSHOT.jar:?]
	at java.nio.file.Files.newByteChannel(Files.java:361) ~[?:1.8.0_144]
	at java.nio.file.Files.newByteChannel(Files.java:407) ~[?:1.8.0_144]
	at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384) ~[?:1.8.0_144]
	at java.nio.file.Files.newInputStream(Files.java:152) ~[?:1.8.0_144]
	at java.nio.file.Files.newBufferedReader(Files.java:2784) ~[?:1.8.0_144]
	at org.elasticsearch.index.analysis.Analysis.getWordList(Analysis.java:255) ~[elasticsearch-6.6.1.jar:6.6.1]
	... 26 more
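
For reference, this failure comes from an analysis filter configured with stopwords_path, which Elasticsearch resolves against the node's config directory. A minimal sketch of that kind of configuration is below; the index, filter, analyzer, and file names are placeholders rather than values from the affected cluster:

PUT /example-index
{
  "settings": {
    "analysis": {
      "filter": {
        "example_stop": {
          "type": "stop",
          "stopwords_path": "example-stopwords.txt"
        }
      },
      "analyzer": {
        "example_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "example_stop"]
        }
      }
    }
  }
}

If example-stopwords.txt is later removed from (or was never copied to) a node's config directory, every attempt to create the index's shards on that node fails with the NoSuchFileException shown above.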

This has led to issues like the memory leak bug (#48230), which is fixed in a later version. Another undesirable effect is that the master node is kept busy processing a never-ending stream of shard-failed tasks, which run at a higher priority than "normal" tasks like snapshots. For example, snapshot requests keep failing with a ProcessClusterEventTimeoutException until the IOException is addressed (or the master node becomes less busy).

failed to create snapshot
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create_snapshot [scheduled-1591143244-instance-0000000013]) within 5m
	at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:129) ~[elasticsearch-6.6.1.ja....
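
While the cluster is in this state, the backlog on the master can be inspected with the pending cluster tasks API (shown here as a diagnostic suggestion, not output captured from the affected cluster):

GET /_cluster/pending_tasks
GET /_cat/pending_tasks?v

A long queue dominated by shard-failed entries sitting ahead of the create_snapshot task is consistent with the ProcessClusterEventTimeoutException above.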

Have we considered making these permanent failures so that recovery is not retried indefinitely, which can cause other issues in the cluster?

This recovery loop can also happen with other, non-IO exceptions, for example when a Kuromoji user dictionary contains an illegal entry (surfaced as a NotSerializableExceptionWrapper):

[node_name] [index_name][0] received shard failed for shard id [[index_name][0]], allocation id [TQL8ZQWpS5yfJvv9Amg-Xw], primary term [0], message [failed to create index], failure [NotSerializableExceptionWrapper[runtime_exception: Illegal user dictionary entry i am - the number of segmentations (2) does not the match number of readings (1)]]
org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: runtime_exception: Illegal user dictionary entry i am - the number of segmentations (2) does not the match number of readings (1)
	at org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:112) ~[?:?]
	at org.apache.lucene.analysis.ja.dict.UserDictionary.open(UserDictionary.java:81) ~[?:?]
	at org.elasticsearch.index.analysis.KuromojiTokenizerFactory.getUserDictionary(KuromojiTokenizerFactory.java:65) ~[?:?]
	at org.elasticsearch.index.analysis.KuromojiTokenizerFactory.<init>(KuromojiTokenizerFactory.java:52) ~[?:?]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:342) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:176) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:154) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.index.IndexService.<init>(IndexService.java:145) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:363) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:448) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:413) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:147) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:444) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:202) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:814) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:768) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:587) ~[elasticsearch-5.6.9.jar:5.6.9]
	at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.run(ClusterService.java:263) ~[elasticsearch-5.6.9.jar:5.6.9]
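
For context, the Kuromoji user dictionary is a CSV file whose lines take the form text,segmented-text,readings,part-of-speech, and the segmented-text and readings columns must contain the same number of space-separated tokens. The following hypothetical entry reproduces the error message above because it has two segments but only one reading:

i am,i am,iam,custom noun

Referencing such a file from the tokenizer settings (all names here are placeholders; requires the analysis-kuromoji plugin) makes index creation fail in AnalysisRegistry, as in the trace:

PUT /example-ja-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "example_kuromoji": {
          "type": "kuromoji_tokenizer",
          "user_dictionary": "userdict_ja.txt"
        }
      }
    }
  }
}
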
ppf2 added the >enhancement, needs:triage, and :Distributed Indexing/Recovery labels on Jun 3, 2020
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Recovery)

elasticmachine added the Team:Distributed (Obsolete) label on Jun 3, 2020
DaveCTurner added the :Distributed Coordination/Allocation and >bug labels and removed the :Distributed Indexing/Recovery, needs:triage, and >enhancement labels on Jun 3, 2020
DaveCTurner (Contributor)

I'm closing this because I think we handle this kind of failure acceptably today. It's not a never-ending recovery loop; we stop after 5 failures.
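
The retry limit referred to here appears to be the index.allocation.max_retries index setting, which defaults to 5. Once the underlying file problem is fixed, the failed allocations can be diagnosed and retried explicitly, for example:

GET /_cluster/allocation/explain
POST /_cluster/reroute?retry_failed=true

The allocation explain API reports why a shard is unassigned, and retry_failed asks the master to retry allocations that have hit the retry limit.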
