
[BUG][Remote Store] Snapshot failing with metadata not found for given commit generation and Primary Term #10217

Closed
harishbhakuni opened this issue Sep 25, 2023 · 6 comments
Assignees
Labels
bug (Something isn't working) · Storage (Issues and PRs relating to data and metadata storage) · v2.11.0 (Issues and PRs related to version 2.11.0)

Comments

@harishbhakuni (Contributor) commented Sep 25, 2023

Describe the bug
Sometimes a shallow copy snapshot fails with java.nio.file.NoSuchFileException: Metadata file is not present for given primary term x and generation y. This did not fail before the change that enabled remote translog by default for all remote-store-enabled indices. With remote translog, the metadata upload sometimes fails for a reason that is not yet clear.

Stack trace

 1> [2023-09-20T18:35:08,312][INFO ][o.o.g.GatewayService     ] [node_s_0] recovered [0] indices into cluster_state
  1> [2023-09-20T18:35:08,335][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[5xKtmZlJRVebR0ZXyQUZTg/3pX3b02mT0i4h__cwb5nog]
  1> [2023-09-20T18:35:08,475][INFO ][o.o.c.m.MetadataIndexTemplateService] [node_s_0] adding template [one_shard_index_template] for index patterns [*]
  1> [2023-09-20T18:35:08,529][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[gW_0lhzgRQ2jP722YTjFDA/87LlrmZPT66T9xhIbvg9rQ]
  1> [2023-09-20T18:35:08,531][INFO ][o.o.c.m.MetadataIndexTemplateService] [node_s_0] adding template [random-soft-deletes-template] for index patterns [*]
  1> [2023-09-20T18:35:08,572][INFO ][o.o.r.b.BlobStoreRepositoryRemoteIndexTests] [testRetrieveShallowCopySnapshotCase2] before test
  1> [2023-09-20T18:35:08,574][INFO ][o.o.r.b.BlobStoreRepositoryRemoteIndexTests] [testRetrieveShallowCopySnapshotCase2] -->  creating snapshot repository
  1> [2023-09-20T18:35:08,610][INFO ][o.o.r.RepositoriesService] [node_s_0] put repository [test-repo]
  1> [2023-09-20T18:35:08,668][INFO ][o.o.r.b.BlobStoreRepositoryRemoteIndexTests] [testRetrieveShallowCopySnapshotCase2] --> creating an index and indexing documents
  1> [2023-09-20T18:35:08,674][DEPRECATION][o.o.d.c.m.MetadataCreateIndexService] [node_s_0] index [test-idx] matches multiple legacy templates [one_shard_index_template, random-soft-deletes-template], composable templates will only match a single template
  1> [2023-09-20T18:35:08,682][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[test-idx/Mp2VSUHSTjSU5CV2fZWKfw]
  1> [2023-09-20T18:35:08,687][INFO ][o.o.c.m.MetadataCreateIndexService] [node_s_0] [test-idx] creating index, cause [api], templates [random-soft-deletes-template, one_shard_index_template], shards [1]/[0]
  1> [2023-09-20T18:35:08,755][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[test-idx/Mp2VSUHSTjSU5CV2fZWKfw]
  1> [2023-09-20T18:35:09,152][DEBUG][o.o.r.b.BlobStoreRepositoryRemoteIndexTests] [testRetrieveShallowCopySnapshotCase2] indices [_all] are green
  1> [2023-09-20T18:35:09,242][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[test-idx/Mp2VSUHSTjSU5CV2fZWKfw]
  1> [2023-09-20T18:35:09,247][INFO ][o.o.c.m.MetadataMappingService] [node_s_0] [test-idx/Mp2VSUHSTjSU5CV2fZWKfw] create_mapping
  1> [2023-09-20T18:35:11,975][INFO ][o.o.i.t.t.TranslogTransferManager] [node_s_0] [test-idx][0] Deleting primary terms from remote store lesser than 1
  1> [2023-09-20T18:35:18,596][INFO ][o.o.r.b.BlobStoreRepositoryRemoteIndexTests] [testRetrieveShallowCopySnapshotCase2] --> creating a remote store enabled index and indexing documents
  1> [2023-09-20T18:35:18,598][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[test-rs-idx/SVXDkTFQRaiuRcGdZIu6iQ]
  1> [2023-09-20T18:35:18,601][INFO ][o.o.c.m.MetadataCreateIndexService] [node_s_0] [test-rs-idx] creating index, cause [api], templates [random-soft-deletes-template, one_shard_index_template], shards [1]/[1]
  1> [2023-09-20T18:35:18,642][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[test-rs-idx/SVXDkTFQRaiuRcGdZIu6iQ]
  1> [2023-09-20T18:35:18,794][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node_s_0] [test-rs-idx][0] Scheduled retry with didRefresh=true
  1> [2023-09-20T18:35:18,806][INFO ][o.o.p.PluginsService     ] [node_s_0] PluginService:onIndexModule index:[test-rs-idx/SVXDkTFQRaiuRcGdZIu6iQ]
  1> [2023-09-20T18:35:18,809][INFO ][o.o.c.m.MetadataMappingService] [node_s_0] [test-rs-idx/SVXDkTFQRaiuRcGdZIu6iQ] create_mapping
  1> [2023-09-20T18:35:27,437][INFO ][o.o.r.b.BlobStoreRepositoryRemoteIndexTests] [testRetrieveShallowCopySnapshotCase2] --> create first remote index shallow snapshot
  1> [2023-09-20T18:35:27,437][INFO ][o.o.r.RepositoriesService] [node_s_0] update repository [test-repo]
  1> [2023-09-20T18:35:27,485][INFO ][o.o.r.b.BlobStoreRepositoryRemoteIndexTests] [testRetrieveShallowCopySnapshotCase2] --> creating snapshot [test-snap-1] of [test-idx, test-rs-idx] in [test-repo]
  1> [2023-09-20T18:35:27,514][INFO ][o.o.s.SnapshotsService   ] [node_s_0] snapshot [test-repo:test-snap-1/2q4MFU80Sue7Z093iODCkQ] started
  1> [2023-09-20T18:35:27,523][INFO ][o.o.r.b.BlobStoreRepository] [node_s_0] [[test-rs-idx][0]] [test-snap-1/2q4MFU80Sue7Z093iODCkQ] snapshot to [test-repo] [_new] ...
  1> [2023-09-20T18:35:27,546][WARN ][o.o.s.SnapshotShardsService] [node_s_0] [[test-idx][0]][test-repo:test-snap-1/2q4MFU80Sue7Z093iODCkQ] failed to snapshot shard
  1> java.nio.file.NoSuchFileException: Metadata file is not present for given primary term 1 and generation 4
  1>    at org.opensearch.index.store.RemoteSegmentStoreDirectory.getMetadataFileForCommit(RemoteSegmentStoreDirectory.java:500) ~[main/:?]
  1>    at org.opensearch.index.store.RemoteSegmentStoreDirectory.acquireLock(RemoteSegmentStoreDirectory.java:442) ~[main/:?]
  1>    at org.opensearch.index.shard.IndexShard.acquireLockOnCommitData(IndexShard.java:1548) ~[main/:?]
  1>    at org.opensearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:409) [main/:?]
  1>    at org.opensearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:282) [main/:?]
  1>    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [main/:?]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1>    at java.lang.Thread.run(Thread.java:1623) [?:?]
  1> [2023-09-20T18:35:27,672][INFO ][o.o.s.SnapshotsService   ] [node_s_0] snapshot [test-repo:test-snap-1/2q4MFU80Sue7Z093iODCkQ] completed with state [PARTIAL]
  1> [2023-09-20T18:35:27,678][INFO ][o.o.r.b.BlobStoreRepositoryRemoteIndexTests] [testRetrieveShallowCopySnapshotCase2] after test
  1> [2023-09-20T18:35:27,681][INFO ][o.o.c.m.MetadataDeleteIndexService] [node_s_0] [test-idx/Mp2VSUHSTjSU5CV2fZWKfw] deleting index
  1> [2023-09-20T18:35:27,682][INFO ][o.o.c.m.MetadataDeleteIndexService] [node_s_0] [test-rs-idx/SVXDkTFQRaiuRcGdZIu6iQ] deleting index
  1> [2023-09-20T18:35:27,739][INFO ][o.o.i.t.t.TranslogTransferManager] [node_s_0] [test-idx][0] Deleted all remote translog data
  2> REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.repositories.blobstore.BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase2" -Dtests.seed=4C1E3EA4A4E022F9 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=en-US -Dtests.timezone=Portugal -Druntime.java=20
  2> java.lang.AssertionError:
    Expected: is <SUCCESS>
         but: was <PARTIAL>
        at __randomizedtesting.SeedInfo.seed([4C1E3EA4A4E022F9:5752428A1ED9A30]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:964)
        at org.junit.Assert.assertThat(Assert.java:930)
        at org.opensearch.repositories.blobstore.BlobStoreRepositoryHelperTests.createSnapshot(BlobStoreRepositoryHelperTests.java:160)
        at org.opensearch.repositories.blobstore.BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase2(BlobStoreRepositoryRemoteIndexTests.java:302)
  1> [2023-09-20T18:35:27,841][INFO ][o.o.n.Node               ] [suite] stopping ...
  1> [2023-09-20T18:35:27,846][INFO ][o.o.n.Node               ] [suite] stopped
  1> [2023-09-20T18:35:27,846][INFO ][o.o.n.Node               ] [suite] closing ...
  1> [2023-09-20T18:35:27,854][INFO ][o.o.n.Node               ] [suite] closed
  1> [2023-09-20T18:35:27,910][INFO ][o.o.i.t.t.TranslogTransferManager] [node_s_0] [test-rs-idx][0] Deleted all remote translog data
  1> [2023-09-20T18:35:27,917][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [node_s_0] Remote directory still has files , not deleting the path
  2> NOTE: leaving temporary files on disk at: /local/home/hbhakuni/workspace/os-opensource/OpenSearch/server/build/testrun/test/temp/org.opensearch.repositories.blobstore.BlobStoreRepositoryRemoteIndexTests_4C1E3EA4A4E022F9-001
  2> NOTE: test params are: codec=Asserting(Lucene95): {index_uuid=Lucene90, type=Lucene90}, docValues:{}, maxPointsInLeafNode=1665, maxMBSortInHeap=7.490396343497602, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=en-US, timezone=Portugal
  2> NOTE: Linux 5.4.254-175.358.amzn2int.x86_64 amd64/Eclipse Adoptium 20.0.2 (64-bit)/cpus=8,threads=2,free=531453120,total=708837376
  2> NOTE: All tests run in this JVM: [BlobStoreRepositoryRemoteIndexTests]

Tests with failures:
 - org.opensearch.repositories.blobstore.BlobStoreRepositoryRemoteIndexTests.testRetrieveShallowCopySnapshotCase2

1 test completed, 1 failed

To Reproduce
Reproduced this by running the above test in a loop; it failed around the 100th run. The issue is therefore infrequent and appears to be triggered by a corner case.
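A loop along these lines can drive the reproduction (a generic sketch; run_until_failure is a hypothetical helper, and the command to substitute is the ./gradlew invocation from the REPRODUCE WITH line above, with -Dtests.seed omitted so each run picks a fresh seed):

```shell
# Hypothetical helper: run a command up to N times, stopping at the first failure.
run_until_failure() {
  local n=$1; shift
  for i in $(seq 1 "$n"); do
    "$@" || { echo "failed on run $i"; return 1; }
  done
  echo "all $n runs passed"
}

# Demo with a trivial command; substitute the gradle test invocation here.
run_until_failure 3 true   # prints: all 3 runs passed
```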

@harishbhakuni harishbhakuni added bug Something isn't working untriaged labels Sep 25, 2023
@harishbhakuni (Contributor Author)

Need help from the remote translog folks on this one. Tagging @gbbafna and @sachinpkale for visibility.

@sachinpkale (Member)

Taking a look

@sachinpkale sachinpkale self-assigned this Sep 26, 2023
@sachinpkale sachinpkale added Storage Issues and PRs relating to data and metadata storage v2.11.0 Issues and PRs related to version 2.11.0 and removed untriaged labels Sep 28, 2023
@sachinpkale (Member)

  • If two refresh flows execute concurrently (one scheduled refresh and one as part of a flush), there is a race condition: the scheduled refresh uploads the new segments, the flush then writes a new segments_N, and the refresh that is part of the flush becomes a no-op because it finds no new segments that are not already part of the reader.
  • As a result, the metadata file corresponding to the commit generation on disk is never uploaded to the remote store, which results in the NoSuchFileException.

Logs from test failure (prefixed to show the two refreshes executing concurrently):

Refresh 1 - [2023-10-03T16:27:30,424][INFO ][o.o.i.e.Engine           ] [node_s_0] [test-idx][0] --> In refresh, with source = schedule, block = false, scope = EXTERNAL
Refresh 1 - [2023-10-03T16:27:30,424][INFO ][o.o.i.e.I.ExternalReaderManager] [node_s_0] --> In ExternalReaderManager.refreshIfNeeded, before internalReaderManager.maybeRefreshBlocking()
Refresh 1 - [2023-10-03T16:27:30,424][INFO ][o.o.i.e.OpenSearchReaderManager] [node_s_0] --> Before calling openIfChanged
Refresh 1 - [2023-10-03T16:27:30,467][INFO ][o.o.i.e.OpenSearchReaderManager] [node_s_0] Reader post refresh. Generation: 3
Refresh 1 - [2023-10-03T16:27:30,471][DEBUG][o.o.i.s.RemoteStoreRefreshListener] [node_s_0] [test-idx][0] --> In afterRefresh
Refresh 1 - [2023-10-03T16:27:30,472][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node_s_0] [test-idx][0] --> shouldSync is true, triggering syncSegments. didRefresh = true

Flush     - [2023-10-03T16:27:30,478][TRACE][o.o.i.e.E.IW             ] [node_s_0] [test-idx][0] opensearch[node_s_0][flush][T#1] IW: commit: done writing segments file "segments_4"

Refresh 2 - [2023-10-03T16:27:30,481][INFO ][o.o.i.e.Engine           ] [node_s_0] [test-idx][0] --> In refresh, with source = version_table_flush, block = true, scope = INTERNAL

Refresh 1 - [2023-10-03T16:27:30,508][INFO ][o.o.i.e.I.ExternalReaderManager] [node_s_0] --> In ExternalReaderManager.refreshIfNeeded, after internalReaderManager.maybeRefreshBlocking()
Refresh 1 - [2023-10-03T16:27:30,509][INFO ][o.o.i.e.I.ExternalReaderManager] [node_s_0] --> In ExternalReaderManager.refreshIfNeeded, after internalReaderManager.acquire()
Refresh 2 - [2023-10-03T16:27:30,509][INFO ][o.o.i.e.OpenSearchReaderManager] [node_s_0] --> Before calling openIfChanged
Refresh 2 - [2023-10-03T16:27:30,509][INFO ][o.o.i.e.OpenSearchReaderManager] [node_s_0] Reader post refresh. Reader: null
Refresh 2 - [2023-10-03T16:27:30,509][DEBUG][o.o.i.s.RemoteStoreRefreshListener] [node_s_0] [test-idx][0] --> In afterRefresh
Refresh 2 - [2023-10-03T16:27:30,509][INFO ][o.o.i.s.RemoteStoreRefreshListener] [node_s_0] [test-idx][0] --> shouldSync is false, returning successful. didRefresh = false

Refresh 2 - [2023-10-03T16:27:30,509][INFO ][o.o.i.e.Engine           ] [node_s_0] [test-idx][0] --> Refresh completed, with source = version_table_flush, block = true
Refresh 1 - [2023-10-03T16:27:30,513][INFO ][o.o.i.e.Engine           ] [node_s_0] [test-idx][0] --> Refresh completed, with source = schedule, block = false
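The interleaving above can be modeled with a minimal, self-contained sketch (hypothetical class, method, and key names; not the actual OpenSearch implementation): the flush-triggered refresh reports didRefresh = false, so the upload of commit metadata for the new generation is skipped, and the snapshot's later lookup by (primary term, generation) fails exactly as in the stack trace.

```java
import java.nio.file.NoSuchFileException;
import java.util.HashMap;
import java.util.Map;

// Simplified model of the race: remote segment metadata is keyed by
// (primaryTerm, generation), and an upload only happens when the refresh
// actually opened a new reader (didRefresh == true).
public class ShallowSnapshotRaceSketch {
    private final Map<String, String> remoteMetadata = new HashMap<>();

    private static String key(long primaryTerm, long generation) {
        return "metadata__" + primaryTerm + "__" + generation;
    }

    // Models the refresh listener: upload is skipped when the refresh found
    // nothing new (the flush-triggered refresh in the logs above).
    public void afterRefresh(boolean didRefresh, long primaryTerm, long generation) {
        if (!didRefresh) {
            return; // no-op refresh: commit metadata is never uploaded
        }
        remoteMetadata.put(key(primaryTerm, generation), "segment metadata");
    }

    // Models the snapshot-side lookup that throws in the stack trace.
    public String getMetadataFileForCommit(long primaryTerm, long generation)
            throws NoSuchFileException {
        String md = remoteMetadata.get(key(primaryTerm, generation));
        if (md == null) {
            throw new NoSuchFileException("Metadata file is not present for given primary term "
                + primaryTerm + " and generation " + generation);
        }
        return md;
    }

    public static void main(String[] args) {
        ShallowSnapshotRaceSketch shard = new ShallowSnapshotRaceSketch();
        // Refresh 1 uploads metadata for the pre-flush generation 3.
        shard.afterRefresh(true, 1, 3);
        // Flush writes segments_4, but its refresh sees no new segments
        // (didRefresh = false), so metadata for generation 4 is never uploaded.
        shard.afterRefresh(false, 1, 4);
        try {
            shard.getMetadataFileForCommit(1, 4); // the snapshot's acquireLock path
        } catch (NoSuchFileException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the real code path the lookup happens in RemoteSegmentStoreDirectory.getMetadataFileForCommit via IndexShard.acquireLockOnCommitData; the sketch only reproduces the shape of the failure, not the concurrency itself.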

@sachinpkale (Member)

Complete log - test.zip

@sachinpkale (Member)

Ran the test 1500+ times locally without a failure with the fix in #10341.

@harishbhakuni (Contributor Author)

Good find @sachinpkale. Thanks for quickly root causing and fixing this.

Projects
Status: 2.11.0 - (Launched)