-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not fail snapshot when deleting a missing snapshotted file #30332
Do not fail snapshot when deleting a missing snapshotted file #30332
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me as if only finalization needs special treatment. I would prefer that we first write the new index file before we start cleaning up the remaining index files.
@@ -780,7 +780,7 @@ private void writeAtomic(final String blobName, final BytesReference bytesRef) t | |||
} catch (IOException ex) { | |||
// temporary blob creation or move failed - try cleaning up | |||
try { | |||
snapshotsBlobContainer.deleteBlob(tempBlobName); | |||
snapshotsBlobContainer.deleteBlobIgnoringIfNotExists(tempBlobName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not change anything here? We are already catching the NoSuchFileException
in the line below, which is an IOException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If something goes wrong and an IOException is thrown when writing the temporary blob in
snapshotsBlobContainer.writeBlob(tempBlobName,..)
It means that in the catch block the deletion of the temp blob will fail, and adding this NoSuchFileException as a suppressed exception here looks like noise to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -962,7 +962,7 @@ protected void finalize(List<SnapshotFiles> snapshots, int fileListGeneration, M | |||
if (blobName.startsWith(DATA_BLOB_PREFIX)) { | |||
if (newSnapshots.findNameFile(BlobStoreIndexShardSnapshot.FileInfo.canonicalName(blobName)) == null) { | |||
try { | |||
blobContainer.deleteBlob(blobName); | |||
blobContainer.deleteBlobIgnoringIfNotExists(blobName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not change anything here? We are already catching the NoSuchFileException in the line below, which is an IOException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the behavior is the same but it just make more explicit that it is ok to try to delete a missing file here.
@@ -946,7 +946,7 @@ protected void finalize(List<SnapshotFiles> snapshots, int fileListGeneration, M | |||
for (String blobName : blobs.keySet()) { | |||
if (indexShardSnapshotsFormat.isTempBlobName(blobName) || blobName.startsWith(SNAPSHOT_INDEX_PREFIX)) { | |||
try { | |||
blobContainer.deleteBlob(blobName); | |||
blobContainer.deleteBlobIgnoringIfNotExists(blobName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about the comment below this:
// We cannot delete index file - this is fatal, we cannot continue, otherwise we might end up with references to non-existing files with references to non-existing files
This sounds very dangerous, and I wonder if we need special precaution here. I wonder for example if we should first write the new index file before starting to delete old ones. It sounds to me as if finalization bears the risk of losing all index files if there is a process crash between deleting old ones and writing new one. We currently compensate for this in buildBlobStoreIndexShardSnapshots
by doing a file listing and loading individual snapshots, but I find it odd to rely on such a fallback logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I wanted to look at this too. I agree we should write the shard index file first, fail hard if something goes wrong, and then delete the files that are not needed anymore. We can still keep the individual file listing in buildBlobStoreIndexShardSnapshots
.
@@ -139,7 +139,7 @@ public void writeAtomic(T obj, BlobContainer blobContainer, String name) throws | |||
} catch (IOException ex) { | |||
// Move failed - try cleaning up | |||
try { | |||
blobContainer.deleteBlob(tempBlobName); | |||
blobContainer.deleteBlobIgnoringIfNotExists(tempBlobName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not change anything here? We are already catching the NoSuchFileException in the line below, which is an IOException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - in this case it must not swallow the NoSuchFileException
Thanks @ywelsch. I updated the code and commented. Can you have another look please? |
@@ -780,7 +780,7 @@ private void writeAtomic(final String blobName, final BytesReference bytesRef) t | |||
} catch (IOException ex) { | |||
// temporary blob creation or move failed - try cleaning up | |||
try { | |||
snapshotsBlobContainer.deleteBlob(tempBlobName); | |||
snapshotsBlobContainer.deleteBlobIgnoringIfNotExists(tempBlobName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -930,54 +931,53 @@ public BlobStoreIndexShardSnapshot loadSnapshot() { | |||
} | |||
|
|||
/** | |||
* Removes all unreferenced files from the repository and writes new index file | |||
* Writes a new shard's index file and removes all unreferenced files from the repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writes a new index file for the shard and removes
...
} | ||
// Delete old index files | ||
for (final String blobName : blobs.keySet()) { | ||
if (blobName.startsWith(SNAPSHOT_INDEX_PREFIX) && (blobName.equals(currentIndexGen) == false)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would the current index gen be part of the blobs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's an unnecessary extra safety, I'll remove this.
// Delete old index files | ||
for (final String blobName : blobs.keySet()) { | ||
if (blobName.startsWith(SNAPSHOT_INDEX_PREFIX) && (blobName.equals(currentIndexGen) == false)) { | ||
blobContainer.deleteBlobIgnoringIfNotExists(blobName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's still nice to keep the debug logging as we had before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After debugging many snapshot/restore issues I've never found them useful as the IOException almost always contains the blobname and is logged at a higher level.
// Delete all blobs that don't exist in a snapshot | ||
for (final String blobName : blobs.keySet()) { | ||
if (blobName.startsWith(DATA_BLOB_PREFIX) && (updatedSnapshots.findNameFile(canonicalName(blobName)) == null)) { | ||
blobContainer.deleteBlobIgnoringIfNotExists(blobName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before, we would ignore any IOException, now we only ignore NoSuchFileException. I think, for optional clean-up, the previous one is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why we should blindly ignore IOException here (except NoSuchFileException which is harmless because we want to delete the file anyway). It could hide a permission issue, and let unnecessary files in the repository that could be deleted if it was noticed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why we should blindly ignore IOException here
I'm not a fan of blindly ignoring IOException here either. But I don't think it should the fail the snapshot which successfully completed. We can log the IOExceptions which are not NoSuchFileException as warning to make them more prominent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a good compromise, yes. Thanks!
} | ||
} catch (IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I preferred the exception handling how it was before, i.e. local to each action. This allows better debug messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree but OK, I'll add them back.
@ywelsch I updated the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
5c10ea1
to
da688c2
Compare
@elasticmachine test this please |
Thanks @ywelsch |
When deleting or creating a snapshot for a given shard, elasticsearch usually starts by listing all the existing snapshotted files in the repository. Then it computes a diff and deletes the snapshotted files that are not needed anymore. During this deletion, an exception is thrown if the file to be deleted does not exist anymore. This behavior is challenging with cloud based repository implementations like S3 where a file that has been deleted can still appear in the bucket for few seconds/minutes (because the deletion can take some time to be fully replicated on S3). If the deleted file appears in the listing of files, then the following deletion will fail with a NoSuchFileException and the snapshot will be partially created/deleted. This pull request makes the deletion of these files a bit less strict, ie not failing if the file we want to delete does not exist anymore. It introduces a new BlobContainer.deleteIgnoringIfNotExists() method that can be used at some specific places where not failing when deleting a file is considered harmless. Closes #28322
When deleting or creating a snapshot for a given shard, elasticsearch usually starts by listing all the existing snapshotted files in the repository. Then it computes a diff and deletes the snapshotted files that are not needed anymore. During this deletion, an exception is thrown if the file to be deleted does not exist anymore. This behavior is challenging with cloud based repository implementations like S3 where a file that has been deleted can still appear in the bucket for few seconds/minutes (because the deletion can take some time to be fully replicated on S3). If the deleted file appears in the listing of files, then the following deletion will fail with a NoSuchFileException and the snapshot will be partially created/deleted. This pull request makes the deletion of these files a bit less strict, ie not failing if the file we want to delete does not exist anymore. It introduces a new BlobContainer.deleteIgnoringIfNotExists() method that can be used at some specific places where not failing when deleting a file is considered harmless. Closes #28322
…r-you * origin/master: Update forcemerge.asciidoc (elastic#30113) Added zentity to the list of API extension plugins (elastic#29143) Fix the search request default operation behavior doc (elastic#29302) (elastic#29405) Watcher: Mark watcher as started only after loading watches (elastic#30403) Pass the task to broadcast actions (elastic#29672) Disable REST default settings testing until elastic#29229 is back-ported Correct wording in log message (elastic#30336) Do not fail snapshot when deleting a missing snapshotted file (elastic#30332) AwaitsFix testCreateShrinkIndexToN DOCS: Correct mapping tags in put-template api DOCS: Fix broken link in the put index template api Add put index template api to high level rest client (elastic#30400) Relax testAckedIndexing to allow document updating [Docs] Add snippets for POS stop tags default value Move respect accept header on no handler to 6.3.1
* origin/master: (39 commits) Docs: fix changelog merge Fix line length violation in cache tests Add stricter geohash parsing (elastic#30376) Add failing test for core cache deadlock [DOCS] convert forcemerge snippet Update forcemerge.asciidoc (elastic#30113) Added zentity to the list of API extension plugins (elastic#29143) Fix the search request default operation behavior doc (elastic#29302) (elastic#29405) Watcher: Mark watcher as started only after loading watches (elastic#30403) Pass the task to broadcast actions (elastic#29672) Disable REST default settings testing until elastic#29229 is back-ported Correct wording in log message (elastic#30336) Do not fail snapshot when deleting a missing snapshotted file (elastic#30332) AwaitsFix testCreateShrinkIndexToN DOCS: Correct mapping tags in put-template api DOCS: Fix broken link in the put index template api Add put index template api to high level rest client (elastic#30400) Relax testAckedIndexing to allow document updating [Docs] Add snippets for POS stop tags default value Move respect accept header on no handler to 6.3.1 ...
* elastic-master: Watcher: Mark watcher as started only after loading watches (#30403) Pass the task to broadcast actions (#29672) Disable REST default settings testing until #29229 is back-ported Correct wording in log message (#30336) Do not fail snapshot when deleting a missing snapshotted file (#30332) AwaitsFix testCreateShrinkIndexToN DOCS: Correct mapping tags in put-template api DOCS: Fix broken link in the put index template api Add put index template api to high level rest client (#30400) Relax testAckedIndexing to allow document updating [Docs] Add snippets for POS stop tags default value Move respect accept header on no handler to 6.3.1 Respect accept header on no handler (#30383) [Test] Add analysis-nori plugin to the vagrant tests [Docs] Fix bad link [Docs] Fix end of section in the korean plugin docs Expose the Lucene Korean analyzer module in a plugin (#30397) Docs: remove transport_client from CCS role example (#30263) [Rollup] Validate timezone in range queries (#30338) Use readFully() to read bytes from CipherInputStream (#28515) Fix docs Recently merged #29229 had a doc bug that broke the doc build. This commit fixes. Test: remove cluster permission from CCS user (#30262) Add Get Settings API support to java high-level rest client (#29229) Watcher: Remove unneeded index deletion in tests
When deleting or creating a snapshot for a given shard, elasticsearch usually starts by listing all the existing snapshotted files in the repository. Then it computes a diff and deletes the snapshotted files that are not needed anymore. During this deletion, an exception is thrown if the file to be deleted does not exist anymore. This behavior is challenging with cloud based repository implementations like S3 where a file that has been deleted can still appear in the bucket for few seconds/minutes (because the deletion can take some time to be fully replicated on S3). If the deleted file appears in the listing of files, then the following deletion will fail with a NoSuchFileException and the snapshot will be partially created/deleted. This pull request makes the deletion of these files a bit less strict, ie not failing if the file we want to delete does not exist anymore. It introduces a new BlobContainer.deleteIgnoringIfNotExists() method that can be used at some specific places where not failing when deleting a file is considered harmless. Closes #28322
* 6.x: Stop forking javac (#30462) Fix tribe tests Docs: Use task_id in examples of tasks (#30436) Security: Rename IndexLifecycleManager to SecurityIndexManager (#30442) Packaging: Set elasticsearch user to have non-existent homedir (#29007) [Docs] Fix typo in cardinality-aggregation.asciidoc (#30434) Avoid NPE in `more_like_this` when field has zero tokens (#30365) Build: Switch to building javadoc with html5 (#30440) Add a quick tour of the project to CONTRIBUTING (#30187) Add stricter geohash parsing (#30376) Reindex: Use request flavored methods (#30317) Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure. (#30432) Auto-expand replicas when adding or removing nodes (#30423) Silence IndexUpgradeIT test failures. (#30430) Fix line length violation in cache tests Add failing test for core cache deadlock [DOCS] convert forcemerge snippet Update forcemerge.asciidoc (#30113) Added zentity to the list of API extension plugins (#29143) Fix the search request default operation behavior doc (#29302) (#29405) Watcher: Mark watcher as started only after loading watches (#30403) Correct wording in log message (#30336) Do not fail snapshot when deleting a missing snapshotted file (#30332) AwaitsFix testCreateShrinkIndexToN DOCS: Correct mapping tags in put-template api DOCS: Fix broken link in the put index template api Add put index template api to high level rest client (#30400) [Docs] Add snippets for POS stop tags default value Remove entry inadvertently picked into changelog Move respect accept header on no handler to 6.3.1 Respect accept header on no handler (#30383) [Test] Add analysis-nori plugin to the vagrant tests [Rollup] Validate timezone in range queries (#30338) [Docs] Fix bad link [Docs] Fix end of section in the korean plugin docs add the Korean nori plugin to the change logs Expose the Lucene Korean analyzer module in a plugin (#30397) Docs: remove transport_client from CCS role example (#30263) Test: remove cluster permission from CCS user (#30262) Watcher: Remove unneeded index deletion in tests fix docs branch version fix lucene snapshot version Upgrade to 7.4.0-snapshot-1ed95c097b (#30357) [ML][TEST] Clean up jobs in ModelPlotIT Watcher: Ensure trigger service pauses execution (#30363) [DOCS] Fixes ordering of changelog sections [DOCS] Commented out empty sections in the changelog to fix the doc build. (#30372) Make RepositoriesMetaData contents unmodifiable (#30361) Change signature of Get Repositories Response (#30333) 6.x Backport: Terms query validate bug (#30319) InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (#30121) Security: reduce garbage during index resolution (#30180) Test: use trial license in qa tests with security [ML] Add integration test for model plots (#30359) SQL: Fix bug caused by empty composites (#30343) [ML] Account for gaps in data counts after job is reopened (#30294) [ML] Refactor DataStreamDiagnostics to use array (#30129) Make licensing FIPS-140 compliant (#30251) Do not load global state when deleting a snapshot (#29278) Don't load global state when only restoring indices (#29239) Tests: Use different watch ids per test in smoke test (#30331) Watcher: Make start/stop cycle more predictable and synchronous (#30118) [Docs] Add term query with normalizer example Adds Eclipse config for xpack licence headers (#30299) Fix message content in users tool (#30293) [DOCS] Removed X-Pack breaking changes page [DOCS] Added security breaking change [DOCS] Fixes link to TLS LDAP info [DOCS] Merges X-Pack release notes into changelog (#30350) [DOCS] Fixes broken links to bootstrap user (#30349) [Docs] Remove errant changelog line Fix NPE when CumulativeSum agg encounters null/empty bucket (#29641) [DOCS] Reorganizes authentication details in Stack Overview (#30280) Tests: Simplify VersionUtils released version splitting (#30322) Fix merging logic of Suggester Options (#29514) ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316) [DOCS] Adds LDAP realm configuration details (#30214) [DOCS] Adds native realm configuration details (#30215) Disable SSL on testing old BWC nodes (#30337) [DOCS] Enables edit links for X-Pack pages Cancelling a peer recovery on the source can leak a primary permit (#30318) SQL: Reduce number of ranges generated for comparisons (#30267) [DOCS] Adds links to changelog sections Convert server javadoc to html5 (#30279) REST Client: Add Request object flavored methods (#29623) Create default ES_TMPDIR on Windows (#30325) [Docs] Clarify `fuzzy_like_this` redirect (#30183) Fix docs of the `_ignored` meta field. Add a new `_ignored` meta field. (#29658) Move repository-azure fixture test to QA project (#30253)
When deleting or creating a snapshot for a given shard, elasticsearch usually starts by listing all the existing snapshotted files in the repository. Then it computes a diff and deletes the snapshotted files that are not needed anymore. During this deletion, an exception is thrown if the file to be deleted does not exist anymore.
This behavior is challenging with cloud based repository implementations like S3 where a file that has been deleted can still appear in the bucket for few seconds/minutes (because the deletion can take some time to be fully replicated on S3). If the deleted file appears in the listing of files, then the following deletion will fail with a NoSuchFileException and the snapshot will be partially created/deleted.
This pull request makes the deletion of these files a bit less strict, ie not failing if the file we want to delete does not exist anymore. It introduces a new
BlobContainer.deleteIgnoringIfNotExists()
method that can be used at some specific places where not failing when deleting a file is considered harmless.Closes #28322