[Snapshot Interop][Bug Fix] Ensure lock file gets deleted if snapshot … #12016

Open
wants to merge 1 commit into
base: main
from

Conversation

harishbhakuni
Contributor

@harishbhakuni harishbhakuni commented Jan 25, 2024

...shard md upload fails during snapshot creation.

Description

  • While working on another change, I realized that, in the following code, everything below the snapshotRemoteStoreIndexShard call was not getting executed:

    try {
        repository.snapshotRemoteStoreIndexShard(
            indexShard.store(),
            snapshot.getSnapshotId(),
            indexId,
            snapshotIndexCommit,
            getShardStateId(indexShard, snapshotIndexCommit),
            snapshotStatus,
            primaryTerm,
            startTime,
            ActionListener.runBefore(listener, wrappedSnapshot::close)
        );
    } catch (IndexShardSnapshotFailedException e) {
        logger.error(
            "Shallow Copy Snapshot Failed for Shard ["
                + indexId.getName()
                + "]["
                + shardId.getId()
                + "] for snapshot "
                + snapshot.getSnapshotId()
                + ", releasing acquired lock from remote store"
        );
        indexShard.releaseLockOnCommitData(snapshot.getSnapshotId().getUUID(), primaryTerm, commitGeneration);
        throw e;
    }
    long endTime = threadPool.relativeTimeInMillis();
    logger.debug(
        "Time taken (in milliseconds) to complete shallow copy snapshot, "
            + "for index "
            + indexId.getName()
            + ", shard "
            + shardId.getId()
            + " and snapshot "
            + snapshot.getSnapshotId()
            + " is "
            + (endTime - startTime)
    );

  • that was causing an issue where if shard md upload to snapshot repository fails, it will not release the lock file from S3.

  • As part of this PR, I am making sure those lines of code get executed even if the shard md upload fails, and adding an IT to cover that scenario.

  • Also adding a log line to print the time taken by the flush operation during snapshot creation, plus a minor change to filter lock files by the lock suffix when fetching lock files from the lock directory.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@harishbhakuni harishbhakuni self-assigned this Jan 25, 2024
Contributor

github-actions bot commented Jan 25, 2024

Compatibility status:

Checks if related components are compatible with change 197ccfd

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/k-nn.git]

Contributor

❌ Gradle check result for b8ec898: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@harishbhakuni harishbhakuni changed the title [Snapshot Interop][Bug Fix] Make sure lock file gets deleted if snaps… [Snapshot Interop][Bug Fix] Ensure lock file gets deleted if snapshot … Jan 25, 2024
@harishbhakuni
Contributor Author

The test org.opensearch.search.sort.FieldSortIT.testSimpleSorts {p0={"search.concurrent_segment_search.enabled":"true"}} is failing. It is flaky and is being tracked as part of #11875.

@opensearch-trigger-bot opensearch-trigger-bot bot added and removed the stalled label Apr 8, 2024
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added and removed the stalled label May 9, 2024
Contributor

github-actions bot commented Jun 4, 2024

❌ Gradle check result for 4cdd33e: FAILURE

Contributor

github-actions bot commented Jun 4, 2024

❌ Gradle check result for 203e404: FAILURE

Contributor

github-actions bot commented Jun 5, 2024

❌ Gradle check result for 184132e: FAILURE

Contributor

github-actions bot commented Jun 6, 2024

❌ Gradle check result for 1ebbc06: FAILURE


…hot shard md upload fails during snapshot creation.

Signed-off-by: Harish Bhakuni <hbhakuni@amazon.com>
Contributor

github-actions bot commented Jun 6, 2024

❌ Gradle check result for 57799ac: FAILURE


@gbbafna
Collaborator

gbbafna commented Jun 11, 2024

"that was causing an issue where if shard md upload to snapshot repository fails, it will not release the lock file from S3." - How is that happening ?

} catch (IndexShardSnapshotFailedException e) {
    logger.error(
        "Shallow Copy Snapshot Failed for Shard ["
            + indexId.getName()
            + "]["
            + shardId.getId()
            + "] for snapshot "
            + snapshot.getSnapshotId()
            + ", releasing acquired lock from remote store"
    );
    indexShard.releaseLockOnCommitData(snapshot.getSnapshotId().getUUID(), primaryTerm, commitGeneration);
    throw e;
}
long endTime = threadPool.relativeTimeInMillis();
logger.debug(
    "Time taken (in milliseconds) to complete shallow copy snapshot, "
        + "for index "
        + indexId.getName()
        + ", shard "
        + shardId.getId()
        + " and snapshot "
        + snapshot.getSnapshotId()
        + " is "
        + (endTime - startTime)
);

The above block is doing exactly that

@harishbhakuni
Contributor Author

The above block is doing exactly that

Not really. The problem was that the called method, snapshotRemoteStoreIndexShard, was catching the exception and passing it to listener.onFailure(), so the catch block where we release the lock was never executed. I just moved that code into the onFailure() method so that it gets executed.
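
To illustrate the behaviour described in this comment, here is a minimal, self-contained sketch. It does not use the real OpenSearch classes: the Listener interface, uploadShardMetadata, releaseLockOnFailure and the recorded event strings are all hypothetical stand-ins. It only shows why a catch block around a callee that reports errors through a listener never runs, and how wrapping the listener moves the cleanup onto the failure path.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are real OpenSearch types.
class LockReleaseSketch {

    /** Minimal callback interface, loosely modeled on an asynchronous listener. */
    interface Listener {
        void onResponse(String result);
        void onFailure(Exception e);
    }

    /** Records what happened so the control flow can be inspected. */
    static final List<String> events = new ArrayList<>();

    /** Callee that does not propagate the failure: it hands it to the listener instead. */
    static void uploadShardMetadata(boolean fail, Listener listener) {
        try {
            if (fail) {
                throw new IllegalStateException("shard metadata upload failed");
            }
            listener.onResponse("ok");
        } catch (Exception e) {
            listener.onFailure(e); // the exception stops here and is not rethrown
        }
    }

    /** Wraps a listener so that cleanup runs on the failure path before delegating. */
    static Listener releaseLockOnFailure(Listener delegate) {
        return new Listener() {
            @Override
            public void onResponse(String result) {
                delegate.onResponse(result);
            }

            @Override
            public void onFailure(Exception e) {
                events.add("lock released"); // stands in for the real lock release
                delegate.onFailure(e);
            }
        };
    }

    /** Runs one upload attempt and returns the recorded events in order. */
    static List<String> run(boolean fail) {
        events.clear();
        Listener base = new Listener() {
            @Override
            public void onResponse(String result) {
                events.add("response: " + result);
            }

            @Override
            public void onFailure(Exception e) {
                events.add("failure: " + e.getMessage());
            }
        };
        try {
            uploadShardMetadata(fail, releaseLockOnFailure(base));
        } catch (Exception e) {
            events.add("caller catch block"); // not reached: the callee never throws
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // [lock released, failure: shard metadata upload failed]
        System.out.println(run(false)); // [response: ok]
    }
}
```

In this sketch the string "caller catch block" is never recorded, which mirrors the point above: cleanup that must run on failure has to live in the listener's failure callback when the callee reports errors that way.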

+ snapshot.getSnapshotId()
+ " is "
+ (endTime - startTime)
GatedCloseable<IndexCommit> finalWrappedSnapshot = wrappedSnapshot;
Collaborator


We should throw IndexShardSnapshotFailedException in addition to calling listener.onFailure(e);

return new String[0];
}
// filtering lock files from lock directory contents.
// this is a good to have check, there is no known prod scenarios where this can happen
Collaborator


nit: let's not use the word "prod". Also, a user can create extra files in the lock folder.
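
For reference, a suffix-based filter of the kind discussed in this thread could look roughly like the sketch below. The class name, the method name, the ".lock" suffix value and the sample file names are assumptions made for illustration; they are not taken from the PR diff.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper; the suffix value and the file names are assumptions.
class LockFileFilterSketch {

    // Assumed suffix for illustration; the real constant may differ.
    static final String LOCK_FILE_SUFFIX = ".lock";

    /** Returns only the entries that end with the lock suffix, preserving order. */
    static String[] filterLockFiles(Collection<String> directoryContents) {
        if (directoryContents == null || directoryContents.isEmpty()) {
            return new String[0];
        }
        return directoryContents.stream()
            .filter(name -> name.endsWith(LOCK_FILE_SUFFIX))
            .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // A lock directory that also contains a file someone created by hand.
        List<String> contents = List.of("segments_5___uuid-1.lock", "notes.txt", "segments_6___uuid-2.lock");
        System.out.println(Arrays.toString(filterLockFiles(contents)));
    }
}
```

Filtering by suffix keeps stray files, such as ones a user placed in the lock folder, from being treated as locks.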


3 participants