[Remote Store] Fix shard failure on flush due to upload timeout #10926

Merged
merged 3 commits into from
Oct 26, 2023

ashking94 (Member) commented Oct 25, 2023

Description

If connectivity issues with the configured remote store (or degraded remote store uploads) cause translog uploads to take longer than 30s, a TimeoutException is thrown. This exception is later cast to RuntimeException, which fails with a ClassCastException and ultimately fails the shard.
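
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the code from this PR or from TranslogTransferManager: the class `UploadTimeoutSketch` and the methods `waitForUpload`, `blindCast`, and `wrapIfNeeded` are hypothetical names made up for illustration. It only shows why casting a checked `java.util.concurrent.TimeoutException` to `RuntimeException` throws `ClassCastException`, and one possible way to avoid that by wrapping instead of casting.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class UploadTimeoutSketch {

    // Hypothetical stand-in for a slow remote upload: the future never
    // completes, so get(...) with a timeout throws TimeoutException.
    static Exception waitForUpload(long timeoutMillis) {
        CompletableFuture<Void> upload = new CompletableFuture<>();
        try {
            upload.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return null; // upload finished in time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return e;
        } catch (ExecutionException | TimeoutException e) {
            return e;
        }
    }

    // Problematic pattern: assumes every failure is a RuntimeException.
    // TimeoutException is a checked exception, so this cast fails at runtime.
    static RuntimeException blindCast(Exception e) {
        return (RuntimeException) e;
    }

    // One possible safer pattern (an assumption, not the PR's actual fix):
    // pass RuntimeExceptions through and wrap everything else.
    static RuntimeException wrapIfNeeded(Exception e) {
        if (e instanceof RuntimeException) {
            return (RuntimeException) e;
        }
        return new RuntimeException("upload did not complete", e);
    }

    public static void main(String[] args) {
        Exception failure = waitForUpload(10);
        System.out.println("upload failed with: " + failure.getClass().getName());

        try {
            blindCast(failure);
        } catch (ClassCastException cce) {
            System.out.println("blind cast threw: " + cce.getClass().getName());
        }

        RuntimeException wrapped = wrapIfNeeded(failure);
        System.out.println("wrapped cause: " + wrapped.getCause().getClass().getName());
    }
}
```

With wrapping, the original timeout stays available as the cause, so callers can still tell an upload timeout apart from other failures.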

Related Issues

Resolves #10924

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
    - [x] Commit changes are listed out in CHANGELOG.md file (See: Changelog)
    - [x] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added bug Something isn't working Storage:Durability Issues and PRs related to the durability framework Storage:Remote v2.12.0 Issues and PRs related to version 2.12.0 labels Oct 25, 2023
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Oct 25, 2023

Compatibility status:

Checks if related components are compatible with change 310277c

Incompatible components

Incompatible components: [https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git]

@codecov
Copy link

codecov bot commented Oct 25, 2023

Codecov Report

Merging #10926 (310277c) into main (44a9f18) will decrease coverage by 0.03%.
Report is 3 commits behind head on main.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main   #10926      +/-   ##
============================================
- Coverage     71.22%   71.20%   -0.03%     
+ Complexity    58721    58720       -1     
============================================
  Files          4872     4872              
  Lines        276682   276683       +1     
  Branches      40219    40219              
============================================
- Hits         197070   197013      -57     
- Misses        63163    63222      +59     
+ Partials      16449    16448       -1     
Files Coverage Δ
...dex/translog/transfer/TranslogTransferManager.java 83.33% <100.00%> (+3.95%) ⬆️

... and 448 files with indirect coverage changes

Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Ashish Singh <ssashish@amazon.com>
Signed-off-by: Ashish Singh <ssashish@amazon.com>

@mch2
Copy link
Member

mch2 commented Oct 25, 2023

Gradle Check (Jenkins) Run Completed with:

#10006

@mch2 mch2 merged commit fe8b2d5 into opensearch-project:main Oct 26, 2023
17 checks passed
@mch2 mch2 added the backport 2.x Backport to 2.x branch label Oct 26, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 26, 2023
(cherry picked from commit fe8b2d5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
gbbafna pushed a commit that referenced this pull request Oct 26, 2023
…) (#10939)

(cherry picked from commit fe8b2d5)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
Labels
backport 2.x Backport to 2.x branch bug Something isn't working skip-changelog Storage:Durability Issues and PRs related to the durability framework Storage:Remote v2.12.0 Issues and PRs related to version 2.12.0
Development

Successfully merging this pull request may close these issues.

[BUG] Shard fails on flush due to translog upload taking longer than 30s
3 participants