online recovery: fix online recovery timeout mechanism #6108

Merged on Mar 8, 2023 (4 commits)

Conversation

Member

@Connor1996 Connor1996 commented Mar 7, 2023

What problem does this PR solve?

Issue Number: Close #6107

What is changed and how does it work?

fix online recovery timeout mechanism

Check List

Tests

  • Unit test

Release note

Fix the issue that online recovery timeout may not work as expected

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@ti-chi-bot
Member

ti-chi-bot commented Mar 7, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • nolouch
  • rleungx

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed labels Mar 7, 2023
@ti-chi-bot ti-chi-bot added needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. and removed do-not-merge/needs-triage-completed labels Mar 7, 2023
@Connor1996
Member Author

PTAL @v01dstar

@v01dstar
Contributor

v01dstar commented Mar 7, 2023

Can you please explain a little more about what the bug is? I can tell that with this change, the whole process will exit faster when a timeout happens (the existing code also exits after a timeout, I believe?). Besides, I think that with this change, we may leave some regions in the exit force leader state when a timeout happens?

@Connor1996
Member Author

Connor1996 commented Mar 8, 2023

> Can you please explain a little more about what the bug is? I can tell that with this change, the whole process will exit faster when a timeout happens (the existing code also exits after a timeout, I believe?). Besides, I think that with this change, we may leave some regions in the exit force leader state when a timeout happens?

Suppose that one TiKV always sends store heartbeats but, for some reason, never includes a store report. In the existing implementation, the timeout would never be triggered, and the recovery would stay in the collecting stage forever.

@v01dstar
Contributor

v01dstar commented Mar 8, 2023

> Suppose that one TiKV always sends store heartbeats but, for some reason, never includes a store report. In the existing implementation, the timeout would never be triggered, and the recovery would stay in the collecting stage forever.

I think the existing code still exits? Just with a longer wait time: timeout + 2 * store_heartbeat_interval. Am I missing something here?

@Connor1996
Member Author

Connor1996 commented Mar 8, 2023

> I think the existing code still exits? Just with a longer wait time: timeout + 2 * store_heartbeat_interval. Am I missing something here?

Please check checkTimeout in the existing code: it never changes the stage to failed unless the controller is already in the exit force leader stage.

// blocks reads and writes.
u.storePlanExpires = make(map[uint64]time.Time)
u.storeRecoveryPlans = make(map[uint64]*pdpb.RecoveryPlan)
u.timeout = time.Now().Add(storeRequestInterval)
Contributor

Isn't one heartbeat interval too aggressive? Maybe use storeRequestInterval * 2 to make it more stable?

@ti-chi-bot
Member

@v01dstar: Thanks for your review. The bot only counts approvals from reviewers and higher roles in list, but you're still welcome to leave your comments.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@codecov

codecov bot commented Mar 8, 2023

Codecov Report

Patch coverage: 87.91% and project coverage change: +0.08 🎉

Comparison is base (b3e7a76) 74.03% compared to head (22eec8a) 74.12%.

❗ Current head 22eec8a differs from pull request most recent head 8daf7c9. Consider uploading reports for the commit 8daf7c9 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6108      +/-   ##
==========================================
+ Coverage   74.03%   74.12%   +0.08%     
==========================================
  Files         385      385              
  Lines       37952    37952              
==========================================
+ Hits        28099    28131      +32     
+ Misses       7377     7353      -24     
+ Partials     2476     2468       -8     
| Flag | Coverage Δ |
|---|---|
| unittests | 74.12% <87.91%> (+0.08%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ |
|---|---|
| server/cluster/unsafe_recovery_controller.go | 82.62% <87.91%> (+3.91%) ⬆️ |
| pkg/errs/errs.go | 75.00% <0.00%> (-25.00%) ⬇️ |
| pkg/utils/tempurl/tempurl.go | 60.00% <0.00%> (-10.00%) ⬇️ |
| ...erver/config/service_middleware_persist_options.go | 91.66% <0.00%> (-8.34%) ⬇️ |
| pkg/tso/local_allocator.go | 65.78% <0.00%> (-6.58%) ⬇️ |
| client/resource_group/controller/limiter.go | 61.25% <0.00%> (-6.25%) ⬇️ |
| server/region_syncer/server.go | 82.96% <0.00%> (-4.40%) ⬇️ |
| server/schedule/schedulers/random_merge.go | 62.50% <0.00%> (-4.17%) ⬇️ |
| server/schedule/labeler/labeler.go | 76.77% <0.00%> (-2.59%) ⬇️ |
| pkg/election/lease.go | 84.05% <0.00%> (-1.45%) ⬇️ |

... and 20 more


Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 8, 2023
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 8, 2023
@nolouch
Contributor

nolouch commented Mar 8, 2023

/merge

@ti-chi-bot
Member

@nolouch: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 22eec8a

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 8, 2023
@ti-chi-bot
Member

@Connor1996: Your PR was out of date, I have automatically updated it for you.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.1: #6111.

ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Mar 8, 2023
close tikv#6107

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #6112.

v01dstar pushed a commit to v01dstar/pd that referenced this pull request Mar 29, 2023
close tikv#6107

fix online recovery timeout mechanism

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>
ti-chi-bot added a commit that referenced this pull request Mar 30, 2023
close #6107, ref #6108, ref #6111

fix online recovery timeout mechanism

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Signed-off-by: Yang Zhang <yang.zhang@pingcap.com>

Co-authored-by: Connor <zbk602423539@gmail.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this pull request Mar 31, 2023
close #6107, ref #6108

fix online recovery timeout mechanism

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Co-authored-by: Connor1996 <zbk602423539@gmail.com>
Labels
  • needs-cherry-pick-release-6.1: Should cherry pick this PR to release-6.1 branch.
  • needs-cherry-pick-release-6.5: Should cherry pick this PR to release-6.5 branch.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • status/can-merge: Indicates a PR has been approved by a committer.
  • status/LGT2: Indicates that a PR has LGTM 2.
Development

Successfully merging this pull request may close these issues.

Online recovery timeout doesn't work
5 participants