
Fix tikv scale in failure in some cases #726

Merged: 6 commits into pingcap:master on Aug 8, 2019

Conversation

onlymellb (Contributor)

What problem does this PR solve?

This PR fixes #725

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Manual test (add detailed scripts or steps below)

Code changes

  • Has Go code change

Side effects

Related changes

  • Need to cherry-pick to the release branch

Does this PR introduce a user-facing change?:

Fix TiKV scale-in failure in some cases after TiKV failover

@onlymellb (Contributor, Author)

/run-e2e-tests

@onlymellb (Contributor, Author)

/run-e2e-tests

@onlymellb (Contributor, Author)

/run-e2e-tests

//
// 2. This can happen when the TiKV pod has not been successfully registered in the cluster, e.g. it stays Pending forever.
// In this situation we should delete this TiKV pod immediately to avoid blocking subsequent operations.
if !podutil.IsPodReady(pod) {
Contributor:

I think we should only handle Pending pods rather than all not-ready pods.

onlymellb (Contributor, Author):

Not only Pending. For example, consider this situation: the newly launched pod keeps crashing and never really joins the TiDB cluster. In this case, we can also scale in safely.
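
For context, the check in the diff above uses podutil.IsPodReady; the snippet below is an illustrative, self-contained re-implementation (not the operator's actual helper) showing why readiness covers both the Pending and the crash-looping cases:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isPodReady reports whether the pod's PodReady condition is True.
// A pod that stays Pending or keeps crash-looping never reaches this state,
// so checking readiness covers more than just the Pending phase.
func isPodReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

func main() {
	pending := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}}
	fmt.Println(isPodReady(pending)) // false: never registered a TiKV store, safe to delete on scale-in
}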

@gregwebs (Contributor) commented on Aug 2, 2019

This logic is not so much scale-in as it is updating the status of TiKV pods (and then, when in scale-in mode, deciding what to do with that status). One reason the logic is placed here is to avoid returning an error and halting the sync process. However, in my PR #581, raising a RequeueError does not stop the sync loop, so it is possible to do something similar here.

Note that my PR incidentally fixed a few minor bugs that showed up in test cases because it deals with halting the sync at the right time.
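
As a minimal sketch of the requeue pattern being referenced (the names below are illustrative and not necessarily the exact identifiers used in tidb-operator or PR #581), a requeue-style error lets the controller retry on the next sync without treating the condition as a hard failure:

package main

import (
	"errors"
	"fmt"
)

// requeueError marks a condition that should trigger another sync round
// rather than being reported as a hard failure.
type requeueError struct{ msg string }

func (e *requeueError) Error() string { return e.msg }

func isRequeueError(err error) bool {
	var re *requeueError
	return errors.As(err, &re)
}

// syncOnce simulates one pass of the controller's sync loop.
func syncOnce() error {
	// ... other sync steps run first; then a condition is found that only
	// needs a retry, not an error event ...
	return &requeueError{msg: "tikv store not yet tombstone, requeue"}
}

func main() {
	if err := syncOnce(); err != nil {
		if isRequeueError(err) {
			fmt.Println("requeue:", err) // re-enqueue the key and try again later
			return
		}
		fmt.Println("sync failed:", err)
	}
}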

@onlymellb (Contributor, Author)

Here we need to scale in immediately when it can be done safely, instead of raising a RequeueError and waiting for the next round of sync. Consider this situation: if the TiKV pod has been Pending or unhealthy, it has never really joined the TiDB cluster during this period. That leads to an inability to scale in, no matter how many sync loops have passed. Refer to issue #725.
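
In other words, the scale-in decision has to distinguish a member that is serving data from a pod that never joined PD at all. A rough sketch of that decision follows (assumed helper and pod names, not the operator's actual code):

package main

import "fmt"

// store is a minimal stand-in for a PD store record (illustrative only).
type store struct {
	PodName string
	State   string // "Up", "Offline", "Tombstone", ...
}

// canScaleInDirectly reports whether the outgoing TiKV pod can be removed
// right away: it is not ready and has no store registered in PD, so a
// DeleteStore/Tombstone cycle could never complete for it no matter how
// many sync loops run.
func canScaleInDirectly(podReady bool, stores map[string]store, podName string) bool {
	_, registered := stores[podName]
	return !podReady && !registered
}

func main() {
	stores := map[string]store{
		"db-tikv-0": {PodName: "db-tikv-0", State: "Up"},
	}
	// db-tikv-3 is a failover replacement that stayed Pending and never joined PD.
	fmt.Println(canScaleInDirectly(false, stores, "db-tikv-3")) // true: delete the pod and scale in immediately
	fmt.Println(canScaleInDirectly(true, stores, "db-tikv-0"))  // false: go through DeleteStore and wait for Tombstone
}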

@weekface (Contributor) left a comment


@onlymellb (Contributor, Author)

/run-e2e-tests

@xiaojingchen merged commit 799930d into pingcap:master on Aug 8, 2019
@sre-bot (Contributor) commented on Aug 8, 2019

cherry pick to release-1.0 in PR #742

@onlymellb deleted the fix-tikv-scale-in-failed branch on August 8, 2019 at 07:39

Successfully merging this pull request may close this issue: tikv failover may cause scale in failure in some cases