This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

tests: fix br_tikv_outage learner bug #1256

Merged
merged 6 commits into from
Jun 25, 2021

Conversation

Collaborator

@Leavrth Leavrth commented Jun 18, 2021

What problem does this PR solve?

Fixes issue #1050.
In the case outage-after-request, a voter peer on tikv1 is removed, and then a learner peer is added on tikv1.
Shortly afterwards, the next case, outage-at-finegrained (which does not restart TiKV while the backup is running), kills tikv3.
That leaves only 1 voter peer and 1 learner peer available for region 257, so the test gets stuck.
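
To illustrate why that peer layout cannot make progress, here is a minimal sketch of a Raft-style majority check. It assumes the usual rule that learners replicate the log but do not count toward the quorum. The `Peer` type and `has_quorum` function are hypothetical names used only for this illustration; they are not taken from the BR or TiKV code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Peer:
    """Hypothetical model of one region peer, for illustration only."""
    store: str
    is_learner: bool
    alive: bool = True


def has_quorum(peers: List[Peer]) -> bool:
    """Return True if a majority of the voter peers is alive.

    Assumption: learners receive log entries but are ignored when
    counting the majority, as in standard Raft learner semantics.
    """
    voters = [p for p in peers if not p.is_learner]
    live_voters = [p for p in voters if p.alive]
    return len(live_voters) >= len(voters) // 2 + 1


# Layout described above: tikv1 holds a learner, tikv3 has been killed.
region_257 = [
    Peer("tikv1", is_learner=True),
    Peer("tikv2", is_learner=False),
    Peer("tikv3", is_learner=False, alive=False),
]

# For comparison: three voters with one store down still have a majority.
three_voters_one_down = [
    Peer("tikv1", is_learner=False),
    Peer("tikv2", is_learner=False),
    Peer("tikv3", is_learner=False, alive=False),
]
```

Under these assumptions, `has_quorum(region_257)` is false because only one of the two voters is alive, while `has_quorum(three_voters_one_down)` is true.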

What is changed and how it works?

The case outage-after-request appears to adversely affect the case that follows it, so this PR swaps the order of these two cases.

Check List

Related changes

  • Need to cherry-pick to the release branch

Release note

- No release note

@Leavrth Leavrth marked this pull request as draft June 18, 2021 10:04
Collaborator Author

Leavrth commented Jun 18, 2021

/run-integration-test

Collaborator Author

Leavrth commented Jun 18, 2021

/run-integration-test

Collaborator Author

Leavrth commented Jun 19, 2021

/run-integration-test

Collaborator Author

Leavrth commented Jun 21, 2021

/run-integration-test

Collaborator Author

Leavrth commented Jun 21, 2021

/run-integration-test

Collaborator Author

Leavrth commented Jun 21, 2021

/run-integration-test

Collaborator Author

Leavrth commented Jun 21, 2021

/run-integration-test

Collaborator Author

Leavrth commented Jun 21, 2021

/run-integration-test

Collaborator Author

Leavrth commented Jun 21, 2021

@YuJuncen PTAL

@Leavrth Leavrth linked an issue Jun 21, 2021 that may be closed by this pull request
@Leavrth Leavrth marked this pull request as ready for review June 21, 2021 03:51
Collaborator Author

Leavrth commented Jun 21, 2021

/run-integration-test

Co-authored-by: kennytm <kennytm@gmail.com>
@ti-chi-bot ti-chi-bot added the status/LGT1 LGTM1 label Jun 23, 2021
Collaborator

glorv commented Jun 24, 2021

This sounds weird to me. As I remember, the scripts recover the cluster to a normal state before running each case, so cases shouldn't have side effects on each other. Are we sure this is the root cause? This case has already been fixed several times 😭. Is this the last time?

Collaborator Author

Leavrth commented Jun 25, 2021

> This sounds weird to me. As I remember, the scripts recover the cluster to a normal state before running each case, so cases shouldn't have side effects on each other. Are we sure this is the root cause? This case has already been fixed several times 😭. Is this the last time?

This test has many bugs, any of which can make it fail, and I think this is the last one.
In fact, the recovery step in the scripts only makes sure that the cluster has been initialized before each case runs. (That was fixed in another PR, #1160.)
As shown above, even after the killed node has been restarted and initialized, some peers on that node may not yet have synchronized with their counterparts on the other nodes, so they remain in the learner state.
If another node is killed at this point, these learner peers may fail to catch up, which makes the test time out.
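
The condition described here, peers that are still learners after the cluster reports itself initialized, could in principle be checked explicitly before another node is killed. The following is only a sketch of such a check and is not what this PR does (the PR swaps the case order). The region snapshot shape (a list of dicts with `id` and `peers`, where a learner peer carries `is_learner: true`) is an assumption for illustration, and `fetch_regions` is a hypothetical caller-supplied function, for example a wrapper around whatever region listing the test environment exposes.

```python
import time
from typing import Callable, Iterable, List


def regions_with_learners(regions: Iterable[dict]) -> List[int]:
    """Return the IDs of regions that still contain a learner peer.

    Assumed (hypothetical) shape of each region:
    {"id": 257, "peers": [{"store_id": 1, "is_learner": True}, ...]}
    """
    pending = []
    for region in regions:
        peers = region.get("peers", [])
        if any(peer.get("is_learner", False) for peer in peers):
            pending.append(region["id"])
    return pending


def wait_for_no_learners(
    fetch_regions: Callable[[], Iterable[dict]],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no region has a learner peer, or until the timeout expires.

    Returns True if all learners were promoted in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        if not regions_with_learners(fetch_regions()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test script could call `wait_for_no_learners(...)` between cases and only proceed to kill the next node once it returns true. Whether that is preferable to reordering the cases depends on how long promotion takes in the test environment.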

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • glorv
  • kennytm

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added status/LGT2 LGTM2 and removed status/LGT1 LGTM1 labels Jun 25, 2021
Collaborator

kennytm commented Jun 25, 2021

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 0ce3aa1

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #1281.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #1282.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #1283.

Development

Successfully merging this pull request may close these issues.

br_tikv_outage sometimes got stuck
4 participants