Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: fix region scatter may transfer leader to removed peer #1482

Merged
merged 11 commits into from
Apr 2, 2019

Conversation

nolouch
Copy link
Contributor

@nolouch nolouch commented Mar 27, 2019

Signed-off-by: nolouch nolouch@gmail.com

What problem does this PR solve?

Somtimes region scatter may timeout, it cause by:

  • Leader already changed but scatter-region step still use old leader
  • Leader may transfer to a removed peer

log like:

[region 2109921] operator timeout: scatter-region (kind:leader,region,admin, region:2109921(5233,677), createAt:2019-03-26 14:29:14.885583251 +0800 CST m=+1888726.900848985, currentStep:10, steps:[add learner peer 2109925 on store 5 promote learner peer 2109925 on store 5 to voter remove peer on store 8 transfer leader from store 0 to store 5 add learner peer 2109926 on store 6 promote learner peer 2109926 on store 6 to voter remove peer on store 4 transfer leader from store 0 to store 6 add learner peer 2109927 on store 1 promote learner peer 2109927 on store 1 to voter transfer leader from store 7 to store 8 remove peer on store 7 transfer leader from store 0 to store 1]) timeout

add  store 5
remove store 8
transfer to 5
add store 6
remove store 4
transfer to 6
add  store 1 
transfer from 7 to 8
remove  store 7
transfer to 1

the origin leader is on store 7.

What is changed and how it works?

Fix the operator create logic.

Check List

Tests

  • Unit test

@nolouch nolouch added type/bug The issue is confirmed as a bug. needs-cherry-pick-release-2.0 The PR needs to cherry pick to release-2.0 branch. needs-cherry-pick-release-2.1 The PR needs to cherry pick to release-2.1 branch. and removed needs-cherry-pick-release-2.0 The PR needs to cherry pick to release-2.0 branch. labels Mar 27, 2019
Signed-off-by: nolouch <nolouch@gmail.com>
@disksing
Copy link
Contributor

@nolouch The log you provided does not transfer leader to removed peer. The steps are:

- add 5
- transfer 2
- remove 1
- transfer 5
- add 4
- remove 2
- transfer 4

@nolouch nolouch changed the title scheduler: fix region scatter may transfer leader to removed peer scheduler: fix region scatter may transfer leader from removed peer Mar 27, 2019
@nolouch nolouch changed the title scheduler: fix region scatter may transfer leader from removed peer scheduler: fix region scatter may transfer leader to removed peer Mar 27, 2019
@nolouch
Copy link
Contributor Author

nolouch commented Mar 27, 2019

@disksing I update the log, this one is pick from test cluster.

@rleungx
Copy link
Member

rleungx commented Mar 28, 2019

@nolouch CI failed.

@nolouch nolouch added the DNM label Mar 28, 2019
Signed-off-by: nolouch <nolouch@gmail.com>
@nolouch nolouch removed the DNM label Apr 1, 2019
Signed-off-by: nolouch <nolouch@gmail.com>
Signed-off-by: nolouch <nolouch@gmail.com>
@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (master@56f4100). Click here to learn what that means.
The diff coverage is 78.02%.

Impacted file tree graph

@@           Coverage Diff            @@
##             master   #1482   +/-   ##
========================================
  Coverage          ?   67.4%           
========================================
  Files             ?     158           
  Lines             ?   15454           
  Branches          ?       0           
========================================
  Hits              ?   10417           
  Misses            ?    4092           
  Partials          ?     945
Impacted Files Coverage Δ
server/grpc_service.go 53.75% <0%> (ø)
server/schedule/mockcluster.go 83.41% <0%> (ø)
server/handler.go 55.58% <33.33%> (ø)
server/schedule/operator.go 86.26% <68.42%> (ø)
server/schedule/region_scatterer.go 88.14% <90.47%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56f4100...4bf7f45. Read the comment docs.

@nolouch
Copy link
Contributor Author

nolouch commented Apr 1, 2019

PTAL @disksing @rleungx

steps := make([]OperatorStep, 0, len(targetPeers)*2+1)
deferSteps := make([]OperatorStep, 0, 2)
var kind OperatorKind
sameLeader := targetLeaderPeer.GetStoreId() == origin.GetLeader().GetStoreId()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a variable for origin.GetLeader().GetStoreId()


storeIDs := origin.GetStoreIds()
steps := make([]OperatorStep, 0, len(targetPeers)*2+1)
deferSteps := make([]OperatorStep, 0, 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to add some comments to explain why we put these steps in the end.

Signed-off-by: nolouch <nolouch@gmail.com>
@nolouch nolouch merged commit 2d3e5fa into tikv:master Apr 2, 2019
@nolouch nolouch deleted the fix-scatter branch April 2, 2019 06:47
nolouch added a commit to nolouch/pd that referenced this pull request Apr 2, 2019
…kv#1482)

* scheduler: fix region scatter may transfer leader to removed peer

Signed-off-by: nolouch <nolouch@gmail.com>
nolouch added a commit that referenced this pull request Apr 3, 2019
* scheduler: fix region scatter may transfer leader to removed peer (#1482)

* scheduler: fix region scatter may transfer leader to removed peer

Signed-off-by: nolouch <nolouch@gmail.com>

* fix get hot store (#1487)

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-cherry-pick-release-2.1 The PR needs to cherry pick to release-2.1 branch. type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants