server: try to replace the down replicas instead of removing directly #1222

Merged 6 commits into tikv:master on Sep 4, 2018

Conversation

rleungx
Member

@rleungx commented Aug 29, 2018

What problem does this PR solve?

At present, if a store is down, the server removes the peers on that store directly and then makes up the replicas elsewhere. This usually takes some time because it has to traverse all the regions.

What is changed and how it works?

This PR tries to move the peers to another store when the original store is down.
It also adds a case to the simulator to cover this situation. Without these changes, the case prints the error log "making up replicas don't start immediately".

Check List

Tests

  • Unit test
  • Integration test
  • Manual test

if down && sum < 1200 {
	// only need to print once
	down = false
	simutil.Logger.Error("making up replicas don't start immediately")
}
Contributor

Can we abstract a statistic to evaluate different strategies instead of just printing a log message? For example, we could count the number of regions with insufficient replicas during the scheduling process.
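
A rough sketch of that idea (the function name and the way the simulator exposes its regions are assumptions, not this repository's API):

// Count regions that currently have fewer peers than the configured number
// of replicas; sampling this over time would let different strategies be
// compared instead of relying on a single log line.
func countInsufficientRegions(regions []*core.RegionInfo, maxReplicas int) int {
	insufficient := 0
	for _, region := range regions {
		if len(region.GetPeers()) < maxReplicas {
			insufficient++
		}
	}
	return insufficient
}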

@@ -170,7 +172,8 @@ func (r *ReplicaChecker) checkDownPeer(region *core.RegionInfo) *Operator {
 		if stats.GetDownSeconds() < uint64(r.cluster.GetMaxStoreDownTime().Seconds()) {
 			continue
 		}
-		return CreateRemovePeerOperator("removeDownReplica", r.cluster, OpReplica, region, peer.GetStoreId())
+		return r.handleReplica(region, peer, "Down")
Contributor

Maybe replacePeer or fixPeer.

@@ -251,3 +232,33 @@ func (r *ReplicaChecker) checkBestReplacement(region *core.RegionInfo) *Operator
 	checkerCounter.WithLabelValues("replica_checker", "new_operator").Inc()
 	return CreateMovePeerOperator("moveToBetterLocation", r.cluster, region, OpReplica, oldPeer.GetStoreId(), newPeer.GetStoreId(), newPeer.GetId())
 }
+
+func (r *ReplicaChecker) handleReplica(region *core.RegionInfo, peer *metapb.Peer, status string) *Operator {
+	removeExtra := fmt.Sprintf("%s%s%s", "removeExtra", status, "Replica")
Contributor

How about fmt.Sprintf("removeExtra%sReplica", status)?
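
Both forms produce the same label; the suggestion is purely about readability. For example:

status := "Down"
fmt.Println(fmt.Sprintf("%s%s%s", "removeExtra", status, "Replica")) // removeExtraDownReplica
fmt.Println(fmt.Sprintf("removeExtra%sReplica", status))             // removeExtraDownReplica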

@@ -251,3 +232,33 @@ func (r *ReplicaChecker) checkBestReplacement(region *core.RegionInfo) *Operator
 	checkerCounter.WithLabelValues("replica_checker", "new_operator").Inc()
 	return CreateMovePeerOperator("moveToBetterLocation", r.cluster, region, OpReplica, oldPeer.GetStoreId(), newPeer.GetStoreId(), newPeer.GetId())
 }
+
+func (r *ReplicaChecker) fixPeer(region *core.RegionInfo, peer *metapb.Peer, status string) *Operator {
Contributor

What does fixPeer mean?

Member Author

It means this peer needs to be removed so that it can be made up on another store, or simply be replaced. Before addressing the earlier comment the name was handleReplica. There could be a better name for it.
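
As an illustration only, not the exact code in this PR, a fixPeer-style function could look roughly like this; selectReplacementPeer and the extra-replica check are hypothetical stand-ins for whatever the real implementation uses:

// Sketch of the idea: if the region has more replicas than configured,
// drop the peer; otherwise replace it by moving it to another store so
// the make-up does not have to wait for a later traversal.
func (r *ReplicaChecker) fixPeer(region *core.RegionInfo, peer *metapb.Peer, status string) *Operator {
	if len(region.GetPeers()) > r.cluster.GetMaxReplicas() {
		removeExtra := fmt.Sprintf("removeExtra%sReplica", status)
		return CreateRemovePeerOperator(removeExtra, r.cluster, OpReplica, region, peer.GetStoreId())
	}
	// selectReplacementPeer is hypothetical; it stands in for the logic that
	// chooses a healthy target store and allocates a new peer on it.
	newPeer := r.selectReplacementPeer(region, peer)
	if newPeer == nil {
		return nil
	}
	replace := fmt.Sprintf("replace%sReplica", status)
	return CreateMovePeerOperator(replace, r.cluster, region, OpReplica, peer.GetStoreId(), newPeer.GetStoreId(), newPeer.GetId())
}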

@disksing
Contributor

disksing commented Sep 4, 2018

PTAL @nolouch @siddontang

Contributor

@nolouch left a comment

LGTM

@nolouch merged commit ba22a06 into tikv:master on Sep 4, 2018