
coordinator: fix the issue that caused a deadlock when deleting a scheduler (#2625) #2637

Merged (2 commits), Jul 14, 2020


ti-srebot
Contributor

cherry-pick #2625 to release-4.0


Signed-off-by: nolouch <nolouch@gmail.com>

What problem does this PR solve?

PD can no longer serve requests after a scheduler is removed.

What is changed and how does it work?

Root cause: three goroutines acquire the same locks in conflicting orders:

goroutine 1:

  • Calls `func (c *RaftCluster) RemoveScheduler(name string)` to delete the scheduler
  • Holds the RaftCluster write lock
  • Waits for the coordinator write lock

goroutine 2:

  • A background goroutine that continuously collects statistics calls `func (c *coordinator) collectHotSpotMetrics()`
  • Holds the coordinator read lock
  • Calls `s.Scheduler.(hasHotStatus).GetHotWriteStatus()`, which waits for the hotScheduler read lock

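goroutine 3:

  • The scheduling loop calls `func (h *hotScheduler) Schedule(cluster opt.Cluster)`
  • Holds the hotScheduler write lock
  • Calls `cluster.GetStoresStats()`, which waits for the RaftCluster read lock

Each goroutine waits for a lock held by the next one (RaftCluster → coordinator → hotScheduler → RaftCluster), so none of them can proceed.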

Check List

  • Manual test (add detailed scripts or steps below)

Release note

  • Fix the issue that caused a deadlock when deleting a scheduler

@ti-srebot
Contributor Author

@nolouch please accept the invitation; then you can push to the cherry-pick pull requests.
https://github.com/ti-srebot/pd/invitations

@lhy1024 added the require-LGT1 label Jul 14, 2020
Contributor

@lhy1024 left a comment


LGTM

@ti-srebot added the status/LGT1 label and removed the status/LGT2 label Jul 14, 2020
@lhy1024
Contributor

lhy1024 commented Jul 14, 2020

/merge

@ti-srebot added the status/can-merge label Jul 14, 2020
@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot
Contributor Author

@ti-srebot merge failed.

@lhy1024
Contributor

lhy1024 commented Jul 14, 2020

/merge

@ti-srebot
Contributor Author

Your auto merge job has been accepted, waiting for:

  • 2638

@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot
Contributor Author

@ti-srebot merge failed.

@lhy1024
Contributor

lhy1024 commented Jul 14, 2020

/merge

@ti-srebot
Contributor Author

/run-all-tests

@codecov-commenter

codecov-commenter commented Jul 14, 2020

Codecov Report

Merging #2637 into release-4.0 will decrease coverage by 0.00%.
The diff coverage is 50.00%.


@@               Coverage Diff               @@
##           release-4.0    #2637      +/-   ##
===============================================
- Coverage        77.16%   77.16%   -0.01%     
===============================================
  Files              205      205              
  Lines            22115    22116       +1     
===============================================
- Hits             17066    17065       -1     
  Misses            3747     3747              
- Partials          1302     1304       +2     
Impacted Files                          Coverage Δ
server/cluster/coordinator.go           74.31% <50.00%> (-2.40%) ⬇️
server/kv/etcd_kv.go                    84.41% <0.00%> (-3.90%) ⬇️
server/member/leader.go                 71.20% <0.00%> (-3.12%) ⬇️
server/schedulers/random_merge.go       61.19% <0.00%> (-2.99%) ⬇️
server/region_syncer/client.go          84.73% <0.00%> (-1.53%) ⬇️
server/schedule/operator_controller.go  81.41% <0.00%> (+0.16%) ⬆️
server/server.go                        75.86% <0.00%> (+0.45%) ⬆️
server/grpc_service.go                  59.84% <0.00%> (+1.15%) ⬆️
pkg/dashboard/adapter/manager.go        86.79% <0.00%> (+2.83%) ⬆️
... and 2 more


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 999331e...2f43afc.

@lhy1024 added the type/bugfix label Jul 14, 2020
Member

@HunDunDM left a comment


LGTM

@ti-srebot removed the status/LGT1 label Jul 14, 2020
@ti-srebot added the status/LGT2 and status/LGT3 labels Jul 14, 2020
@lhy1024
Contributor

lhy1024 commented Jul 14, 2020

/merge

@ti-srebot
Contributor Author

Your auto merge job has been accepted, waiting for:

  • 2640
  • 2635
  • 2636

@ti-srebot
Contributor Author

/run-all-tests

@ti-srebot merged commit 2d4d389 into tikv:release-4.0 Jul 14, 2020
@zz-jason deleted the release-4.0-e21e8dd64127 branch August 9, 2020 08:28
Labels
  • require-LGT1: Indicates that the PR requires an LGTM.
  • status/can-merge: Indicates a PR has been approved by a committer.
  • status/LGT2: Indicates that a PR has LGTM 2.
  • status/LGT3: The PR has already had 3 LGTM.
  • type/bugfix: This PR fixes a bug.
Projects
None yet

5 participants