Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docdb] Leader load balancing can cause CHECK failures if stepdown task is pending on next run #5181

Closed
bmatican opened this issue Jul 22, 2020 · 1 comment
Assignees
Labels
area/docdb YugabyteDB core features priority/high High Priority

Comments

@bmatican
Copy link
Contributor

Seems that if the LB issues a leader stepdown task in one run, but it's still pending in the next run, we could issue the same task again, leading to a CHECK failure.

Relevant stack

    @     0x7f92a656b6fc  yb::LogFatalHandlerSink::send()
    @     0x7f92a5752346  google::LogMessage::SendToLog()
    @     0x7f92a574f7aa  google::LogMessage::Flush()
    @     0x7f92a5752879  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f92b0a6c8a7  yb::master::ClusterLoadBalancer::SendReplicaChanges()
    @     0x7f92b0a684d8  yb::master::ClusterLoadBalancer::MoveLeader()
    @     0x7f92b0b74f8a  yb::master::enterprise::ClusterLoadBalancer::HandleLeaderMoves()
    @     0x7f92b0a6e5a0  yb::master::ClusterLoadBalancer::RunLoadBalancer()
    @     0x7f92b0b7608e  yb::master::enterprise::ClusterLoadBalancer::RunLoadBalancer()
    @     0x7f92b0a5d4bb  yb::master::CatalogManagerBgTasks::Run()
    @     0x7f92a65fae0f  yb::Thread::SuperviseThread()
    @     0x7f92a1d3a694  start_thread
    @     0x7f92a147741d  __clone
    @              (nil)  (unknown)```

cc @rahuldesirazu 
@bmatican bmatican added area/docdb YugabyteDB core features priority/high High Priority labels Jul 22, 2020
hulien22 added a commit that referenced this issue Jul 27, 2020
…ss to be made across tables (#5021) (#5181)

Summary:
Adding new flag `load_balancer_max_concurrent_moves_per_table` to limit the number of leader moves per table. This flag is meant to be used with `load_balancer_max_concurrent_moves` in order to improve performance for these moves.
Also fixing issue #5181, where having pending leader moves on subsequent LB runs can lead to the same tablet being told to move twice, thus leading to a check failure.  This was caused from `AnalyzeTabletsUnlocked` not properly updating the state for leader stepdowns, and has been fixed by storing new_leader_uuid instead of change_config_ts_uuid for pending leader stepdown tasks.

Test Plan:
`ybd --cxx-test load_balancer_multi_table-test` -- test for load_balancer_max_concurrent_moves_per_table
`ybd --cxx-test load_balancer-test --gtest_filter LoadBalancerTest.PendingLeaderStepdownRegressTest` -- regression test for issue #5181

Reviewers: hector, bogdan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8903
@hulien22
Copy link
Contributor

Closed in 6382fde.

hulien22 added a commit that referenced this issue Aug 4, 2020
…e allowing progress to be made across tables (#5021) (#5181)

Summary:
Adding new flag `load_balancer_max_concurrent_moves_per_table` to limit the number of leader moves per table. This flag is meant to be used with `load_balancer_max_concurrent_moves` in order to improve performance for these moves.
Also fixing issue #5181, where having pending leader moves on subsequent LB runs can lead to the same tablet being told to move twice, thus leading to a check failure.  This was caused from `AnalyzeTabletsUnlocked` not properly updating the state for leader stepdowns, and has been fixed by storing new_leader_uuid instead of change_config_ts_uuid for pending leader stepdown tasks.

Test Plan: Jenkins: rebase: 2.1

Reviewers: bogdan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9082
hulien22 added a commit that referenced this issue Aug 4, 2020
…e allowing progress to be made across tables (#5021) (#5181)

Summary:
Adding new flag `load_balancer_max_concurrent_moves_per_table` to limit the number of leader moves per table. This flag is meant to be used with `load_balancer_max_concurrent_moves` in order to improve performance for these moves.
Also fixing issue #5181, where having pending leader moves on subsequent LB runs can lead to the same tablet being told to move twice, thus leading to a check failure.  This was caused from `AnalyzeTabletsUnlocked` not properly updating the state for leader stepdowns, and has been fixed by storing new_leader_uuid instead of change_config_ts_uuid for pending leader stepdown tasks.

Test Plan: Jenkins: rebase: 2.2

Reviewers: bogdan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9083
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features priority/high High Priority
Projects
None yet
Development

No branches or pull requests

2 participants