[docdb] Leader load balancing can cause CHECK failures if stepdown task is pending on next run #5181

bmatican · 2020-07-22T19:26:57Z

It seems that if the load balancer (LB) issues a leader stepdown task in one run and that task is still pending on the next run, the LB can issue the same task again, leading to a CHECK failure.

Relevant stack

    @     0x7f92a656b6fc  yb::LogFatalHandlerSink::send()
    @     0x7f92a5752346  google::LogMessage::SendToLog()
    @     0x7f92a574f7aa  google::LogMessage::Flush()
    @     0x7f92a5752879  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f92b0a6c8a7  yb::master::ClusterLoadBalancer::SendReplicaChanges()
    @     0x7f92b0a684d8  yb::master::ClusterLoadBalancer::MoveLeader()
    @     0x7f92b0b74f8a  yb::master::enterprise::ClusterLoadBalancer::HandleLeaderMoves()
    @     0x7f92b0a6e5a0  yb::master::ClusterLoadBalancer::RunLoadBalancer()
    @     0x7f92b0b7608e  yb::master::enterprise::ClusterLoadBalancer::RunLoadBalancer()
    @     0x7f92b0a5d4bb  yb::master::CatalogManagerBgTasks::Run()
    @     0x7f92a65fae0f  yb::Thread::SuperviseThread()
    @     0x7f92a1d3a694  start_thread
    @     0x7f92a147741d  __clone
    @              (nil)  (unknown)

cc @rahuldesirazu


…ss to be made across tables (#5021) (#5181)

Summary: Adding new flag `load_balancer_max_concurrent_moves_per_table` to limit the number of leader moves per table. This flag is meant to be used with `load_balancer_max_concurrent_moves` in order to improve performance for these moves.

Also fixing issue #5181, where having pending leader moves on subsequent LB runs can lead to the same tablet being told to move twice, thus leading to a CHECK failure. This was caused by `AnalyzeTabletsUnlocked` not properly updating the state for leader stepdowns, and has been fixed by storing `new_leader_uuid` instead of `change_config_ts_uuid` for pending leader stepdown tasks.

Test Plan:

`ybd --cxx-test load_balancer_multi_table-test` -- test for `load_balancer_max_concurrent_moves_per_table`

`ybd --cxx-test load_balancer-test --gtest_filter LoadBalancerTest.PendingLeaderStepdownRegressTest` -- regression test for issue #5181

Reviewers: hector, bogdan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8903
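For readers following along, here is a rough sketch of the idea behind the fix: remember which tablets already have a leader stepdown in flight, keyed by the intended new leader, and skip them on the next LB run. This is only an illustration under assumptions. `PendingStepdownTracker`, `RegisterTask`, `HasPending`, `PendingNewLeader`, `Complete`, and `MaybeMoveLeader` are hypothetical names, not the actual `ClusterLoadBalancer` code, and the `assert` merely stands in for the CHECK seen in the stack trace.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified model; none of these names come from the
// YugabyteDB source.
class PendingStepdownTracker {
 public:
  // Stand-in for registering a stepdown task. The assert plays the role of
  // the CHECK from the stack trace: a second task for the same tablet is fatal.
  void RegisterTask(const std::string& tablet_id,
                    const std::string& new_leader_uuid) {
    const bool inserted = pending_.emplace(tablet_id, new_leader_uuid).second;
    assert(inserted && "duplicate leader stepdown task for tablet");
    (void)inserted;  // Avoid an unused-variable warning when NDEBUG is set.
  }

  // True if a stepdown for this tablet has been issued and not yet completed.
  bool HasPending(const std::string& tablet_id) const {
    return pending_.count(tablet_id) > 0;
  }

  // Returns the intended new leader, or an empty string if nothing is pending.
  std::string PendingNewLeader(const std::string& tablet_id) const {
    const auto it = pending_.find(tablet_id);
    return it == pending_.end() ? std::string() : it->second;
  }

  // Called once the stepdown has completed or failed.
  void Complete(const std::string& tablet_id) { pending_.erase(tablet_id); }

 private:
  // tablet id -> uuid of the server that should become the new leader.
  std::map<std::string, std::string> pending_;
};

// One balancer decision: skip tablets that already have a stepdown in flight.
// Returns true if a new task was issued, false if the tablet was skipped.
bool MaybeMoveLeader(PendingStepdownTracker* tracker,
                     const std::string& tablet_id,
                     const std::string& new_leader_uuid) {
  if (tracker->HasPending(tablet_id)) {
    return false;
  }
  tracker->RegisterTask(tablet_id, new_leader_uuid);
  return true;
}
```

In this sketch, calling `MaybeMoveLeader` twice for the same tablet before `Complete` is called issues only one task; without the `HasPending` guard the second call would hit the assert, which mirrors the failure reported here.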

hulien22 · 2020-07-28T18:46:36Z

Closed in 6382fde.

…e allowing progress to be made across tables (#5021) (#5181)

Summary: Adding new flag `load_balancer_max_concurrent_moves_per_table` to limit the number of leader moves per table. This flag is meant to be used with `load_balancer_max_concurrent_moves` in order to improve performance for these moves.

Also fixing issue #5181, where having pending leader moves on subsequent LB runs can lead to the same tablet being told to move twice, thus leading to a CHECK failure. This was caused by `AnalyzeTabletsUnlocked` not properly updating the state for leader stepdowns, and has been fixed by storing `new_leader_uuid` instead of `change_config_ts_uuid` for pending leader stepdown tasks.

Test Plan: Jenkins: rebase: 2.1

Reviewers: bogdan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9082

…e allowing progress to be made across tables (#5021) (#5181)

Summary: Adding new flag `load_balancer_max_concurrent_moves_per_table` to limit the number of leader moves per table. This flag is meant to be used with `load_balancer_max_concurrent_moves` in order to improve performance for these moves.

Also fixing issue #5181, where having pending leader moves on subsequent LB runs can lead to the same tablet being told to move twice, thus leading to a CHECK failure. This was caused by `AnalyzeTabletsUnlocked` not properly updating the state for leader stepdowns, and has been fixed by storing `new_leader_uuid` instead of `change_config_ts_uuid` for pending leader stepdown tasks.

Test Plan: Jenkins: rebase: 2.2

Reviewers: bogdan, rahuldesirazu

Reviewed By: rahuldesirazu

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9083

bmatican added the area/docdb (YugabyteDB core features) and priority/high (High Priority) labels Jul 22, 2020

bmatican assigned hulien22 Jul 22, 2020

hulien22 closed this as completed Jul 28, 2020
