LoadBalancer gets stuck on delete tablet task #1655

Closed

Arnav15 opened this issue Jun 28, 2019 · 3 comments

Labels: area/docdb (YugabyteDB core features)

Arnav15 commented Jun 28, 2019

The wait-for-load-balancer task doesn't return when a tablet is being deleted on a dead server. We can:

  1. Make the load balancer track only add and remove tasks, not deletes.
  2. Not issue DeleteTablet commands to a dead tserver (see the sketch after this list).
  3. Lower the timeout on the DeleteTablet attempts from the current one hour.
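
A minimal sketch of option 2, assuming liveness can be judged from the last heartbeat. The names here (TSDescriptor, MaybeSendDeleteTablet, kLivenessWindow) are hypothetical stand-ins, not YugabyteDB's actual API:

  #include <chrono>
  #include <string>
  #include <vector>

  using Clock = std::chrono::steady_clock;

  struct TSDescriptor {  // hypothetical stand-in for the master's tserver descriptor
    std::string uuid;
    Clock::time_point last_heartbeat;
  };

  // Assumed liveness window; YB's real failure detection may differ.
  constexpr auto kLivenessWindow = std::chrono::minutes(5);

  bool IsAlive(const TSDescriptor& ts) {
    return Clock::now() - ts.last_heartbeat < kLivenessWindow;
  }

  // Only enqueue a DeleteTablet RPC if the target tserver has heartbeated
  // recently; otherwise skip it, so no task sits around retrying against a
  // server that can never respond.
  void MaybeSendDeleteTablet(const TSDescriptor& ts, const std::string& tablet_id,
                             std::vector<std::string>* pending_rpcs) {
    if (!IsAlive(ts)) {
      return;  // dead tserver: drop the delete instead of retrying for an hour
    }
    pending_rpcs->push_back("DeleteTablet(" + tablet_id + ") -> " + ts.uuid);
  }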
bmatican commented

The impact of this is that, if a node stops or dies, then for the next hour (our default YB-level task timeout) any YW operations that need to wait for the load to be balanced (such as expanding the cluster) are blocked until the timeout passes.

@rajukumaryb Seems like we made the LB API eagerly block on any pending tasks, including DeleteTablet tasks, instead of just Add/Remove ones.

In the case of a dead node, after 5 minutes, when it gets bounced from the Raft group, the CatalogManager will detect the difference between the last Raft config it recorded and the new one and issue DeleteTablet calls to the dead node. However, if the node is truly dead, these tasks will never succeed!

@rahuldesirazu do you remember if there's any reason why we set the timeout for the async tasks to be 1h? I thought we also had a limit on the number of repeat calls; why is that not kicking in? Given that the master will end up re-issuing these DeleteTablet calls if the TS ever mistakenly checks back in, I think lowering the timeout here is good anyway!

As for the LB API itself, should we change it to explicitly only care about Add/Remove (maybe even LeaderStepdown) tasks? cc @rajukumaryb @rahuldesirazu
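
A hedged sketch of that proposed change: IsLoadBalanced would block only on task types that move load. Only ASYNC_CREATE_REPLICA and ASYNC_DELETE_REPLICA mirror names from the snippet quoted later in this thread; the other enum values and helpers are assumptions for illustration:

  #include <vector>

  // Illustrative task types; only the create/delete replica names come from
  // the code quoted below, the rest are assumed.
  enum class TaskType {
    ASYNC_ADD_SERVER,
    ASYNC_REMOVE_SERVER,
    ASYNC_TRY_STEP_DOWN,  // LeaderStepdown
    ASYNC_CREATE_REPLICA,
    ASYNC_DELETE_REPLICA,
  };

  // Only Add/Remove (and optionally LeaderStepdown) tasks hold up the check.
  bool BlocksLoadBalancing(TaskType t) {
    switch (t) {
      case TaskType::ASYNC_ADD_SERVER:
      case TaskType::ASYNC_REMOVE_SERVER:
      case TaskType::ASYNC_TRY_STEP_DOWN:
        return true;
      default:
        return false;  // DeleteTablet no longer makes IsLoadBalanced wait
    }
  }

  bool IsLoadBalanced(const std::vector<TaskType>& pending_tasks) {
    for (TaskType t : pending_tasks) {
      if (BlocksLoadBalancing(t)) {
        return false;
      }
    }
    return true;
  }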

rahuldesirazu commented Jul 1, 2019

@bmatican Does the CM issue a delete call to the dead node after 5 min? I would have thought the LB would see the tablet as under-replicated and only issue an add to a new peer. Not sure why we set a 1 hour limit for master async tasks. Most calls have a limit on the number of attempts, except for delete and create tasks:

  // True when this task type is subject to the retry-count cap; create and
  // delete replica tasks are exempt and bounded only by the time budget.
  bool RetryLimitTaskType() {
    return type() != ASYNC_CREATE_REPLICA && type() != ASYNC_DELETE_REPLICA;
  }

It would be good to understand both why create/delete don't have a hard cap on the number of retries and why the timeout is 1 hr. From what I understand, it would be fine to lower this number. In terms of changing the LB API, do callers of IsLoadBalanced not care if the LB is doing a delete (i.e., only care about Add/Remove server tasks)? If so, it might be fine to change the behavior, but it could be confusing to have IsLoadBalanced ignore delete tasks.

cc: @Arnav15
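
A minimal sketch of how the two budgets described above interact, assuming the defaults mentioned in this thread (20 attempts, 1 hour); RetryPolicy and ShouldRetry are hypothetical names, not the real master task code:

  #include <chrono>

  enum class TaskType { ASYNC_CREATE_REPLICA, ASYNC_DELETE_REPLICA, OTHER };

  struct RetryPolicy {                  // hypothetical container for the two caps
    int max_attempts = 20;              // default count budget
    std::chrono::minutes deadline{60};  // default 1h time budget
  };

  bool RetryLimitTaskType(TaskType t) {
    return t != TaskType::ASYNC_CREATE_REPLICA &&
           t != TaskType::ASYNC_DELETE_REPLICA;
  }

  bool ShouldRetry(TaskType t, int attempts, std::chrono::minutes elapsed,
                   const RetryPolicy& p) {
    if (elapsed >= p.deadline) {
      return false;                     // the time cap applies to every task type
    }
    if (RetryLimitTaskType(t) && attempts >= p.max_attempts) {
      return false;                     // most tasks also stop after 20 attempts
    }
    return true;  // create/delete replica keep retrying until the deadline
  }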

bmatican commented Jul 1, 2019

It's not the LoadBalancer doing the DeleteTablet, but the CatalogManager. On any new heartbeat from a TS, if a config is reported with different peers than what was previously recorded, we send DeleteTablet requests to any peers that are no longer there.

For the TaskType create/delete exemption, I'm also not 100% sure. However, a lower global timeout, of say 10-15 min, would probably make that not relevant anymore anyway.

Finally, for the LB APIs: technically, delete is relevant, but only indirectly, since a DeleteTablet in flight means that the TSDescriptor likely has a pending delete, which means the LB will be stuck anyway.

So the more I think about it, the more just lowering the timeout of these DeleteTablet calls feels like the right fix.

Might even be ok to consider making DeleteTablet tasks to DEAD servers stop retrying? cc @rahuldesirazu
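
A hedged sketch of the CatalogManager behavior described at the top of this comment: on a heartbeat that reports a new Raft config, diff it against the recorded one and target every peer that dropped out. PeersToDelete is an illustrative name, not the real YB function:

  #include <set>
  #include <string>
  #include <vector>

  // Given the previously recorded Raft config and the one just reported in a
  // heartbeat, return the peers that are no longer in the group; the caller
  // would send each of them a DeleteTablet request.
  std::vector<std::string> PeersToDelete(const std::set<std::string>& old_config,
                                         const std::set<std::string>& new_config) {
    std::vector<std::string> removed;
    for (const auto& peer : old_config) {
      if (new_config.count(peer) == 0) {
        removed.push_back(peer);  // dropped from the Raft group
      }
    }
    return removed;
  }

If one of the removed peers is a dead node, the resulting DeleteTablet request can never succeed, which is exactly the stuck-task case this issue describes.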

@rahuldesirazu rahuldesirazu removed their assignment Aug 20, 2019
bmatican added a commit that referenced this issue Dec 7, 2019
…utes

Summary:
Master async tasks get retried on errors or timeouts. They have a cap on the number of
attempts (default 20), as well as on total time taken (default 1h). Due to exponential backoff,
capped at 1 minute, most tasks end up finishing within the 20-try budget. However, tasks such as
creating and deleting tablets are set to not respect the count budget and instead use the time budget.

We could investigate eventually removing that part of the logic altogether, but for now, just
lowering the max time to 15 minutes felt like an appropriate measure.

Test Plan: Jenkins

Reviewers: rahuldesirazu, nicolas, hector

Reviewed By: hector

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7655
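
To make the commit's arithmetic concrete, here is a small sketch of the retry schedule it describes: exponential backoff capped at 1 minute under the new 15-minute deadline. The 1-second base delay and jitter-free doubling are assumptions for illustration only:

  #include <algorithm>
  #include <cstdio>

  int main() {
    const double kMaxBackoffSec = 60.0;   // per-attempt backoff cap (1 minute)
    const double kDeadlineSec = 15 * 60;  // new 15-minute total budget
    double delay = 1.0;                   // assumed base delay of 1s
    double elapsed = 0.0;
    int attempts = 0;
    while (elapsed + delay <= kDeadlineSec) {
      elapsed += delay;
      ++attempts;
      delay = std::min(delay * 2, kMaxBackoffSec);
    }
    // Under these assumptions a task that never succeeds gets 19 attempts in
    // 843 seconds, comparable to the 20-try count budget most tasks respect.
    std::printf("attempts=%d elapsed=%.0fs\n", attempts, elapsed);
  }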