LoadBalancer gets stuck on delete tablet task #1655

Closed

Arnav15 opened this issue Jun 28, 2019 · 3 comments

Labels: area/docdb (YugabyteDB core features)

Arnav15 commented Jun 28, 2019

The wait-for-load-balancer task doesn't return when a tablet is being deleted on a dead server. We can:

  1. Make the load balancer track only add and remove tasks, not deletes.
  2. Not issue DeleteTablet commands to a dead tserver (see the sketch after this list).
  3. Lower the timeout on the DeleteTablet attempts from the current one hour.
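
A minimal sketch of option 2, assuming liveness can be judged from the last heartbeat. The names here (TSDescriptor, MaybeSendDeleteTablet, kLivenessWindow) are hypothetical stand-ins, not YugabyteDB's actual API:

  #include <chrono>
  #include <string>
  #include <vector>

  using Clock = std::chrono::steady_clock;

  struct TSDescriptor {  // hypothetical stand-in for the master's tserver descriptor
    std::string uuid;
    Clock::time_point last_heartbeat;
  };

  // Assumed liveness window; YB's real failure detection may differ.
  constexpr auto kLivenessWindow = std::chrono::minutes(5);

  bool IsAlive(const TSDescriptor& ts) {
    return Clock::now() - ts.last_heartbeat < kLivenessWindow;
  }

  // Only enqueue a DeleteTablet RPC if the target tserver has heartbeated
  // recently; otherwise skip it, so no task sits around retrying against a
  // server that can never respond.
  void MaybeSendDeleteTablet(const TSDescriptor& ts, const std::string& tablet_id,
                             std::vector<std::string>* pending_rpcs) {
    if (!IsAlive(ts)) {
      return;  // dead tserver: drop the delete instead of retrying for an hour
    }
    pending_rpcs->push_back("DeleteTablet(" + tablet_id + ") -> " + ts.uuid);
  }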
bmatican commented

The impact of this is that, if a node stops or dies, then for the next hour (our default YB-level task timeout) any YW operations that need to wait for the load to be balanced (such as expanding the cluster) are blocked until the timeout passes.

@rajukumaryb Seems like we made the LB API eagerly block on any pending tasks, including DeleteTablet tasks, instead of just Add/Remove ones.

In the case of a dead node, after 5 minutes, when it gets bounced from the Raft group, the CatalogManager will detect the difference between the last Raft config it recorded and the new one and issue DeleteTablet calls to the dead node. However, if the node is truly dead, these tasks will never succeed!

@rahuldesirazu do you remember if there's any reason why we set the timeout for the async tasks to be 1h? I thought we also had a limit on the number of repeat calls; why is that not kicking in? Given that the master will end up re-issuing these DeleteTablet calls if the TS ever mistakenly checks back in, I think lowering the timeout here is good anyway!

As for the LB API itself, should we change it to explicitly only care about Add/Remove (maybe even LeaderStepdown) tasks? cc @rajukumaryb @rahuldesirazu
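
A hedged sketch of that proposed change: IsLoadBalanced would block only on task types that move load. Only ASYNC_CREATE_REPLICA and ASYNC_DELETE_REPLICA mirror names from the snippet quoted later in this thread; the other enum values and helpers are assumptions for illustration:

  #include <vector>

  // Illustrative task types; only the create/delete replica names come from
  // the code quoted below, the rest are assumed.
  enum class TaskType {
    ASYNC_ADD_SERVER,
    ASYNC_REMOVE_SERVER,
    ASYNC_TRY_STEP_DOWN,  // LeaderStepdown
    ASYNC_CREATE_REPLICA,
    ASYNC_DELETE_REPLICA,
  };

  // Only Add/Remove (and optionally LeaderStepdown) tasks hold up the check.
  bool BlocksLoadBalancing(TaskType t) {
    switch (t) {
      case TaskType::ASYNC_ADD_SERVER:
      case TaskType::ASYNC_REMOVE_SERVER:
      case TaskType::ASYNC_TRY_STEP_DOWN:
        return true;
      default:
        return false;  // DeleteTablet no longer makes IsLoadBalanced wait
    }
  }

  bool IsLoadBalanced(const std::vector<TaskType>& pending_tasks) {
    for (TaskType t : pending_tasks) {
      if (BlocksLoadBalancing(t)) {
        return false;
      }
    }
    return true;
  }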

rahuldesirazu commented Jul 1, 2019

@bmatican Does the CM issue a delete call to the dead node after 5 min? I would have thought the LB would see the tablet as under-replicated and only issue an add to a new peer. Not sure why we set a 1 hour limit for master async tasks. Most calls have a limit on the number of attempts, except for delete and create tasks:

  // True when this task type is subject to the retry-count cap; create and
  // delete replica tasks are exempt and bounded only by the time budget.
  bool RetryLimitTaskType() {
    return type() != ASYNC_CREATE_REPLICA && type() != ASYNC_DELETE_REPLICA;
  }

It would be good to understand both why create/delete don't have a hard cap on the number of retries and why the timeout is 1 hr. From what I understand, it would be fine to lower this number. In terms of changing the LB API, do callers of IsLoadBalanced not care if the LB is doing a delete (i.e., only care about Add/Remove server tasks)? If so, it might be fine to change the behavior, but it could be confusing to have IsLoadBalanced ignore delete tasks.

cc: @Arnav15
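
A minimal sketch of how the two budgets described above interact, assuming the defaults mentioned in this thread (20 attempts, 1 hour); RetryPolicy and ShouldRetry are hypothetical names, not the real master task code:

  #include <chrono>

  enum class TaskType { ASYNC_CREATE_REPLICA, ASYNC_DELETE_REPLICA, OTHER };

  struct RetryPolicy {                  // hypothetical container for the two caps
    int max_attempts = 20;              // default count budget
    std::chrono::minutes deadline{60};  // default 1h time budget
  };

  bool RetryLimitTaskType(TaskType t) {
    return t != TaskType::ASYNC_CREATE_REPLICA &&
           t != TaskType::ASYNC_DELETE_REPLICA;
  }

  bool ShouldRetry(TaskType t, int attempts, std::chrono::minutes elapsed,
                   const RetryPolicy& p) {
    if (elapsed >= p.deadline) {
      return false;                     // the time cap applies to every task type
    }
    if (RetryLimitTaskType(t) && attempts >= p.max_attempts) {
      return false;                     // most tasks also stop after 20 attempts
    }
    return true;  // create/delete replica keep retrying until the deadline
  }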

bmatican commented Jul 1, 2019

It's not the LoadBalancer doing the DeleteTablet, but the CatalogManager. On any new heartbeat from a TS, if a config is reported with different peers than what was previously recorded, we send DeleteTablet requests to any peers that are no longer there.

For the TaskType create/delete exemption, I'm also not 100% sure. However, a lower global timeout, of say 10-15 min, would probably make that not relevant anymore anyway.

Finally, for the LB APIs: technically, delete is relevant, but only indirectly, since a DeleteTablet in flight means that the TSDescriptor likely has a pending delete, which means the LB will be stuck anyway.

So the more I think about it, the more just lowering the timeout of these DeleteTablet calls feels like the right fix.

Might even be ok to consider making DeleteTablet tasks to DEAD servers stop retrying? cc @rahuldesirazu
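
A hedged sketch of the CatalogManager behavior described at the top of this comment: on a heartbeat that reports a new Raft config, diff it against the recorded one and target every peer that dropped out. PeersToDelete is an illustrative name, not the real YB function:

  #include <set>
  #include <string>
  #include <vector>

  // Given the previously recorded Raft config and the one just reported in a
  // heartbeat, return the peers that are no longer in the group; the caller
  // would send each of them a DeleteTablet request.
  std::vector<std::string> PeersToDelete(const std::set<std::string>& old_config,
                                         const std::set<std::string>& new_config) {
    std::vector<std::string> removed;
    for (const auto& peer : old_config) {
      if (new_config.count(peer) == 0) {
        removed.push_back(peer);  // dropped from the Raft group
      }
    }
    return removed;
  }

If one of the removed peers is a dead node, the resulting DeleteTablet request can never succeed, which is exactly the stuck-task case this issue describes.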

@rahuldesirazu rahuldesirazu removed their assignment Aug 20, 2019
bmatican added a commit that referenced this issue Dec 7, 2019
…utes

Summary:
Master async tasks get retried on errors or timeouts. They have a cap on the number of
attempts (default 20), as well as on total time taken (default 1h). Due to exponential backoff,
capped at 1 minute, most tasks end up finishing within the 20-try budget. However, tasks such as
creating and deleting tablets are set to not respect the count budget and instead use the time budget.

We could investigate eventually removing that part of the logic altogether, but for now, just
lowering the max time to 15 minutes felt like an appropriate measure.

Test Plan: Jenkins

Reviewers: rahuldesirazu, nicolas, hector

Reviewed By: hector

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7655
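
To make the commit's arithmetic concrete, here is a small sketch of the retry schedule it describes: exponential backoff capped at 1 minute under the new 15-minute deadline. The 1-second base delay and jitter-free doubling are assumptions for illustration only:

  #include <algorithm>
  #include <cstdio>

  int main() {
    const double kMaxBackoffSec = 60.0;   // per-attempt backoff cap (1 minute)
    const double kDeadlineSec = 15 * 60;  // new 15-minute total budget
    double delay = 1.0;                   // assumed base delay of 1s
    double elapsed = 0.0;
    int attempts = 0;
    while (elapsed + delay <= kDeadlineSec) {
      elapsed += delay;
      ++attempts;
      delay = std::min(delay * 2, kMaxBackoffSec);
    }
    // Under these assumptions a task that never succeeds gets 19 attempts in
    // 843 seconds, comparable to the 20-try count budget most tasks respect.
    std::printf("attempts=%d elapsed=%.0fs\n", attempts, elapsed);
  }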