LoadBalancer gets stuck on delete tablet task #1655
The impact of this is that, if a node stops or dies, then for the next 1h (our default YB-level task timeout), any YW operations that need to wait on the load being balanced (such as expanding the cluster) will have to wait until this timeout passes.

@rajukumaryb Seems like we made the LB API eagerly block on any tasks, which includes DeleteTablet tasks, instead of just Add/Remove ones. In the case of a dead node, after 5min, when it gets bounced from Raft, the CatalogManager will detect the difference between the last Raft group it had and the new one and issue DeleteTablet calls to the dead node... However, if the node is truly dead, these tasks will never succeed!

@rahuldesirazu do you remember if there's any reason why we set the timeout for the async tasks to be 1h? I thought we also had a limit on the number of repeat calls? How come that is not kicking in? Given that the master will end up re-issuing these DeleteTablet calls if the TS ever checks back in mistakenly, I think lowering the timeout here is good anyway!

As for the LB API itself, should we change it to explicitly only care about Add/Remove (maybe even LeaderStepdown) tasks? cc @rajukumaryb @rahuldesirazu
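To make the proposed API change concrete, here is a minimal sketch (with hypothetical names; this is not the actual YugabyteDB source) of an `IsLoadBalanced`-style check that only blocks on Add/Remove and LeaderStepdown tasks, rather than on every pending async task:

```cpp
#include <vector>

enum class TaskType { kAddServer, kRemoveServer, kLeaderStepdown,
                      kDeleteTablet, kCreateTablet };

struct PendingTask {
  TaskType type;
};

// Returns true if no *rebalancing* work is in flight. DeleteTablet tasks
// (e.g. ones stuck against a dead tserver) no longer block callers.
bool IsLoadBalancedIgnoringDeletes(const std::vector<PendingTask>& tasks) {
  for (const auto& task : tasks) {
    switch (task.type) {
      case TaskType::kAddServer:
      case TaskType::kRemoveServer:
      case TaskType::kLeaderStepdown:
        return false;  // Real data movement is still pending.
      default:
        break;  // DeleteTablet / CreateTablet don't gate balance.
    }
  }
  return true;
}
```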
@bmatican Does the CM issue a delete call to the dead node after 5 min? I would have thought the LB would see the tablet as under-replicated and only issue an add to a new peer. Not sure why we set a 1-hour limit for master async tasks. Most calls have a limit on the number of retries, except for delete and create tasks.

Would be good to understand both why create/delete don't have a hard cap on the number of retries and why the timeout is 1 hr. From what I understand, it would be fine to lower this number. In terms of changing the LB API, do callers of IsLoadBalanced only care about Add/Remove server tasks, and not about any delete the LB is doing? If so, it might be fine to change the behavior, but it could be confusing to have an IsLoadBalanced call that ignores delete tasks. cc: @Arnav15
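The two retry budgets being discussed can be sketched as follows (hypothetical names, assuming the defaults mentioned in this thread: 20 attempts, 1h total): most tasks give up after a max attempt count, while create/delete tablet tasks only respect the wall-clock deadline.

```cpp
#include <chrono>

struct RetryState {
  int attempts = 0;
  std::chrono::steady_clock::time_point start =
      std::chrono::steady_clock::now();
};

constexpr int kMaxAttempts = 20;                      // per-task attempt cap
constexpr std::chrono::minutes kMaxTaskDuration{60};  // 1h overall cap

// respects_attempt_cap == false models create/delete tablet tasks, which
// retry until the time budget alone is exhausted.
bool ShouldRetry(const RetryState& s, bool respects_attempt_cap) {
  if (respects_attempt_cap && s.attempts >= kMaxAttempts) return false;
  return std::chrono::steady_clock::now() - s.start < kMaxTaskDuration;
}
```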
It's not the LoadBalancer doing the DeleteTablet, but the CatalogManager. On any new heartbeat from a TS, if a config is reported with different peers than what was previously recorded, we send DeleteTablet requests to any peers that are no longer there.

For the TaskType create/delete, I'm also not 100% sure... However, a lower global timeout, of say 10-15min, would probably make that not relevant anymore anyway.

Finally, for the LB APIs: technically, delete is relevant, but only indirectly, since a DeleteTablet in flight means that the TSDescriptor likely has a pending delete, which means the LB will be stuck anyway... So the more I think about it, the more just lowering the timeout of these DeleteTablet calls feels like the right fix. Might even be ok to consider making DeleteTablets to DEAD servers stop retrying? cc @rahuldesirazu
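The CatalogManager behavior described above amounts to a diff between the previously recorded Raft config and the one reported in the heartbeat. A minimal sketch (hypothetical names, not the actual implementation):

```cpp
#include <iostream>
#include <set>
#include <string>

void OnConfigReported(const std::set<std::string>& previous_peers,
                      const std::set<std::string>& reported_peers,
                      const std::string& tablet_id) {
  for (const auto& peer : previous_peers) {
    if (reported_peers.count(peer) == 0) {
      // If `peer` is truly dead, this RPC will be retried until the task's
      // timeout expires -- the stuck behavior this issue is about.
      std::cout << "Sending DeleteTablet(" << tablet_id << ") to "
                << peer << "\n";
    }
  }
}
```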
…utes

Summary: Master async tasks get retried on errors or timeout. They have a cap on the number of attempts (default 20), as well as on the total time taken (default 1h). Due to exponential backoff, capped at 1 minute, most tasks end up finishing within the 20-try budget. However, tasks such as creating and deleting tablets are set to not respect the count budget and instead use only the time budget. We could investigate eventually removing that part of the logic altogether, but for now, just lowering the max time to 15 minutes felt like an appropriate measure.

Test Plan: Jenkins

Reviewers: rahuldesirazu, nicolas, hector

Reviewed By: hector

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7655
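The retry schedule the commit message describes can be sketched like so (hypothetical names; assumes doubling backoff from 1s, capped at 1 minute, with the deadline lowered from 1 hour to 15 minutes):

```cpp
#include <algorithm>
#include <chrono>

constexpr std::chrono::seconds kBackoffCap{60};            // 1-minute cap
constexpr std::chrono::minutes kDeleteTabletDeadline{15};  // was 60

std::chrono::seconds BackoffForAttempt(int attempt) {
  // 1s, 2s, 4s, ... doubling until the 1-minute cap is reached.
  long long delay = 1LL << std::min(attempt, 6);
  return std::min(std::chrono::seconds{delay}, kBackoffCap);
}

bool WithinDeadline(std::chrono::steady_clock::time_point start) {
  return std::chrono::steady_clock::now() - start < kDeleteTabletDeadline;
}
```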
The wait-for-load-balancer task doesn't return in case there is a tablet being deleted on a dead server. Per the discussion above, we can:

- lower the timeout on DeleteTablet (and other create/delete) async tasks, e.g. from 1h to 15min;
- change the LB API (IsLoadBalanced) to only block on Add/Remove (and possibly LeaderStepdown) tasks;
- consider making DeleteTablet calls to DEAD servers stop retrying altogether.