Automatically handle tasks that might be stuck in launching status #2066

Merged: 9 commits, Feb 6, 2020

@ssalinas ssalinas commented Jan 31, 2020

After implementing Mesos master failover handling, we've seen a few cases where a task reaches the master but is not propagated to the agent before the master shuts down. This leaves us with a task that Singularity considers launched but that has no state on the master or agent. To compound this, with our newer, faster reconnect, we can trigger reconciliation before the master has finished building its full state, so the initial reconcile cycle may not pick up the missed task.

This adds a step to the scheduler poller that triggers an explicit reconciliation for any task that has been in a launching state for more than 3 minutes (configurable). It then re-checks every 15s until it gets a status from the master. This should eventually move the tasks to LOST if they were never actually launched. It also adds some logging to the offer cache, which we were not properly cleaning up upon master restart.

Still TODO:

  • fix tests
  • clean up the recent reconcile timestamps so they don't grow forever
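
The selection logic described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this PR: `StalledLaunchChecker`, `tasks_to_reconcile`, and the `last_reconcile` map are hypothetical names, and the 180s and 15s defaults simply mirror the numbers in the description. It also shows one possible way to keep the reconcile timestamps bounded, by dropping entries for tasks that are no longer launching.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StalledLaunchChecker:
    """Hypothetical helper for illustration; not Singularity's actual class."""

    launch_timeout_s: float = 180.0   # assumed default: "more than 3 minutes"
    recheck_interval_s: float = 15.0  # assumed default: "re-check every 15s"
    # task id -> time of the last explicit reconcile request for that task
    last_reconcile: Dict[str, float] = field(default_factory=dict)

    def tasks_to_reconcile(self, launching: Dict[str, float], now: float) -> List[str]:
        """Return ids of launching tasks that are due for an explicit reconcile.

        `launching` maps task id -> launch time (seconds) for every task that
        is still in a launching state.
        """
        # Forget tasks that have left the launching state so the map stays bounded.
        for task_id in list(self.last_reconcile):
            if task_id not in launching:
                del self.last_reconcile[task_id]

        due = []
        for task_id, launched_at in launching.items():
            if now - launched_at <= self.launch_timeout_s:
                continue  # still within the normal launch window
            last = self.last_reconcile.get(task_id)
            if last is None or now - last >= self.recheck_interval_s:
                self.last_reconcile[task_id] = now
                due.append(task_id)
        return due
```

A poller would call `tasks_to_reconcile` on each cycle and send an explicit reconcile request to the master for the returned ids. Once the master reports a status, the task leaves the launching state and its timestamp is dropped on the next call.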

@ssalinas ssalinas added the hs_qa label Feb 3, 2020
@pschoenfelder
Contributor

🚢

@ssalinas ssalinas merged commit f15a155 into master Feb 6, 2020
@ssalinas ssalinas deleted the stalled_launches branch February 6, 2020 15:03
@ssalinas ssalinas added this to the 1.2.0 milestone Feb 6, 2020