Automatically handle tasks that might be stuck in launching status #2066
After implementing Mesos master failover handling, we've seen a few cases where a task makes it to the master but is not propagated to the agent before the master shuts down. This leaves us with a task that is launched in Singularity but has no state on the master or agent. To compound this, with our newer, faster reconnect, it's possible that we trigger our reconciliation before the master has finished building its full state, so the initial reconcile cycle may not pick up the missed task.
This adds a step to the scheduler poller that triggers an explicit reconciliation for any task that has been in a launching state for more than 3 minutes (configurable). It then triggers a re-check every 15s until it gets a status from the master. These reconciliations should eventually move the tasks to LOST if they have not actually been launched. It also adds some logging to the offer cache, which we were not properly cleaning up upon master restart.
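For reviewers, here is a minimal sketch of the selection logic described above. It is not the code in this PR: the class `LaunchingTaskChecker`, its method `tasksToReconcile`, and the field names are hypothetical, and it only decides which task ids would be sent for explicit reconciliation on a given poller run. It assumes the poller supplies a map of task id to the time the task entered the launching state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not Singularity's actual class or API.
class LaunchingTaskChecker {
  private final long stuckThresholdMillis;   // e.g. 3 minutes, configurable
  private final long recheckIntervalMillis;  // e.g. 15 seconds
  // Last time an explicit reconcile was requested for each stuck task.
  private final Map<String, Long> lastReconcileAt = new HashMap<>();

  LaunchingTaskChecker(long stuckThresholdMillis, long recheckIntervalMillis) {
    this.stuckThresholdMillis = stuckThresholdMillis;
    this.recheckIntervalMillis = recheckIntervalMillis;
  }

  // Returns the ids of tasks that should be explicitly reconciled on this run.
  // launchingSince maps task id -> time (ms) the task entered the launching state.
  List<String> tasksToReconcile(Map<String, Long> launchingSince, long nowMillis) {
    // Tasks that received a status update are no longer launching; forget them.
    lastReconcileAt.keySet().retainAll(launchingSince.keySet());

    List<String> toReconcile = new ArrayList<>();
    for (Map.Entry<String, Long> entry : launchingSince.entrySet()) {
      String taskId = entry.getKey();
      long timeInLaunching = nowMillis - entry.getValue();
      if (timeInLaunching <= stuckThresholdMillis) {
        continue; // not stuck (yet)
      }
      Long last = lastReconcileAt.get(taskId);
      // First request, or the re-check interval has elapsed with no status.
      if (last == null || nowMillis - last >= recheckIntervalMillis) {
        lastReconcileAt.put(taskId, nowMillis);
        toReconcile.add(taskId);
      }
    }
    return toReconcile;
  }
}
```

In this sketch, a task drops out of the re-check loop as soon as it leaves the launching map, which corresponds to the master returning a status for it (for example LOST).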
Still TODO: