Automatically handle tasks that might be stuck in launching status #2066

ssalinas · 2020-01-31T20:09:44Z

After implementing Mesos master failover handling, we've seen a few cases where a task makes it to the master but is not propagated to the agent before the master shuts down. This leaves us with a task that is launched in Singularity but has no state on the master or agent. To compound this, with our newer, faster reconnect, we may trigger reconciliation before the master has finished building its full state, so the initial reconcile cycle may not pick up the missed task.

This adds a step to the scheduler poller that triggers an explicit reconciliation for any task that has been in a launching state for more than 3 minutes (configurable). It then triggers a re-check every 15s until it gets a status from the master. This should eventually move the tasks to LOST if they were never actually launched. It also adds some logging to the offer cache, which we were not properly cleaning up upon master restart.
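For readers who want the shape of the check without opening the diff, here is a minimal sketch of the selection logic described above. It is not the code in this PR: the class `StalledLaunchChecker`, its method names, and the map-based inputs are all hypothetical, and the actual call that asks the master for an explicit reconcile is left out. It only decides which launching tasks are due for a reconcile on a given poll, and it prunes its timestamp map so the map does not grow forever.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names are taken from Singularity. */
class StalledLaunchChecker {
  private final long stalledAfterMillis; // assumed default: 3 minutes, configurable
  private final long recheckEveryMillis; // assumed default: 15 seconds

  // taskId -> time we last requested an explicit reconcile for that task
  private final Map<String, Long> lastReconcileAt = new HashMap<>();

  StalledLaunchChecker(long stalledAfterMillis, long recheckEveryMillis) {
    this.stalledAfterMillis = stalledAfterMillis;
    this.recheckEveryMillis = recheckEveryMillis;
  }

  /**
   * Called once per poll.
   *
   * @param launchingSince taskId -> launch timestamp, for tasks still in a launching state
   * @param now current time in milliseconds
   * @return ids of tasks that should get an explicit reconcile on this poll
   */
  List<String> tasksToReconcile(Map<String, Long> launchingSince, long now) {
    // Forget tasks that are no longer launching so the map stays bounded.
    lastReconcileAt.keySet().retainAll(launchingSince.keySet());

    List<String> due = new ArrayList<>();
    for (Map.Entry<String, Long> entry : launchingSince.entrySet()) {
      String taskId = entry.getKey();
      if (now - entry.getValue() <= stalledAfterMillis) {
        continue; // not launching for long enough to be considered stuck
      }
      Long last = lastReconcileAt.get(taskId);
      if (last == null || now - last >= recheckEveryMillis) {
        lastReconcileAt.put(taskId, now);
        due.add(taskId);
      }
    }
    return due;
  }

  /** Number of tasks currently tracked; exposed only to illustrate the pruning. */
  int trackedCount() {
    return lastReconcileAt.size();
  }
}
```

In this sketch, a task that receives a status update simply drops out of `launchingSince` on the next poll, which stops the re-checks and removes its timestamp.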

Still TODO:

fix tests
clean up the recent reconcile timestamps so the map doesn't grow forever

pschoenfelder · 2020-02-03T17:25:16Z

🚢

ssalinas added 5 commits January 31, 2020 14:54

Automatically handle tasks that might be stuck in launching status

fb05da3

fix tests + clean up map

ac04419

Additional logic for PublishSubject retries

ab3f636

fix merge conflicts with master

26efb7c

typo

ed2c8ed

ssalinas added the hs_staging label Feb 3, 2020

ssalinas added 4 commits February 3, 2020 09:18

tweaks

d95bb1e

move retries to client

d971f08

short circuit reconnect if previous attempt already worked

b1a2d50

clean up logging

5753145

ssalinas added the hs_qa label Feb 3, 2020

ssalinas added the hs_stable label Feb 4, 2020

ssalinas merged commit f15a155 into master Feb 6, 2020

ssalinas deleted the stalled_launches branch February 6, 2020 15:03

ssalinas added this to the 1.2.0 milestone Feb 6, 2020
