12 hour inactivity shutdown #919
@mojtaba-komeili this type of feature would have been useful for damage control on your issue (though hopefully #911 is helping on the exploration part of that front).
test/core/test_operator.py
@@ -261,6 +261,50 @@ def test_run_job_not_concurrent(self):
        assignment = task_run.get_assignments()[0]
        self.assertEqual(assignment.get_status(), AssignmentState.COMPLETED)

    @patch("mephisto.operations.operator.RUN_STATUS_POLL_TIME", 1.5)
    def test_patience_shutdown(self):
        """Ensure that a job can be run that doesn't require connected workers"""
Not sure if this is the right docstring for this test function.
Good catch!
Thanks for adding this. It is needed often, especially when running over weekends, when there can be hours of no activity.
Codecov Report
Base: 64.61% // Head: 64.61% // Increases project coverage by
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #919   +/-   ##
=======================================
  Coverage   64.61%   64.61%
=======================================
  Files         108      108
  Lines        9329     9335    +6
=======================================
+ Hits         6028     6032    +4
- Misses       3301     3303    +2
Overview
In the vein of #918, and to reduce the negative impact of the long tail of failure cases, this feature adds a configurable submission_patience, which automatically shuts down a Mephisto TaskRun when no submissions have come through in (by default) 12 hours. This should catch cases where a task is available but can't be completed (due to a bug in the code, an onboarding that's impossible, pairing problems, etc.) and take the task offline before it wastes workers' time.

Implementation
- Adds a no_submission_patience argument to TaskRunArgs.
- Adds a last_submission_time attribute to the ClientIOHandler, which is updated every time a submission packet comes in.
- Updates Operator._track_and_kill_runs to also check whether the patience is exceeded as a reason for shutting down a task.

Testing
Current automated tests pass, and a new test is added for this case too.
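
For readers skimming the thread, the patience check described under Implementation can be sketched roughly as below. This is a standalone illustration, not the code in this PR: SketchIOHandler, SketchRun, patience_exceeded and the injectable clock are hypothetical stand-ins, and starting the timer at launch (so a run with zero submissions can still time out) is an assumption. Only the names no_submission_patience and last_submission_time, and the 12 hour default, come from the description.

```python
import time
from typing import Callable, List

# 12 hours in seconds, the default stated in the PR description.
DEFAULT_PATIENCE_SECONDS = 12 * 60 * 60


class SketchIOHandler:
    """Hypothetical stand-in for ClientIOHandler: tracks the last submission time."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # Assumption: the timer starts at launch, not at the first submission.
        self.last_submission_time = self._clock()

    def on_submission_packet(self) -> None:
        # Refresh the timestamp every time a submission packet comes in.
        self.last_submission_time = self._clock()


class SketchRun:
    """Hypothetical stand-in for a tracked task run."""

    def __init__(
        self,
        io_handler: SketchIOHandler,
        no_submission_patience: float = DEFAULT_PATIENCE_SECONDS,
    ) -> None:
        self.io_handler = io_handler
        self.no_submission_patience = no_submission_patience
        self.shutdown_requested = False


def patience_exceeded(run: SketchRun, now: float) -> bool:
    """True when more than the configured patience has passed without a submission."""
    return now - run.io_handler.last_submission_time > run.no_submission_patience


def track_and_kill_runs(runs: List[SketchRun], now: float) -> List[SketchRun]:
    """One polling pass: flag runs whose patience ran out and return those flagged."""
    killed = []
    for run in runs:
        if not run.shutdown_requested and patience_exceeded(run, now):
            run.shutdown_requested = True
            killed.append(run)
    return killed
```

Passing the clock in as a parameter keeps the sketch testable without sleeping, which is the same concern the real test addresses by patching RUN_STATUS_POLL_TIME to a short value.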