after an emergency restart, flux doesn't know about user processes that are still running #6590
Comments
It might be fairly easy to implement this by adding a Then maybe the resource module on rank 0 could require execution targets to join that group before they can be declared initially "up" to the scheduler. (Perhaps after that the group could just be ignored). And some tooling could be added to query those. It would be nice if we could also query the root systemd instance for prolog/epilog/housekeeping units. I wonder if the A quick strace of
Problem: nodes are not checked for untracked running work when a Flux instance starts up. This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system

When systemd is enabled, the new sdmon module joins the 'sdmon.idle' broker group on startup. If there are any running flux units, this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.idle instead of broker.online when systemd is enabled. This will withhold "busy" nodes from the scheduler until they become idle.

Fixes flux-framework#6590
Problem: nodes are not checked for untracked running work when a Flux instance starts up. This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system

When systemd is enabled, the new sdmon module joins the 'sdmon.idle' broker group on startup. However, if there are any running flux units, this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.idle instead of broker.online when systemd is enabled. This will withhold "busy" nodes from the scheduler until they become idle.

Fixes flux-framework#6590
Problem: nodes are not checked for untracked running work when a Flux instance starts up. This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system
- the broker exits without proper shutdown

When systemd is enabled, the new sdmon module joins the 'sdmon.online' broker group on startup. However, if there are any running flux units, this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.online instead of broker.online when systemd is enabled. This will withhold "busy" nodes from the scheduler until they become idle.

Fixes flux-framework#6590
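The gating behavior described in the commit message can be modeled with a small, self-contained simulation. This is only an illustrative sketch, not flux-core code: `NodeMonitor`, `schedulable_ranks`, and the unit name `leftover-job-shell.service` are hypothetical, and a plain Python set stands in for the 'sdmon.online' broker group.

```python
from dataclasses import dataclass, field


@dataclass
class NodeMonitor:
    """Hypothetical stand-in for the per-node sdmon module."""
    rank: int
    running_units: set = field(default_factory=set)
    joined: bool = False

    def poll(self, group: set) -> None:
        # Join the group only once no flux units remain on this node.
        if not self.joined and not self.running_units:
            group.add(self.rank)
            self.joined = True

    def unit_exited(self, name: str, group: set) -> None:
        # A leftover unit finished; re-check whether the node is clean.
        self.running_units.discard(name)
        self.poll(group)


def schedulable_ranks(group: set) -> list:
    """Ranks that a resource module watching the group would release."""
    return sorted(group)


sdmon_online = set()  # stand-in for the 'sdmon.online' broker group
nodes = [
    NodeMonitor(0),                                  # clean node
    NodeMonitor(1, {"leftover-job-shell.service"}),  # busy after restart
]
for node in nodes:
    node.poll(sdmon_online)
before = schedulable_ranks(sdmon_online)  # rank 1 is withheld

nodes[1].unit_exited("leftover-job-shell.service", sdmon_online)
after = schedulable_ranks(sdmon_online)   # rank 1 is now released
```

In this model the busy node is simply absent from the group until its last unit exits, so the consumer of the group needs no extra per-node state to withhold it.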
Problem: If flux is not shut down cleanly, running jobs can escape its control.
When flux restarts, it should detect any running jobs, prolog, epilog, or housekeeping systemd units, and not release those execution targets to the scheduler until they are clean.
Perhaps this could be handled by the resource module's "monitoring" subsystem.