after an emergency restart, flux doesn't know about user processes that are still running #6590

garlick · 2025-01-30T18:16:20Z

Problem: If flux is not shut down cleanly, running jobs can escape its control.

When flux restarts, it should detect any running jobs, prolog, epilog, or housekeeping systemd units, and not release those execution targets to the scheduler until they are clean.

Perhaps this could be handled by the resource module's "monitoring" subsystem.

garlick · 2025-02-01T03:16:27Z

It might be fairly easy to implement this by adding a broker.idle group that sdexec joins on each rank when it knows there are no systemd units running there.

Then maybe the resource module on rank 0 could require execution targets to join that group before they can be declared initially "up" to the scheduler. (Perhaps after that the group could just be ignored.) Some tooling could also be added to query those.
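
To make the gating idea above concrete, here is a toy Python model. It is not Flux code: the class and method names are hypothetical, and it only illustrates the proposed rule that a rank is released to the scheduler once it has joined the idle group, after which group membership is ignored.

```python
class IdleGate:
    """Toy model of the proposed gating (hypothetical, not the Flux API)."""

    def __init__(self, ranks):
        self.pending = set(ranks)  # ranks not yet declared "up"
        self.up = set()            # ranks released to the scheduler

    def join_idle(self, rank):
        """A rank reports that no systemd units are running there."""
        if rank in self.pending:
            self.pending.discard(rank)
            self.up.add(rank)

    def leave_idle(self, rank):
        """After the initial release the group is ignored, so nothing changes."""
        return None


gate = IdleGate(range(4))
gate.join_idle(0)
gate.leave_idle(0)  # rank 0 stays "up"
print(sorted(gate.up), sorted(gate.pending))
```

In this sketch ranks 1 through 3 stay withheld until each one joins the group.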

It would be nice if we could also query the root systemd instance for prolog/epilog/housekeeping units. I wonder if the sdbus module could be modified so it could be loaded twice, with one instance assigned to root and another assigned to user? (I vaguely remember that a regular user might not be able to authenticate to the root dbus, but systemctl seems to be able to even though it's not setuid.)

A quick strace of systemctl list-units run as a regular user shows it connecting to the system bus socket:

connect(3, {sa_family=AF_UNIX, sun_path="/run/dbus/system_bus_socket"}, 30) = 0
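
The same connect() can be tried from an unprivileged process with a few lines of Python. This is only a reachability check under the assumption that the socket path matches the strace output above; the helper name is hypothetical. A successful connect does not show that D-Bus authentication or policy checks would pass, since those happen after the connection is made.

```python
import socket

# Path taken from the strace output above.
SYSTEM_BUS_SOCKET = "/run/dbus/system_bus_socket"


def can_connect(path=SYSTEM_BUS_SOCKET):
    """Return True if a stream connection to the UNIX socket at path succeeds."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        try:
            sock.connect(path)
        except OSError:
            # Missing socket, permission denied, nothing listening, etc.
            return False
        return True


if __name__ == "__main__":
    print("system bus reachable:", can_connect())
```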

Problem: nodes are not checked for untracked running work when a Flux instance starts up.

This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system

When systemd is enabled, the new sdmon module joins the 'sdmon.idle' group on startup. If there are any running flux units, this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.idle instead of broker.online when systemd is enabled. This will withhold "busy" nodes from the scheduler until they become idle.

Fixes flux-framework#6590

Problem: nodes are not checked for untracked running work when a Flux instance starts up.

This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system

When systemd is enabled, the new sdmon module joins the 'sdmon.idle' broker group on startup. However, if there are any running flux units, this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.idle instead of broker.online when systemd is enabled. This will withhold "busy" nodes from the scheduler until they become idle.

Fixes flux-framework#6590

Problem: nodes are not checked for untracked running work when a Flux instance starts up.

This might happen, for example, if
- job-exec deems job shell(s) unkillable
- housekeeping/prolog/epilog gets stuck on a hung file system
- the broker exits without proper shutdown

When systemd is enabled, the new sdmon module joins the 'sdmon.online' broker group on startup. However, if there are any running flux units, this is delayed until those units are no longer running.

Change the resource module so that it monitors sdmon.online instead of broker.online when systemd is enabled. This will withhold "busy" nodes from the scheduler until they become idle.

Fixes flux-framework#6590
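
The delayed join described in the commit message above boils down to a simple decision, sketched here in Python. This is not the sdmon implementation: the function names, the "flux-*" unit name pattern, and the choice of which systemd ActiveState values count as "running" are all assumptions made for illustration.

```python
import fnmatch

# Assumption: these systemd ActiveState values are treated as "still running".
BUSY_STATES = ("active", "activating", "deactivating", "reloading")


def running_flux_units(units, pattern="flux-*"):
    """Return the sorted names of matching units that are still running.

    units maps unit name -> ActiveState string.
    pattern is a hypothetical glob for units started by Flux.
    """
    return sorted(
        name
        for name, state in units.items()
        if fnmatch.fnmatch(name, pattern) and state in BUSY_STATES
    )


def should_join_group(units, pattern="flux-*"):
    """Join the group only when no matching unit is still running."""
    return not running_flux_units(units, pattern)


units = {"flux-example.service": "active", "sshd.service": "active"}
print(running_flux_units(units), should_join_group(units))
```

A node with a leftover matching unit would stay out of the group, and so stay withheld from the scheduler, until that unit is no longer running.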

garlick linked a pull request Feb 7, 2025 that will close this issue

avoid scheduling jobs on compute nodes that are not cleaned up #6616

Open
