Terminate dynamics early within a SLURM job when it runs out of time #335

Alexsp32 · 2024-03-21T16:26:56Z

Add a SLURM-aware callback that will terminate a trajectory when the associated SLURM job has less than a threshold time left.

Allow this threshold time to be set manually, in case any output functions that are calculated after termination (e.g. OutputPotentialEnergy) take particularly long.

In future, this could allow us to restart simulations that ran out of time.
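A rough sketch of the time check such a callback would need, written in Python purely for illustration rather than as the package's Julia implementation. It assumes the remaining time is queried with `squeue -h -j <job id> -o %L`, which prints the time left as `[days-]hours:minutes:seconds`. The names `parse_slurm_time`, `seconds_left` and `should_terminate` are hypothetical, and how the check would be attached to the dynamics as a callback is not shown.

```python
import os
import subprocess


def parse_slurm_time(text):
    """Convert Slurm's [days-][hours:]minutes:seconds string into seconds.

    Returns None for values that are not a duration (e.g. "UNLIMITED").
    """
    days, _, rest = text.strip().rpartition("-")
    try:
        parts = [int(p) for p in rest.split(":")]
        n_days = int(days) if days else 0
    except ValueError:
        return None
    if len(parts) > 3:
        return None
    # Pad missing leading fields so "3:04" is read as 0 h, 3 min, 4 s.
    hours, minutes, seconds = [0] * (3 - len(parts)) + parts
    return ((n_days * 24 + hours) * 60 + minutes) * 60 + seconds


def seconds_left(job_id=None):
    """Ask squeue how long the job has left; None if it cannot be determined."""
    job_id = job_id or os.environ.get("SLURM_JOB_ID")
    if job_id is None:
        return None  # not running inside a Slurm job
    try:
        out = subprocess.run(
            ["squeue", "-h", "-j", str(job_id), "-o", "%L"],
            capture_output=True, text=True, check=True, timeout=30,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_slurm_time(out)


def should_terminate(remaining, threshold):
    """True when the remaining time is known and below the threshold (seconds)."""
    return remaining is not None and remaining < threshold
```

A callback could call `should_terminate(seconds_left(), threshold)` every so many steps rather than every step, since each call starts a `squeue` process. The threshold would need to be larger than the time the post-termination output functions take.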


Alexsp32 · 2024-03-21T16:27:44Z

Leaving this here for implementation: https://stackoverflow.com/questions/75667638/having-a-slurm-job-check-how-long-until-itself-ends

jamesgardner1421 · 2024-03-22T17:58:47Z

Interesting idea! If you're most interested in being able to continue trajectories that run out of time, it may be simpler to periodically write the restart data rather than trying to communicate with Slurm. Though the challenging part is probably implementing the restarts, which would be needed for both approaches.

reinimaurer1 · 2024-03-22T18:08:15Z

Yes, restarts would require some form of standardized definition of what each simulation needs in order to resume, which is tricky as different types of simulations need different things. @Alexsp32 has also been pushing for us to start thinking about proper database integration, which would help with this massively.

jamesgardner1421 · 2024-03-22T21:21:06Z

I think the serialisation of the current simulation state wouldn't actually be too complicated, as everything is structured into simulation + dynamics variables. Standard Julia utilities like JLD2 would make it quite simple. Then a periodic callback could be used to output checkpoints throughout the simulation. This would work well in the case of a single trajectory but might get more complicated in the ensemble case.

It would be nice if the checkpoint restart format was equivalent to the input for a regular simulation, so there's no special logic for resuming.
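To illustrate the idea of a checkpoint that doubles as a regular input, here is a minimal sketch in Python (not the package's Julia code, and using JSON instead of JLD2). The toy harmonic-oscillator `step`, the state keys, and the names `run`, `save_checkpoint` and `load_state` are all assumptions made for the example. The point is only that `run` takes the same dictionary whether it is a fresh input or a loaded checkpoint, so resuming needs no special logic.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write the state to a temporary file, then rename it into place, so a
    job killed mid-write never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_state(path):
    with open(path) as f:
        return json.load(f)


def step(state):
    """One velocity-Verlet step of a unit-mass, unit-frequency oscillator
    (a stand-in for the real dynamics)."""
    dt, x, v = state["dt"], state["x"], state["v"]
    v_half = v - 0.5 * dt * x
    x_new = x + dt * v_half
    v_new = v_half - 0.5 * dt * x_new
    return {**state, "x": x_new, "v": v_new, "step": state["step"] + 1}


def run(state, checkpoint_path, checkpoint_every=100, stop_after=None):
    """Advance until state["n_steps"]; stop_after simulates running out of time."""
    while state["step"] < state["n_steps"]:
        if stop_after is not None and state["step"] >= stop_after:
            break
        state = step(state)
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)
    return state
```

Resuming is then `run(load_state("checkpoint.json"), "checkpoint.json")`, exactly the call used to start a new trajectory. The ensemble case would need one such state per trajectory, which this sketch does not address.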

What sort of database integration did you have in mind? For storing trajectory outcomes like sticking/desorption? Or for the trajectory data itself?
An interesting idea would be to have a database with an entry for each trajectory, storing the data needed to start/restart the trajectory, linking to the results/observable for that trajectory. That way the database would provide a unified place to set up your simulations, view their current status and extract the results. I'm not sure what an implementation of this would look like.
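One possible shape for such a database, sketched in Python with the standard-library `sqlite3` module. This is only a guess at a layout, not a proposal for the actual implementation: the table name, columns, status values and helper functions are all hypothetical. Each row holds the serialised state needed to start or restart a trajectory, its current status, and its result once finished.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the trajectory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS trajectories (
               id     INTEGER PRIMARY KEY,
               status TEXT NOT NULL DEFAULT 'pending',
               state  TEXT NOT NULL,  -- JSON needed to start or restart
               result TEXT            -- JSON observables, NULL until finished
           )"""
    )
    return conn


def add_trajectory(conn, state):
    """Register a new trajectory and return its id."""
    cur = conn.execute(
        "INSERT INTO trajectories (state) VALUES (?)", (json.dumps(state),)
    )
    conn.commit()
    return cur.lastrowid


def update_state(conn, traj_id, state, status="running"):
    """Overwrite the stored restart state, e.g. from a periodic checkpoint."""
    conn.execute(
        "UPDATE trajectories SET state = ?, status = ? WHERE id = ?",
        (json.dumps(state), status, traj_id),
    )
    conn.commit()


def finish(conn, traj_id, result):
    """Store the result and mark the trajectory as finished."""
    conn.execute(
        "UPDATE trajectories SET status = 'finished', result = ? WHERE id = ?",
        (json.dumps(result), traj_id),
    )
    conn.commit()


def unfinished(conn):
    """Return (id, state) pairs for everything that still needs to run."""
    rows = conn.execute(
        "SELECT id, state FROM trajectories WHERE status != 'finished' ORDER BY id"
    )
    return [(traj_id, json.loads(state)) for traj_id, state in rows]
```

A restart script would loop over `unfinished(conn)` and resume each trajectory from its stored state, while `SELECT status, result FROM trajectories` gives the overview of current status and outcomes.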

Alexsp32 added the enhancement New feature or request label Mar 21, 2024

Alexsp32 self-assigned this Mar 21, 2024

Alexsp32 changed the title ~~Finish dynamics early within a SLURM job~~ Terminate dynamics early within a SLURM job when it runs out of time. Mar 21, 2024

Alexsp32 changed the title ~~Terminate dynamics early within a SLURM job when it runs out of time.~~ Terminate dynamics early within a SLURM job when it runs out of time Mar 21, 2024
