Terminate dynamics early within a SLURM job when it runs out of time #335
Comments
Leaving this here for implementation: https://stackoverflow.com/questions/75667638/having-a-slurm-job-check-how-long-until-itself-ends
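For reference, one way to get the remaining time is to ask `squeue` for the job's time-left field (`squeue -h -j $SLURM_JOB_ID -o %L`) and convert the result to seconds. Below is a minimal sketch of that logic in Python, purely for illustration: the function names `parse_time_left` and `slurm_time_left` are hypothetical, the parser assumes the `[days-][hours:]minutes:seconds` layout that `%L` prints, and the actual implementation here would be a Julia callback.

```python
import os
import subprocess
from typing import Optional


def parse_time_left(text: str) -> Optional[float]:
    """Convert squeue's %L field ([days-][hours:]minutes:seconds) to seconds.

    Returns None for non-numeric values such as UNLIMITED or NOT_SET.
    """
    text = text.strip()
    if not text or not text[0].isdigit():
        return None
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [int(f) for f in text.split(":")]
    # Pad on the left so that "MM:SS" becomes [0, MM, SS].
    while len(fields) < 3:
        fields.insert(0, 0)
    hours, minutes, seconds = fields
    return float(((days * 24 + hours) * 60 + minutes) * 60 + seconds)


def slurm_time_left() -> Optional[float]:
    """Ask squeue how long the current job has left; None outside SLURM."""
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is None:
        return None
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%L"],
        capture_output=True, text=True, check=True,
    )
    return parse_time_left(result.stdout)
```

Calling `squeue` on every time step would be wasteful and may annoy the scheduler, so a real callback would probably only poll every few minutes of wall time.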
Interesting idea! If you're most interested in being able to continue trajectories that run out of time, it may be simpler to periodically write the restart data rather than trying to communicate with Slurm. Though the challenging part is probably implementing the restarts, which would be needed for both approaches.
Yes, restarts would require some standardized definition of what needs to be saved, which is tricky as different types of simulations need different things. @Alexsp32 has also been pushing for us to start thinking about proper database integration, which would help with this massively.
I think the serialisation of the current simulation state wouldn't actually be too complicated, as everything is structured into simulation + dynamics variables. Standard Julia utilities like JLD2 would make it quite simple. Then a periodic callback could be used to output checkpoints throughout the simulation. This would work well in the case of a single trajectory but might get more complicated in the ensemble case. It would be nice if the checkpoint restart format was equivalent to the input for a regular simulation, so there's no special logic for resuming. What sort of database integration did you have in mind? For storing trajectory outcomes like sticking/desorption? Or for the trajectory data itself?
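To make the checkpoint idea concrete, here is a rough sketch of the logic in Python (illustration only; in the package this would be a Julia callback writing with something like JLD2). All names (`write_checkpoint`, `read_checkpoint`, `run_with_checkpoints`) and the dictionary layout are hypothetical. The checkpoint stores the same pieces a fresh run takes as input, so resuming is just a normal run with a different starting step, and the file is written to a temporary name first so a job killed mid-write does not leave a corrupt checkpoint.

```python
import os
import pickle
import tempfile


def write_checkpoint(path, simulation, variables, step):
    """Atomically save the simulation, the dynamics variables and the step count."""
    payload = {"simulation": simulation, "variables": variables, "step": step}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(payload, handle)
    # Rename only after the write has finished, so `path` is never half-written.
    os.replace(tmp_path, path)


def read_checkpoint(path):
    """Load a checkpoint written by write_checkpoint."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run_with_checkpoints(path, simulation, variables, step_fn, n_steps, every, start=0):
    """Advance from step `start` to `n_steps`, checkpointing every `every` steps."""
    for i in range(start, n_steps):
        variables = step_fn(simulation, variables)
        if (i + 1) % every == 0:
            write_checkpoint(path, simulation, variables, i + 1)
    return variables
```

Resuming would then look like `ckpt = read_checkpoint(path)` followed by `run_with_checkpoints(path, ckpt["simulation"], ckpt["variables"], step_fn, n_steps, every, start=ckpt["step"])`. The ensemble case would need one file (or one entry) per trajectory, which this sketch does not attempt.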
Add a SLURM-aware callback that will terminate a trajectory when the associated SLURM job has less than a threshold time left.
Allow this threshold time to be set manually, in case any output functions that are calculated after termination take particularly long (e.g. OutputPotentialEnergy).
This could allow us to restart simulations that ran out of time in future.
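As a rough sketch of what is being asked for, the termination condition could look like the following (written in Python only to illustrate the logic; the real version would be a Julia callback). The names `make_slurm_terminator`, `time_left_from_env` and `run` are hypothetical. The example reads the job end time from the `SLURM_JOB_END_TIME` environment variable, which is an assumption: it is only exported by sufficiently recent Slurm versions, and querying `squeue` would be an alternative.

```python
import os
import time
from typing import Callable, Optional


def time_left_from_env() -> Optional[float]:
    """Seconds until the job's projected end, or None if not running under SLURM.

    Assumes the cluster's Slurm version exports SLURM_JOB_END_TIME (a UNIX timestamp).
    """
    end = os.environ.get("SLURM_JOB_END_TIME")
    if end is None:
        return None
    return float(end) - time.time()


def make_slurm_terminator(
    threshold: float,
    time_left: Callable[[], Optional[float]] = time_left_from_env,
) -> Callable[[], bool]:
    """Build a condition that is True once fewer than `threshold` seconds remain."""
    def should_terminate() -> bool:
        remaining = time_left()
        # Outside SLURM (remaining is None) the trajectory is never cut short.
        return remaining is not None and remaining < threshold
    return should_terminate


def run(n_steps, step, should_terminate, check_every=100):
    """Take up to n_steps steps, testing the condition every `check_every` steps.

    Returns the number of steps actually taken.
    """
    for i in range(n_steps):
        if i % check_every == 0 and should_terminate():
            return i
        step()
    return n_steps
```

The `threshold` argument is the manually set margin from the description above: it should be large enough to cover whatever output functions run after termination.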