
Terminate dynamics early within a SLURM job when it runs out of time #335

Open

Alexsp32 opened this issue Mar 21, 2024 · 4 comments
Labels: enhancement (New feature or request)

@Alexsp32
Member

Add a SLURM-aware callback that terminates a trajectory when the associated SLURM job has less than a threshold amount of time left.

Allow this threshold to be set manually, in case any output functions calculated after termination (e.g. OutputPotentialEnergy) take particularly long.

This would also allow us, in the future, to restart simulations that ran out of time.
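
For illustration, a minimal sketch of what such a callback might look like, assuming the dynamics accept standard DifferentialEquations.jl callbacks and that the job environment exports `SLURM_JOB_END_TIME` (recent SLURM versions do; older ones would need `squeue`/`scontrol` instead). Function names and the default threshold are placeholders:

```julia
using SciMLBase: DiscreteCallback, terminate!

"Seconds remaining in the surrounding SLURM job, or `Inf` when not running under SLURM."
function slurm_time_remaining()
    # SLURM_JOB_END_TIME is a Unix timestamp exported by recent SLURM versions.
    haskey(ENV, "SLURM_JOB_END_TIME") || return Inf
    return parse(Float64, ENV["SLURM_JOB_END_TIME"]) - time()
end

"Callback that stops the integration once fewer than `threshold` seconds remain."
function slurm_termination_callback(threshold::Real=300.0)
    condition(u, t, integrator) = slurm_time_remaining() < threshold
    affect!(integrator) = terminate!(integrator)
    return DiscreteCallback(condition, affect!; save_positions=(false, false))
end
```

It would then be passed to the solver like any other callback, e.g. `callback=slurm_termination_callback(600.0)`.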

@Alexsp32 Alexsp32 added the enhancement New feature or request label Mar 21, 2024
@Alexsp32 Alexsp32 self-assigned this Mar 21, 2024
@Alexsp32 Alexsp32 changed the title Finish dynamics early within a SLURM job Terminate dynamics early within a SLURM job when it runs out of time. Mar 21, 2024
@Alexsp32 Alexsp32 changed the title Terminate dynamics early within a SLURM job when it runs out of time. Terminate dynamics early within a SLURM job when it runs out of time Mar 21, 2024
@Alexsp32
Member Author

@jamesgardner1421
Member

Interesting idea! If you're mostly interested in being able to continue trajectories that run out of time, it may be simpler to periodically write the restart data rather than trying to communicate with SLURM. The challenging part is probably implementing the restarts themselves, which would be needed for both approaches.

@reinimaurer1
Member

Yes, restarts would require some standardized definition of the quantities that need to be saved, which is tricky because different types of simulation need different things. @Alexsp32 has also been pushing for us to start thinking about proper database integration, which would help with this massively.

@jamesgardner1421
Member

I think serialising the current simulation state wouldn't actually be too complicated, since everything is structured into simulation + dynamics variables. Standard Julia utilities like JLD2 would make it quite simple, and a periodic callback could then be used to write checkpoints throughout the simulation. This works well for a single trajectory but might get more complicated in the ensemble case.
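
Something along these lines, as a rough sketch assuming the solvers take standard DifferentialEquations.jl callbacks (the function and file names here are just placeholders):

```julia
using DiffEqCallbacks: PeriodicCallback
using JLD2: jldsave

"Write the integrator state to `path` every `interval` units of simulation time."
function checkpoint_callback(path::AbstractString; interval=100.0)
    save_checkpoint(integrator) = jldsave(path; t=integrator.t, u=integrator.u)
    return PeriodicCallback(save_checkpoint, interval; save_positions=(false, false))
end
```

Restarting would then be a matter of loading `t` and `u` back from the file and rebuilding the dynamics variables from them; triggering on wall time rather than simulation time would need a plain `DiscreteCallback` instead.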

It would be nice if the checkpoint restart format was equivalent to the input for a regular simulation, so there's no special logic for resuming.

What sort of database integration did you have in mind? For storing trajectory outcomes like sticking/desorption? Or for the trajectory data itself?
An interesting idea would be to have a database with an entry for each trajectory, storing the data needed to start/restart the trajectory and linking to the results/observables for that trajectory. That way the database would provide a unified place to set up your simulations, view their current status and extract the results. I'm not sure what an implementation of this would look like.
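
To make that concrete, a purely illustrative sketch of the "one row per trajectory" layout using SQLite.jl; the schema and file names are invented for the example:

```julia
using SQLite, DBInterface

db = SQLite.DB("trajectories.db")
DBInterface.execute(db, """
    CREATE TABLE IF NOT EXISTS trajectories (
        id INTEGER PRIMARY KEY,
        status TEXT,            -- e.g. 'queued', 'running', 'finished', 'timed_out'
        checkpoint_path TEXT,   -- JLD2 file needed to start/restart the trajectory
        result_path TEXT        -- where the observables for this trajectory end up
    )
""")

# Register a trajectory before submitting it.
DBInterface.execute(db,
    "INSERT INTO trajectories (status, checkpoint_path) VALUES ('queued', 'traj_001.jld2')")
```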
