-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make the execmanager.submit_calculation
idempotent'ish
#3188
Make the execmanager.submit_calculation
idempotent'ish
#3188
Conversation
The `submit_calculation` is responsible for submitting a job, that was already uploaded to the remote machine, to the scheduler. There was a risk of the daemon submitting the same job twice. If the first time around it successfully submitted the job and set the job id as an attribute on the node, but then was interrupted (for example due to a shutdown) before it could transition the `CalcJob` to the `UPDATE` transport task, once reloaded the command would be executed again. This would not complain and simply submit the job again to the scheduler. This is both a waste of resources and can cause various complications. It is impossible to make the function fully idempotent, but by checking if the job id attribute is already set for the node at the beginning and returning it should minimize the risk of double submission.
b44b0da
to
caceba0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good!
Another thing to check when we address #2496 is that the set_job_id call is atomic and stores asap.
as you say this is not enough if the daemon gets shut down before that call but after submission.
The only option to make this really idempotent is to delegate the check to the scheduler, but this is beyond the scope of this PR here.
Eg have a wrapper for squeue that submits only once per folder (e. g. storing the jobid on a file in the folder, say _aiida_jobid.txt, and before submission it checks if it exists - if it's there it returns it, otherwise it proxies to squeue.
Not for now, but would it be possible/easy to get from slurm all the job id's you have submitted, say, in the past 1 week, and check if any of them is unknown to AiiDA? (e. g. meaning that even with this fix we would have run the simulation twice)?
aybe this can be tested also after this PR is merged.
see e.g. slurm commands here https://ubccr.freshdesk.com/support/solutions/articles/5000686909-how-to-retrieve-job-history-and-accounting |
Should be easy enough. To get a unique list of all job ids for given user since given data:
this can then be compared with the job ids stored in the attributes. |
) The `submit_calculation` is responsible for submitting a job, that was already uploaded to the remote machine, to the scheduler. There was a risk of the daemon submitting the same job twice. If the first time around it successfully submitted the job and set the job id as an attribute on the node, but then was interrupted (for example due to a shutdown) before it could transition the `CalcJob` to the `UPDATE` transport task, once reloaded the command would be executed again. This would not complain and simply submit the job again to the scheduler. This is both a waste of resources and can cause various complications. It is impossible to make the function fully idempotent, but by checking if the job id attribute is already set for the node at the beginning and returning it should minimize the risk of double submission.
) The `submit_calculation` is responsible for submitting a job, that was already uploaded to the remote machine, to the scheduler. There was a risk of the daemon submitting the same job twice. If the first time around it successfully submitted the job and set the job id as an attribute on the node, but then was interrupted (for example due to a shutdown) before it could transition the `CalcJob` to the `UPDATE` transport task, once reloaded the command would be executed again. This would not complain and simply submit the job again to the scheduler. This is both a waste of resources and can cause various complications. It is impossible to make the function fully idempotent, but by checking if the job id attribute is already set for the node at the beginning and returning it should minimize the risk of double submission.
Fixes #3183
The
submit_calculation
is responsible for submitting a job, that wasalready uploaded to the remote machine, to the scheduler. There was a
risk of the daemon submitting the same job twice. If the first time
around it successfully submitted the job and set the job id as an
attribute on the node, but then was interrupted (for example due to a
shutdown) before it could transition the
CalcJob
to theUPDATE
transport task, once reloaded the command would be executed again. This
would not complain and simply submit the job again to the scheduler.
This is both a waste of resources and can cause various complications.
It is impossible to make the function fully idempotent, but by checking
if the job id attribute is already set for the node at the beginning and
returning it should minimize the risk of double submission.