Skip to content

Commit

Permalink
Make the execmanager.retrieve_calculation idempotent'ish (aiidateam…
Browse files Browse the repository at this point in the history
…#3142)

The `retrieve_calculation` would cause an exception if called multiple
times for the same calculations, which can happen if the first time that
the runner was working on it got interrupted, for example due to a
daemon shutdown. The reason is that the second time around the adding of
the `retrieved` folder data node will raise a uniqueness exception,
because there can only be one output with the same label.

Note that full idempotency is impossible, but this change should make
the problem a lot less likely to occur. The idea is to delay the actual
attaching of the retrieved folder data node to the last moment possible.
This way, if the method is called again and the folder is already there,
we can be reasonably sure that the files were already retrieved
successfully and we simply return, leaving the call a no-op. This is done
in the beginning of the function to check if the output node already
exists using the `LinkManager.first()` call. If the node exists, the
retrieve function has apparently already been called before and reached
the end of the function where it adds the retrieved folder. This means
all the files were already successfully retrieved so we can safely skip.

The `LinkManager.first()` method is adapted to, instead of raising a
`ValueError` when no entry is found at all, simply return `None`. This
is more consistent with the behavior of `QueryBuilder.first()`.
  • Loading branch information
sphuber authored and d-tomerini committed Sep 30, 2019
1 parent a54d7a5 commit d9e0c60
Show file tree
Hide file tree
Showing 2 changed files with 22 additions and 12 deletions.
28 changes: 19 additions & 9 deletions aiida/engine/daemon/execmanager.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,9 @@
from __future__ import division
from __future__ import print_function
from __future__ import absolute_import

import os

import warnings
from six.moves import zip

from aiida.common import AIIDA_LOGGER, exceptions
Expand Down Expand Up @@ -253,17 +253,22 @@ def retrieve_calculation(calculation, transport, retrieved_temporary_folder):
listed, if any, in the `retrieved_temporary_folder` of the jobs CalcInfo
"""
logger_extra = get_dblogger_extra(calculation)

execlogger.debug("Retrieving calc {}".format(calculation.pk), extra=logger_extra)
workdir = calculation.get_remote_workdir()

execlogger.debug(
"[retrieval of calc {}] chdir {}".format(calculation.pk, workdir),
extra=logger_extra)

# Create the FolderData node to attach everything to
execlogger.debug("Retrieving calc {}".format(calculation.pk), extra=logger_extra)
execlogger.debug("[retrieval of calc {}] chdir {}".format(calculation.pk, workdir), extra=logger_extra)

# If the calculation already has a `retrieved` folder, simply return. The retrieval was apparently already completed
# before, which can happen if the daemon is restarted and it shuts down after retrieving but before getting the
# chance to perform the state transition. Upon reloading this calculation, it will re-attempt the retrieval.
link_label = calculation.link_label_retrieved
if calculation.get_outgoing(FolderData, link_label_filter=link_label).first():
execlogger.warning('CalcJobNode<{}> already has a `{}` output folder: skipping retrieval'.format(
calculation.pk, link_label))
return

# Create the FolderData node into which to store the files that are to be retrieved
retrieved_files = FolderData()
retrieved_files.add_incoming(calculation, link_type=LinkType.CREATE, link_label=calculation.link_label_retrieved)

with transport:
transport.chdir(workdir)
Expand Down Expand Up @@ -300,6 +305,11 @@ def retrieve_calculation(calculation, transport, retrieved_temporary_folder):
extra=logger_extra)
retrieved_files.store()

# Make sure that attaching the `retrieved` folder with a link is the last thing we do. This gives the biggest chance
# of making this method idempotent. That is to say, if a runner get's interrupted during this action, it will simply
# retry the retrieval, unless we got here and managed to link it up, in which case we move to the next task.
retrieved_files.add_incoming(calculation, link_type=LinkType.CREATE, link_label=calculation.link_label_retrieved)


def kill_calculation(calculation, transport):
"""
Expand Down
6 changes: 3 additions & 3 deletions aiida/orm/utils/links.py
Original file line number Diff line number Diff line change
Expand Up @@ -222,12 +222,12 @@ def one(self):
def first(self):
"""Return the first entry from the iterator.
:return: LinkTriple instance
:raises ValueError: if the iterator contains anything but one entry
:return: LinkTriple instance or None if no entries were matched
"""
if self.link_triples:
return self.link_triples[0]
raise ValueError('no entries found')

return None

def all(self):
"""Return all entries from the list.
Expand Down

0 comments on commit d9e0c60

Please sign in to comment.