Make the `execmanager.retrieve_calculation` idempotent'ish #3142

sphuber · 2019-07-07T15:42:45Z

The `retrieve_calculation` function would raise an exception if called multiple times for the same calculation, which can happen if the runner was interrupted the first time it was working on it, for example due to a daemon shutdown. The reason is that, the second time around, adding the `retrieved` folder data node raises a uniqueness exception, because there can only be one output with the same label.

Note that full idempotency is impossible, but this change should make the problem a lot less likely to occur. The idea is to delay the actual attaching of the `retrieved` folder data node to the last possible moment. This way, if the method is called again and the folder is already there, we can be reasonably sure that the files were already retrieved successfully and we simply return, making the call a no-op. To this end, the function checks at the very beginning whether the output node already exists, using the `LinkManager.first()` call. If the node exists, the retrieve function has apparently been called before and reached the end of the function, where it adds the retrieved folder. This means all the files were already successfully retrieved, so we can safely skip.

The `LinkManager.first()` method is adapted to simply return `None` when no entry is found at all, instead of raising a `ValueError`. This is more consistent with the behavior of `QueryBuilder.first()`.
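The pattern described above (guard at the start, attach the output at the very end) can be illustrated with a minimal, self-contained sketch. It does not use the AiiDA API: `ToyCalculation`, `UniquenessError`, `first_output`, `add_output` and `retrieve` are hypothetical names chosen only for illustration, and the real `retrieve_calculation` does considerably more work.

```python
class UniquenessError(Exception):
    """Raised when an output with an already used label is attached."""


class ToyCalculation:
    """Hypothetical stand-in for a calculation node; not the AiiDA API."""

    def __init__(self):
        self._outputs = {}

    def first_output(self, label):
        # Return None when no output with this label exists, instead of raising.
        return self._outputs.get(label)

    def add_output(self, label, node):
        # Only one output per label is allowed.
        if label in self._outputs:
            raise UniquenessError(f"output '{label}' already exists")
        self._outputs[label] = node


def retrieve(calculation, fetch_files):
    """Retrieve files once; return False if a previous call already finished."""
    # Guard at the very beginning: an attached folder means a previous call
    # reached the end of this function, so there is nothing left to do.
    if calculation.first_output('retrieved') is not None:
        return False

    # This step may be interrupted; nothing has been attached at this point.
    folder = fetch_files()

    # Attach the folder as the very last step.
    calculation.add_output('retrieved', folder)
    return True


calc = ToyCalculation()
fetch_calls = []


def fetch_files():
    fetch_calls.append('fetch')
    return ['output.txt']


first_call = retrieve(calc, fetch_files)   # performs the retrieval
second_call = retrieve(calc, fetch_files)  # no-op, no uniqueness error
```

In this sketch an interruption during `fetch_files` leaves the calculation without a `retrieved` output, so a later call simply starts over. An interruption between the fetch and the attach step would repeat the fetch, which is one reason the behavior is only "idempotent'ish".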

giovannipizzi

Ok - ideally, the final solution would be to make `add_incoming` an atomic operation (wrapping it in a transaction?). I think we can approve this; could you maybe add a note to check/improve this in the issue where we should address atomicity, transactions, etc.?

I also agree with returning `None` in `first()`; this is consistent with SQLAlchemy's `first()` method, as opposed to `one()`, which raises instead.
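To make the `first()` versus `one()` distinction concrete, here is a minimal sketch of the two conventions. `ToyResults` is a hypothetical class written for this illustration; it is not the SQLAlchemy or AiiDA implementation, and it raises a plain `ValueError` where those libraries use their own exception types.

```python
class ToyResults:
    """Hypothetical result collection illustrating first() versus one()."""

    def __init__(self, items):
        self._items = list(items)

    def first(self):
        # Lenient accessor: an empty result is not an error.
        return self._items[0] if self._items else None

    def one(self):
        # Strict accessor: anything other than exactly one result is an error.
        if len(self._items) != 1:
            raise ValueError(
                f'expected exactly one result, found {len(self._items)}'
            )
        return self._items[0]


empty = ToyResults([])
single = ToyResults(['retrieved'])

missing = empty.first()   # None, so the caller can test with "is None"
present = single.one()    # the single entry
```

A lenient `first()` lets the caller write a simple `is None` check as an early-return guard, rather than wrapping the lookup in a `try`/`except` block.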

sphuber requested a review from giovannipizzi July 7, 2019 15:42

sphuber force-pushed the fix_3141_execmanager_retrieve_calculation_idempotence branch from 19e3e84 to 81de7b2 Compare July 8, 2019 10:07

sphuber mentioned this pull request Jul 8, 2019

Make the execmanager.upload_calculation idempotent'ish #3146

Merged

giovannipizzi approved these changes Jul 9, 2019

Merge branch 'develop' into fix_3141_execmanager_retrieve_calculation…

af293a9

…_idempotence

sphuber merged commit 8447c6b into aiidateam:develop Jul 9, 2019

sphuber deleted the fix_3141_execmanager_retrieve_calculation_idempotence branch July 9, 2019 13:27

sphuber mentioned this pull request Oct 4, 2020

JobProcess task_retrieve_job is not idempotent which can cause failures #2265

Closed
