Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running several verdi run processes independently crashes due to UniquenessError when creating the auto groups #997

Closed
lekah opened this issue Dec 18, 2017 · 5 comments · Fixed by #3650

Comments

@lekah
Copy link
Contributor

lekah commented Dec 18, 2017

I found a case of a calculation going into SUBMISSIONFAILED with the following log-message:

*** 1739620 [Li27Na27C27O81-1000-branching-19-NVE-5-0]: SUBMISSIONFAILED
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 1 LOG MESSAGES:
+-> ERROR at 2017-12-17 20:29:05.260324+00:00
| Submission of calc 1739620 failed, check also the log file! Traceback: Traceback (most recent call last):
| File "/home/kahle/git/AiiDA/aiida_core-screening/aiida/daemon/execmanager.py", line 623, in submit_calc
| remotedata.store()
| File "/home/kahle/git/AiiDA/aiida_core-screening/aiida/orm/implementation/general/node.py", line 1571, in store
| name=group_name, type_string=VERDIAUTOGROUP_TYPE)[0]
| File "/home/kahle/git/AiiDA/aiida_core-screening/aiida/orm/implementation/general/group.py", line 149, in get_or_create
| bla = cls(*args, **kwargs).store(), True
| File "/home/kahle/git/AiiDA/aiida_core-screening/aiida/orm/implementation/django/group.py", line 127, in store
| raise UniquenessError("A group with the same name (and of the "
| UniquenessError: A group with the same name (and of the same type) already exists, unable to store

I suspect that the get_or_create method of the Group is not safe for multi-threaded/multi-processing environment, because the underlying implementation is not an 'INSERT IF' construct, but first a query, and then the storage, which can result in the Uniqueness being broken.
We need to make the get_or_create method threadsafe.
I also, what is the point of grouping submitted calculations? Is that a necessary feature?

@sphuber
Copy link
Contributor

sphuber commented Dec 18, 2017

The problem is that @lekah was launching his calculations through a function that was called with verdi run. The verdi run command has a -g option that is enabled by default and will use the Autogroup class to automatically assign all the created nodes in the execution to this group. If no specific value is passed for the group name, a name is generated based on the current time to the nearest second, e.g. Verdi autogroup on 2017-11-06 12:26:18. If the verdi run call is done in parallel, multiple processes may incur the autogroup call within the same second and both attempt to create the same group, leading to the UniquenessError.

The core problem is that the Group.get_or_create function is not thread-safe.
However, I also think the choice of making the -g option enable by default for verdi run is an odd choice. I never even knew this existed and I have tons of useless groups created in my database that you can only see if you actually query the database directly.

@sphuber sphuber modified the milestones: 1.0 release, v1.0.0 May 9, 2018
@giovannipizzi
Copy link
Member

Probably the solution is to document that AiiDA is not 'thread-safe'

@lekah lekah removed their assignment Dec 12, 2018
@sphuber
Copy link
Contributor

sphuber commented Mar 15, 2019

@giovannipizzi do you have any suggestion on where to put this in the documentation?

@giovannipizzi
Copy link
Member

Mmm... I'm not sure... I think this goes in hand with all the discussion we had about checking all other thread/process-safety (e.g. when storing attributes, extras, ...). I don't remember if we said this can happen for the 1.0?

Anyway, thinking back to this after a few months, I think that if we manage to fix the get_or_create function to avoid problems, it's the best solution (actually, in this case, it should be as simple as try/catching the UniquenessError in the group creation [for the autogroup only]) and if this happens, retry to create a new group with -NUM appended, incrementing NUM until when it succeeds (maybe with a large safety max, e.g. a few thousands?) [we need also to understand how to test it, though, an integration test is complex, but maybe we can think to a unit test where the groups are already there and we see if the right group name is generated].

@sphuber sphuber modified the milestones: v1.0.0, v1.1.0 Apr 3, 2019
@giovannipizzi giovannipizzi changed the title submission fails if several processes submit independently due to UniquenessError Running several verdi run processes independently crashes due to UniquenessError when creating the auto groups Dec 10, 2019
@giovannipizzi
Copy link
Member

giovannipizzi commented Dec 10, 2019

Small script to create 30 nodes "quickly":

#!/usr/bin/env runaiida
import os
import time
from datetime import datetime
from aiida.orm import Data

N = 30

current_pid = os.getpid()

print(current_pid)

nodes = []
for i in range(N):
    time.sleep(0.1)
    data = Data().store()
    nodes.append(data.pk)

print(nodes)

You can save the script above in a file spawn.py and then run it from bash as

(./spawn.py & ) ; ( ./spawn.py & )

so that they run in parallel. The second will crash with a traceback similar to the one above.
We can fix this, and the check how to make a test.

giovannipizzi added a commit to giovannipizzi/aiida-core that referenced this issue Apr 2, 2020
This also remove an overzelous isinstance check, and
moves additional checks in a cached function that is run only when
storing the very first node (that needs to be put in an autogroup),
making storing of nodes faster (even if times oscillates so it's hard
to estimate exactly by how much).

Also, added logic to allow for concurrent creation of multiple groups
(and test). This fixes aiidateam#997
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment