A lot of failed jobs in AiiDA 1.5 (DuplicateSubscriber) when machine is overloaded #4598
Comments
Hi Giovanni, This does indeed look like a possible manifestation of #3973. Could you also provide a relevant portion of your RMQ log? I presume you'll see the heartbeat misses causing disconnections. As a starting hypothesis, one thing that could be happening is that the communications thread (responsible for both sending the 'submit' messages and responding to heartbeats) is blocked submitting and doesn't get a chance to respond to heartbeats. I would have to look at the code but this seems a little strange because I'm pretty sure there are As for the
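The hypothesis above can be illustrated with a minimal sketch. This is a toy model, not AiiDA or kiwipy code: the function names and the numbers are hypothetical, and it assumes that one thread handles both submissions and heartbeats, that it answers a heartbeat just before each blocking task, and that the broker drops the connection after two consecutive missed heartbeats.

```python
# Toy model (assumption, not AiiDA/kiwipy code): a single communication
# thread cannot answer heartbeats while it is blocked in a task.

# Assumed threshold: the peer is treated as unreachable after two
# consecutive missed heartbeats.
MISSED_HEARTBEATS_BEFORE_DROP = 2


def missed_heartbeats(blocking_seconds, heartbeat_interval):
    """Number of heartbeat intervals that elapse entirely while blocked."""
    return int(blocking_seconds // heartbeat_interval)


def connection_survives(task_durations, heartbeat_interval):
    """True if no single blocking task spans enough missed heartbeats to drop."""
    return all(
        missed_heartbeats(duration, heartbeat_interval) < MISSED_HEARTBEATS_BEFORE_DROP
        for duration in task_durations
    )


if __name__ == "__main__":
    interval = 60  # seconds; illustrative value only
    print(connection_survives([1, 5, 30], interval))  # short tasks
    print(connection_survives([1, 150], interval))    # one task blocks for 150 s
```

In this model a heavily loaded machine matters only because it stretches individual blocking calls; once any single call outlasts two heartbeat intervals, the connection is considered lost regardless of how short the other tasks are.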
@giovannipizzi this is a wild stab in the dark, but is there any chance that you are running multiple AiiDA profiles on this machine? Not necessarily even in the same virtual environment but just on the same machine? Are we sure that they all use different
Yes, in different venvs, but I double-checked and they have different UUIDs:
Also, I'm quite sure at least some of the problems appeared when only one of the profiles was actually doing something.
@muhrin indeed, there are a few missed heartbeats in there: Let me know if you need more logs (and which ones). As I mentioned, at some point I submitted a very large number of calculations and my computer became very slow (tens of seconds even just to do Anyway
BTW, now I restarted the daemon - but for next time, is there a way to know "live", before restarting the daemon, what listeners exist in running daemons? I.e., to discover who's using all that memory, which processes they're waiting upon, ...?
I think this is related to my proposal in #4595 (comment), i.e. this exception should result in a
I closed this through #5715 because it may solve at least part of these cases. Since these reports are very old, it is difficult to know to what extent the fix will work. It is very likely that the bug is still present, but will just occur less often. If someone comes across this bug in
Duplicate of #3973
When submitting a lot of jobs in AiiDA 1.5.0, I get many failed (excepted) jobs with reports like the following one:
or
I think other people observed the same.
The other thing I noticed (not sure if a cause or a consequence) is that now
verdi process list
gives an empty list, but the daemon workers are (while not busy) still using a lot of memory:
I don't know if this is a memory leak. What I can say is that I was submitting quite a lot of processes/workflows and my machine was under heavy stress. Maybe everything became so slow that the two heartbeats were missed?
@sphuber @muhrin @unkcpz @chrisjsewell