Engine: Do not let DuplicateSubscriberError except a Process
#5715
Conversation
@ltalirz @muhrin @giovannipizzi I think this is an important fix to ship with the next release. Although it doesn't fully address the real problem of process tasks sometimes getting lost, it does solve the most important symptom: processes getting randomly excepted when the daemon workers are under heavy load. This should hopefully contribute a lot to the stability of AiiDA.
Thanks @sphuber for looking into these stability issues; in my view these are the most important problems to solve in AiiDA at the moment. I understand that your fix addresses the problem where the user manually interferes and resubmits a task although the original is still running. [1] However, when workers are missing heartbeats although they are "still working properly in some sense", I guess the problem we need to address is those missing heartbeats. Do we have an idea of what might be causing this? If I understand correctly, with your current fix, RabbitMQ would continue to requeue those tasks whenever the next heartbeat is missed, and the new tasks would then continue to be dismissed.

[1] By the way, why would they do that? Is it that the daemon worker that took the task is overloaded, and so never actually gets to the task, which can then leave the process stuck?
Not just that: it prevents a process from excepting whenever a second worker tries to run it, for whatever reason. A user manually interfering is just one known pathway to this situation.
The problem is that there is some pathway through which process tasks can get lost. I have seen people report this happening after restarting the machine or the daemon, or after some other unusual situation, but what the exact cause is, is not clear. If the task gets lost, the process will never be resolved. This can happen at any point in the lifetime of the process and is not exclusive to newly created processes. The workaround is to recreate the task, which is done with the `verdi devel revive` command.
At some point it was easy for a daemon worker to miss responding to heartbeats, because if it was running, for example, an IO-heavy task (retrieving large files of completed calculations), it could block the event loop and fail to respond in time.

I think the current situation then is maybe not so much about the heartbeat, but about this other "unknown" pathway that can cause process tasks to be lost: some hard shutdown of the daemon, of RabbitMQ, or of the machine running either of those. It is simply not known.

Long story short, we have stayed inactive for far too long, not submitting a fix because we couldn't find the real cause of the problem. However, I think we now have a practical workaround that will at least prevent users from losing their processes to random exceptions. I am working on additional functionality that will make it easier to detect whether a task actually still exists. Once we have this, we can reliably detect missing tasks and advise users when it is safe to use `verdi devel revive`.
No, with this change, as soon as the exception is hit, the task is acknowledged and discarded. We therefore indeed assume that the original runner will still eventually complete the process. This might not be the case, but then we simply have the situation where the process task is actually lost and we can revive it. As said before, I am working on additional tools to make this analysis more robust, so we can hopefully give bulletproof advice to the user on when to revive a process or not.
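To make the intended behaviour concrete, here is a minimal sketch of such a guard. It is not the actual aiida-core implementation: only the `kiwipy.DuplicateSubscriberIdentifier` exception is taken from this discussion, while `continue_process`, `_demo` and the logger are purely illustrative names.

```python
import asyncio
import logging

import kiwipy  # the library that raises DuplicateSubscriberIdentifier (assumed installed)

LOGGER = logging.getLogger(__name__)


async def continue_process(pid, do_continue):
    """Run ``do_continue(pid)``, but treat a duplicate subscriber as a no-op.

    ``do_continue`` stands in for the engine code that recreates the Process
    and registers its RPC and broadcast subscribers.
    """
    try:
        await do_continue(pid)
    except kiwipy.DuplicateSubscriberIdentifier:
        # This worker already holds subscribers for ``pid``: assume the process is
        # already being run here, log a warning and return normally, so the duplicate
        # task is acknowledged and discarded instead of excepting the process.
        LOGGER.warning('Process<%s> seems to already be handled by this worker; ignoring duplicate task.', pid)


async def _demo(pid):
    """Stand-in for a ``Process.init`` call that registers a subscriber twice."""
    raise kiwipy.DuplicateSubscriberIdentifier(f'already subscribed to {pid}')


if __name__ == '__main__':
    asyncio.run(continue_process(42, _demo))  # logs a warning instead of raising
```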
Codecov Report — Base: 79.68% // Head: 79.69% // Increases project coverage by +0.02%.
Additional details and impacted files:
@@ Coverage Diff @@
## main #5715 +/- ##
==========================================
+ Coverage 79.68% 79.69% +0.02%
==========================================
Files 491 491
Lines 36668 36671 +3
==========================================
+ Hits 29215 29222 +7
+ Misses 7453 7449 -4
@ltalirz After a lot of debugging, I finally found the source of the failure of the test I added. I opened a separate PR for it, #5717, so as not to drown out this one. In essence, various unrelated tests messed with the global variables of the config and didn't properly reset them, causing the daemon fixtures to break and the test I added in this PR to fail. It tried starting the daemon (which actually was started), but the …
Thanks @sphuber !
After our discussion, I think I understand the thought process.
Whether caused by automatic requeuing or by user intervention, excepting a process because of a `DuplicateSubscriberIdentifier` is not user-friendly, and warning about it in the logs instead is a better way forward (until the root cause of the missing tasks / progress on tasks is fixed).
Let's first merge the PR that fixes the test fixture and then this one.
Also, as we discussed, before we merge this could you please check:
- what is RabbitMQ's mechanism for clearing the subscriber list of a channel?
- is it conceivable that RabbitMQ requeues tasks because of a lost heartbeat before it had time to update the subscriber list of the corresponding channel?
- do we have a test of this "subscriber clearing" mechanism somewhere (not necessarily in AiiDA)?
Thanks @sphuber, the PR now looks good to merge to me; just waiting for the input on the RabbitMQ side.
I did some further digging to be able to answer your questions and found out that the situation is not quite as I originally thought. It seems that the tracking of subscribers is not done by RabbitMQ but by the client libraries, `kiwipy` and `aiormq`, which keep it in memory.

This means that although this PR will protect against a single daemon worker working on the same process twice, it appears to be possible that another daemon worker starts working on a process that is already being worked on, without incurring the exception. If this were to happen, I would expect an exception at our ORM level, because one of the two workers would perform an action, and then the second would come along and do the same, which would eventually violate some check, for example the process already having been terminated. I have to say that I am really surprised that most of the reports we have had so far are of the `DuplicateSubscriberIdentifier`.

I have discussed all of this with @muhrin and he seems to agree with the analysis. I have updated the description of the changes in the PR and the commit messages. We advise that we still merge this, since it should still provide some protection against these problems.

Another problem is that the reports we are basing most of this on are two years old and older. It would be good to run new stress tests with v2.1, once the upgraded communication stack has been merged with the updated libraries.
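As a purely illustrative toy model of this client-side bookkeeping (not actual `kiwipy`/`aiormq` code), the sketch below shows why only a re-subscription within the same worker trips the duplicate check: each worker keeps its own in-memory set of subscriber identifiers that is never shared.

```python
class DuplicateSubscriberIdentifier(Exception):
    """Stand-in for the kiwipy exception of the same name."""


class Worker:
    """Toy daemon worker that tracks its subscribers purely in local memory."""

    def __init__(self, name):
        self.name = name
        self._subscribers = set()  # never shared with other workers

    def subscribe(self, identifier):
        if identifier in self._subscribers:
            raise DuplicateSubscriberIdentifier(identifier)
        self._subscribers.add(identifier)


worker_a, worker_b = Worker('a'), Worker('b')

worker_a.subscribe('process-101')  # fine
worker_b.subscribe('process-101')  # also fine: worker_b knows nothing about worker_a

try:
    worker_a.subscribe('process-101')  # same worker again: duplicate detected
except DuplicateSubscriberIdentifier:
    print('duplicate only detected within a single worker')
```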
> This means that the `DuplicateSubscriberIdentifier` is only ever raised if the same worker tries to subscribe to a process again. This is because the "cache" of registered subscribers is kept in memory by the libraries and so not shared across daemon workers.
I see. So the `DuplicateSubscriberIdentifier` is anyhow a rather weak guard.

In effect, this PR ends up treating the case of one worker being assigned the same process twice the same way as the case of two different workers being assigned the same process (except for the log message).
This makes sense to me.
> This means that although this PR will protect against a single daemon worker working on the same process twice, it appears to be possible that another daemon worker starts working on a process that is already being worked on, without incurring the exception. If this were to happen, I would expect an exception at our ORM level, because one of the two workers would perform an action, and then the second would come along and do the same, which would eventually violate some check, for example the process already having been terminated.
It might make sense to check this once. But I assume you are right.
I have thought about how to do this, but it is not trivial, since we cannot directly control to which subscriber RabbitMQ sends a message. So if we create a duplicate task, it may very well end up with the original worker. Maybe we can connect a worker with a single slot (so it will only ever take one task) and then launch two workers. I think that might work.
It is possible that when a daemon worker tries to continue a process, a ``kiwipy.DuplicateSubscriberIdentifier`` is raised. This happens when the current worker has already subscribed itself with this process identifier. The call to ``_continue`` will call ``Process.init``, which will add RPC and broadcast subscribers. ``kiwipy`` and ``aiormq`` further down keep track of processes that have already been subscribed to, and if a second subscription is attempted, a ``DuplicateSubscriberIdentifier`` is raised. Possible reasons for the worker receiving a process task that it already has include:

1. The user mistakenly recreates the corresponding task, thinking the original task was lost.
2. RabbitMQ requeues the task because the daemon worker lost its connection or did not respond to the heartbeat in time, and the task is sent to the same worker once it regains connection.

Here we assume that the existence of another subscriber indicates that the process is still being run by this worker. We thus ignore the request to have the worker take it on again and acknowledge the current task. If our assumption was wrong and the original task was no longer being worked on, the user can resubmit the task once the list of subscribers of the process has been cleared.

Note: in the second case we are deleting the *original* task, and once the worker finishes running the process there won't be a task in RabbitMQ to acknowledge anymore. This, however, is silently ignored.

Note: the exception is raised by ``kiwipy`` based on an internal cache that it and ``aiormq`` keep of the current subscribers. This means that it will only occur when the task is resent to the *same* daemon worker. If another worker were to receive it, no exception would be raised, as the check is client-based and not server-based.
This is more expensive than the original idea of having it session scoped, since now the daemon is stopped after each use of the fixture; however, not stopping the daemon can cause tests to fail if previous tests leave it in an inconsistent state.
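For illustration, a hedged sketch of the trade-off described here, with an entirely hypothetical daemon controller and fixture (not the actual aiida-core test fixtures): function scope stops the daemon after every test, which is slower but guarantees each test starts from a clean state.

```python
import pytest


class FakeDaemon:
    """Hypothetical stand-in for a daemon controller, used only for illustration."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


@pytest.fixture(scope='function')  # session scope would be cheaper, but state can leak between tests
def daemon():
    """Start the (fake) daemon for a single test and always stop it afterwards."""
    instance = FakeDaemon()
    instance.start()
    yield instance
    instance.stop()  # stopping after each test trades speed for isolation


def test_daemon_is_running(daemon):
    assert daemon.running
```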
Fixes #4598
Fixes #3973
It is possible that when a daemon worker tries to continue a process, a `kiwipy.DuplicateSubscriberError` is raised, which means that there already is another system process that has subscribed to be running that process. This can occur for at least two reasons:

1. The user mistakenly recreates the corresponding task, thinking the original task was lost.
2. RabbitMQ requeues the task because the daemon worker lost its connection or did not respond to the heartbeat in time, and the task is sent to the same worker once it regains connection.

In both cases the task is actually still being run, and we should not let this exception except the entire process. Unfortunately, these duplicate tasks still happen quite a lot when the daemon workers are under heavy load, and we don't want a bunch of processes to be excepted because of this.
In most cases, just ignoring the task will be the best solution. In the end, the original task is likely to complete. If it turns out that the task actually got lost and the process is stuck, the user can find this error message in the logs and manually recreate the task, using for example `verdi devel revive`.