Processes become unreachable when rebooting local machine #5699
Thanks for the report. Could you provide an answer to the following questions:
You say that all […]
Thank you Sebastiaan for the questions.
Please note that if I have something like four daemon workers running, AiiDA complains that 400% of the workers are occupied, even when there are no processes in […]
I think I mentioned at the very beginning of the bug description that the process cannot be played or killed. But yes, I could have been clearer.
Next I try […]
Following is the error I receive when I run […]:
PublishError                              Traceback (most recent call last)
~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py in rpc_send(self, recipient_id, msg)
~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/messages.py in publish_expect_response(self, message, routing_key, mandatory)
~/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/messages.py in publish(self, message, routing_key, mandatory)
~/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/exchange.py in publish(self, message, routing_key, mandatory, immediate, timeout)
UnroutableError: ('NO_ROUTE', '[rpc].510265')
Thanks for the additional info.
You are right, it is just that the behavior you describe is quite unique and not something I have come across. The only cause I have found so far for a process not being reachable is that the corresponding RabbitMQ task is missing. So far, doing the […]
A daemon slot is also occupied by processes that are in the waiting state. It is really difficult to debug further over GitHub, so I will contact you on Slack.
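To illustrate the point about slot accounting, here is a minimal sketch (not part of aiida-core; the list of "active" states is an assumption based on AiiDA's documented process states) of how one could count the processes that still occupy daemon worker slots, i.e. everything not yet in a terminal state:

```python
# Minimal sketch: count processes that occupy daemon worker slots, i.e.
# processes whose process_state is not yet terminal. The state list below
# is an assumption, not taken from aiida-core internals.
from aiida import load_profile
from aiida.orm import QueryBuilder, ProcessNode

load_profile()

active_states = ['created', 'running', 'waiting']
qb = QueryBuilder().append(
    ProcessNode,
    filters={'attributes.process_state': {'in': active_states}},
)
print(f'Processes occupying daemon slots: {qb.count()}')
```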
After some interactive debugging, the situation seemed to be the following:
There is not much we can do about the daemon workers being unresponsive when the plugins they are running are IO-heavy. However, I have submitted PR #5715, which will prevent the calculations from excepting if a duplicate task is erroneously created.
@sphuber is there any general way to handle errors like […]? I have got a bunch of unreachable […]
The problem is that these processes no longer have a task with RabbitMQ. If you don't care about them anymore, nor about the data, you can simply delete them using […]
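If the stuck processes and their data are truly expendable, the following is a minimal sketch (illustrative only, not part of aiida-core) for collecting the PKs of the unreachable processes so they can be passed to the `verdi node delete` command:

```python
# Illustrative sketch: collect PKs of processes stuck in a non-terminal state
# and print a ``verdi node delete`` command for them. Only do this if the
# processes and their data are no longer needed.
from aiida import load_profile
from aiida.orm import QueryBuilder, ProcessNode

load_profile()

qb = QueryBuilder().append(
    ProcessNode,
    filters={'attributes.process_state': {'in': ['created', 'running', 'waiting']}},
    project='id',
)
stuck_pks = [str(pk) for (pk,) in qb.all()]
print('verdi node delete ' + ' '.join(stuck_pks))
```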
Describe the bug
When there are a few hundred AiiDA jobs running on an external cluster and the local machine is rebooted, all the running jobs become unreachable on completion, i.e. they get stuck in the waiting state and cannot be played, killed, or salvaged in any manner. I have tried using
get_manager().get_process_controller().continue_process(pk)
but it doesn't do anything either.

There is a second level to this issue. All these unreachable processes still occupy the daemon slots, making AiiDA complain that there aren't enough daemon workers available even when there are zero jobs running. Since these stuck jobs cannot be killed, the only way to eliminate them is to delete the nodes, but that is not an option if one wants to salvage the calculations and use their outputs in future calculations. For example, if one is running a PwRelaxWorkChain and only the final scf on the relaxed structure gets stuck, making the entire workchain unreachable, it would be desirable to rerun only that final scf instead of the entire PwRelaxWorkChain. It is even more important to salvage the already completed calculations when the parent workchain is much more complex than PwRelaxWorkChain.
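For reference, here is the `continue_process` call mentioned above, written out as a self-contained snippet (a sketch assuming aiida-core 1.6.x import paths; the PK is a placeholder, and as noted above this did not resolve the issue):

```python
# Sketch of re-sending a continuation task for a stuck process, as attempted
# in the report above. Assumes aiida-core 1.6.x; the PK is a placeholder.
from aiida import load_profile
from aiida.manage.manager import get_manager

load_profile()

pk = 1234  # placeholder: PK of the unreachable process
controller = get_manager().get_process_controller()
controller.continue_process(pk)
```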
Steps to reproduce
Steps to reproduce the behavior:
Expected behavior
Your environment
I have had this issue in two separate environments.
First environment
Second environment
Additional context
From what I understand, it is the RabbitMQ server that's causing the jobs to become unreachable.
A similar issue seems to still be unresolved.