Fix bug in process function interrupt handlers causing `verdi daemon stop` to timeout #2966
Conversation
Recently, signal handlers were added to catch interrupt signals in order to kill active processes that were `run` in a local interpreter. Since process functions create their own runner, separate handlers had to be added there. However, these handlers were then also attached when the functions were run in a daemon runner. Shutting the daemon down would then trigger these signal handlers and the process functions would be killed.
thanks @sphuber !
just a few requests for clarification
aiida/engine/processes/functions.py
Outdated
```python
the same runner and it would be blocking the event loop from continuing.

:param args: input arguments to construct the FunctionProcess
:param kwargs: input keyword arguments to construct the FunctionProcess
:return: tuple of the outputs of the process and the process node pk
:rtype: (dict, int)
"""
current_runner = get_manager().get_runner()
```
maybe move this line directly before the if not ...
done
aiida/engine/processes/functions.py
Outdated
"""Send the kill signal to the process in the current scope.""" | ||
LOGGER.critical('runner received interrupt, killing process %s', process.pid) | ||
process.kill(msg='Process was killed because the runner received an interrupt') | ||
# Only add handlers for interrupt signal to kill the current process if we are not in a daemon runner |
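For context, the guard that this comment introduces might look roughly as follows — a sketch, assuming the runner exposes an `is_daemon_runner` property (the exact attribute name in aiida-core may differ), with `kill_process` and `get_manager` taken from the snippets above:

```python
import signal

current_runner = get_manager().get_runner()

# A daemon shutdown delivers the interrupt signal to its workers; attaching
# kill_process there would kill running process functions, hence the guard.
if not current_runner.is_daemon_runner:
    signal.signal(signal.SIGINT, kill_process)
```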
maybe add what the alternative is (i.e. if you are not in a daemon runner, what type of runner are you in then?).
aiida/engine/processes/functions.py
Outdated
```python
# Only add handlers for interrupt signal to kill the current process if we are not in a daemon runner
# Without this check, running process functions in a daemon worker would be killed if the daemon is shutdown
original_handler = None
original_signal = signal.SIGINT
```
`original_signal` somehow suggests there will be a new signal as well, but there isn't, right?
I guess you could call it `kill_signal`
...
same for the handler
Good point for the signal, I will change it. However, I am not sure the change for the handler would be good. This stems from the somewhat weird way of unbinding a signal handler in python: you cannot actually unbind it as such, you simply have to call `signal.signal(signalnum, handler)` again, where the handler then needs to be what it was before. That is why I store that handler in `original_handler` before overriding it, and later restore it. I will add a comment where I save the handler.
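In code, the save-and-restore pattern being described might look like this — a sketch, reusing the `kill_process` handler from above and a hypothetical `process.execute()` call standing in for the actual process execution:

```python
import signal

kill_signal = signal.SIGINT

# Python offers no way to unbind a handler outright; the only option is to
# re-register whatever handler was bound before, so save it first.
original_handler = signal.getsignal(kill_signal)
signal.signal(kill_signal, kill_process)

try:
    result = process.execute()
finally:
    # Restore the pre-existing handler so it does not outlive this process
    signal.signal(kill_signal, original_handler)
```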
The `run_get_node` method of process functions attaches a `kill_process` handler to interrupt signals, such that when the function process is interrupted during execution, it is properly cleaned up, for example by setting the process state on the node to `Killed`. These handlers were, however, not properly unbound after the process terminated and so would be triggered even when the process instance was no longer alive. This would cause the `kill_process` handler to hang. In the case of a daemon runner, this would prevent the python process from shutting down, which manifested itself in `verdi daemon stop` timing out if one of the daemon runners had run even a single process function. Note that the previous commit already fixed this problem in a different way, by not attaching the interrupt handlers for process functions run in a daemon runner in the first place, but the reasoning for that change is different.
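A rough manual sanity check for the handler leak described in this commit message — not a test from this PR, just a hypothetical illustration using a trivial calcfunction:

```python
import signal

from aiida.engine import calcfunction
from aiida.orm import Int

@calcfunction
def add(x, y):
    """Trivial process function for illustration."""
    return x + y

before = signal.getsignal(signal.SIGINT)
result = add(Int(1), Int(2))
after = signal.getsignal(signal.SIGINT)

# With the fix, the interrupt handler no longer outlives the process
assert before is after, 'interrupt handler leaked past process termination'
```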
Force-pushed from f291df6 to 705d0fb
Thanks @sphuber
this is as far as I can review this ;-) not saying I follow everything here in detail...
Thanks for this and all the other thoughtful reviews, Leo
Fixes #2963
The issue is addressed in two separate commits, because the source of the issue was two-fold. Although this solves this particular issue and makes the handling of interrupts in local runners more robust, a more general problem was laid bare, as documented in issue #2965.
Unfortunately, I haven't found a way to add unit tests for the issue that this PR solves.