Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in process function interrupt handlers causing verdi daemon stop to timeout #2966

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jun 5, 2019

Fixes #2963

The issue is addressed in two separate commits, because the source of the issue was two-fold. Although this solves this particular issue and makes the handling of interrupts in local runners more robust, a more general problem was laid bare as documented in issue #2965 .
Unfortunately, I haven't found a way to add unit tests for the issue that this PR solves.

Recently, signal handlers were added to catch interrupt signal to kill
active processes that were `run` in a local interpreter. Since process
functions create their own runner, separate handlers had to be added
there. However, these handlers were than also attached when the
functions are run in a daemon runner. Shutting the daemon down would
then trigger these signal handlers and the process functions would be
killed.
@sphuber sphuber requested a review from ltalirz June 5, 2019 12:30
@coveralls
Copy link

coveralls commented Jun 5, 2019

Coverage Status

Coverage decreased (-0.04%) to 72.913% when pulling 705d0fb on sphuber:fix_2963_runner_kill_process_signal into 5ff132e on aiidateam:develop.

Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sphuber !
just a few requests for clarification

the same runner and it would be blocking the event loop from continuing.

:param args: input arguments to construct the FunctionProcess
:param kwargs: input keyword arguments to construct the FunctionProcess
:return: tuple of the outputs of the process and the process node pk
:rtype: (dict, int)
"""
current_runner = get_manager().get_runner()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move this line directly before the if not ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"""Send the kill signal to the process in the current scope."""
LOGGER.critical('runner received interrupt, killing process %s', process.pid)
process.kill(msg='Process was killed because the runner received an interrupt')
# Only add handlers for interrupt signal to kill the current process if we are not in a daemon runner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add what the alternative is (i.e. if you are not in a daemon runner, what type of runner are you in then?).

# Only add handlers for interrupt signal to kill the current process if we are not in a daemon runner
# Without this check, running process functions in a daemon worker would be killed if the daemon is shutdown
original_handler = None
original_signal = signal.SIGINT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

original_signal somehow suggests there will be a new signal as well, but there isn't right?
I guess you could call it kill_signal...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for the handler

Copy link
Contributor Author

@sphuber sphuber Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point for the signal, I will change it. However, I am not sure if the change for the handler would be good. This stems from the somewhat weird way of unbinding a signal in python. You cannot actually unbind it as such, you simply need to call signal.signal(signal, handler) again and here then handler needs to be what it was before. That is why I store that handle in the original_handler before overriding it, and then later you restore it. I will add a comment where I save the handler

The `run_get_node` method of process functions attaches a `kill_process`
handler to interrupt signals, such that when the funtion process is
interrupted during execution, it is properly cleaned up, for example by
setting the process state on the node to `Killed`. These handlers were,
however, not properly unbound after the process was terminated and so
would be triggered even after the process instance was no longer alive.
This would cause the `kill_process` handler to hang. In the case of a
daemon runner, this would prevent the python process from shutting down.
This manifested itself in `verdi daemon stop` timing out if one of the
daemon runners had run just a single process function. Note that the
previous commit already fixed this problem in a different way, by not
even attaching the interrupt handlers for process function in the first
place. But the reasoning for that is different.
@sphuber sphuber force-pushed the fix_2963_runner_kill_process_signal branch from f291df6 to 705d0fb Compare June 5, 2019 14:59
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sphuber
this is as far as I can review this ;-) not saying I follow everything here in detail...

@sphuber sphuber merged commit 3771945 into aiidateam:develop Jun 5, 2019
@sphuber sphuber deleted the fix_2963_runner_kill_process_signal branch June 5, 2019 19:18
@sphuber
Copy link
Contributor Author

sphuber commented Jun 5, 2019

Thanks for this and all the other thoughtful reviews Leo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stopping the daemon always seems to hit a timeout
3 participants