Make supervisor setup more reliable #1300
Merged
There was a race condition when setting up the supervisor such that in
some cases it was possible to miss a signal when the parent process
died. We could reproduce this by running "snabb lwaftr bench",
but only on our test machine with two NUMA nodes and only when setting
--cpu on the lwaftr. In that case the problem would appear when
running "snabb lwaftr monitor" on the lwaftr, whose supervisor process
would hang reading from the signalfd. Because the supervisor process
still had stdout open, piping the monitor output to "grep" would make
the grep process hang, since the write side of its stdin pipe would
still be open as well.
This patch fixes this error by making cleanup reliable. It does so by
taking a POSIX lock on an unnamed file in the parent, then taking
another lock from the supervisor child process. In this way we avoid
some of the more arcane parts of Linux.
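
The locking scheme described above can be sketched roughly as follows.
This is an illustration in Python, not the code from this patch: the
names spawn_supervisor, on_parent_exit and demo are hypothetical, and
the pipe is only there so the demo can observe the cleanup. It relies
on two POSIX properties: record locks are not inherited across fork(),
and the kernel drops a process's locks when that process exits.

```python
import fcntl
import os
import tempfile


def spawn_supervisor(on_parent_exit):
    """Fork a supervisor child that runs on_parent_exit once the caller dies.

    Hypothetical helper for illustration. Returns (child_pid, lock_file) in
    the parent; the parent must keep lock_file open, because closing it
    would release the lock early.
    """
    # Unnamed temporary file: it has no path that another process could open.
    lock_file = tempfile.TemporaryFile()
    # Take the lock *before* forking, so the child can never win the lock
    # while the parent is still alive.
    fcntl.lockf(lock_file, fcntl.LOCK_EX)

    pid = os.fork()
    if pid == 0:
        try:
            # POSIX locks are per process and not inherited by the child,
            # so this call blocks until the parent's lock goes away, which
            # happens when the parent exits for any reason.
            fcntl.lockf(lock_file, fcntl.LOCK_EX)
            on_parent_exit()
        finally:
            os._exit(0)
    return pid, lock_file


def demo():
    """Run a short-lived parent and return what its supervisor reported."""
    read_end, write_end = os.pipe()
    worker = os.fork()
    if worker == 0:
        # This process plays the role of the parent being supervised.
        os.close(read_end)
        _pid, _lock_file = spawn_supervisor(
            lambda: os.write(write_end, b"cleaned up")
        )
        # Die abruptly, without sending any signal to the supervisor.
        os._exit(0)

    os.close(write_end)
    os.waitpid(worker, 0)
    chunks = []
    while True:
        # EOF arrives only after the supervisor has exited and closed its
        # copy of the write end.
        chunk = os.read(read_end, 1024)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_end)
    return b"".join(chunks)


if __name__ == "__main__":
    print(demo())
```

Because the lock is taken before the fork, there is no window in which
the parent can die unnoticed: the child's blocking lock request returns
only after the parent's lock has been released.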