Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make supervisor setup more reliable #1300

Merged
merged 1 commit into from
Apr 30, 2018

Conversation

wingo
Copy link
Contributor

@wingo wingo commented Mar 19, 2018

There was a race condition when setting up the supervisor such that in
some cases it was possible to miss a signal when the parent process
died. We could reproduce this with by running a "snabb lwaftr bench",
but only on our test machine with two NUMA nodes and only when setting
--cpu on the lwaftr. In that case the problem would appear when
running "snabb lwaftr monitor" on the lwaftr, whose supervisor process
would hang reading from the signalfd. Because the supervisor process
still had stdout open, then when piping monitor output to "grep", the grep
process would hang because the write side of its stdin pipe would
still be open as well.

This patch fixes this error by making cleanup reliable. It does so by
taking a POSIX lock on an unnamed file in the parent, then taking
another lock from the supervisor child process. In this way we avoid
some of the more arcane parts of Linux.

There was a race condition when setting up the supervisor such that in
some cases it was possible to miss a signal when the parent process
died.  We could reproduce this with by running a "snabb lwaftr bench",
but only on our test machine with two NUMA nodes and only when setting
--cpu on the lwaftr.  In that case the problem would appear when
running "snabb lwaftr monitor" on the lwaftr; the monitor process
would hang reading from the signalfd.  Because the monitor process
still had stdout open, then when piping its output to "grep", the grep
process would hang because the write side of its stdin pipe would
still be open as well.

This patch fixes this error by making cleanup reliable.  It does so by
taking a POSIX lock on an unnamed file in the parent, then taking
another lock from the supervisor child process.  In this way we avoid
some of the more arcane parts of Linux.
@wingo
Copy link
Contributor Author

wingo commented Mar 19, 2018

See comment at Igalia#1031 (comment) for more details about the failure.

@eugeneia
Copy link
Member

Interviewed @wingo about this on Slack. LGTM, merging onto max-next. :-)

@eugeneia eugeneia merged commit da3a2f8 into snabbco:master Apr 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants