Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make channel for StartTransientUnit buffered #1781

Merged
merged 1 commit into from
Apr 24, 2018

Conversation

filbranden
Copy link
Contributor

So that, if a timeout happens and we decide to stop blocking on the operation, the writer will not block when they try to report the result of the operation.

This should address Issue #1780 and it's a follow up for PR #1683, PR #1754 and PR #1772. Also relevant is kubernetes/kubernetes#61926 (which is likely to need to be updated to contain this one after it's merged.)

@derekwaynecarr @sjenning @vikaschoudhary16 @mrunalp

This might need some more testing... Do you have a reproducer that triggers the problem? It would need to trigger the timeout in libcontainer to cause the bug that will later cause the deadlock. If there's a way to make the cgroup unit creation take longer than 1s, that might be enough...

Thanks!
Filipe

@cyphar
Copy link
Member

cyphar commented Apr 14, 2018

You need to add a Signed-off-by: line to your commit(s) which indicates that you attest the Developer Certificate of Origin a statement about your contributions that you must read before signing (don't worry, it's quite short and easy-to-read). You can add it to your commits with git commit --amend -s, and then doing a git push --force.

So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.

This should address Issue opencontainers#1780 and it's a follow up for PR opencontainers#1683,
PR opencontainers#1754 and PR opencontainers#1772.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
@filbranden
Copy link
Contributor Author

You need to add a Signed-off-by: line to your commit(s)

Ah yes sorry about that... I did know about it, I just keep forgetting it (it turns out right now this is the only project I'm contributing to that needs it...) Anyways, fixed now. Thanks!

@sjenning
Copy link
Contributor

sjenning commented Apr 18, 2018

Sorry, I'm not understanding what issue this is fixing. Can someone explain it to me in small words?

@filbranden
Copy link
Contributor Author

Hi @sjenning

If we time out waiting for systemd to reply through D-Bus, then no one will consume any message in statusChan.

Later on, when the D-Bus reply is received, this code in go-systemd will try to update the channel to indicate completion. But as no one is consuming it, it will block forever.

The situation gets worse because the code in go-systemd is holding the c.jobListener lock, so essentially everything in go-systemd gets deadlocked at this point.

This fix here just makes the channel buffered with one slot, so that when go-systemd updates it, it doesn't get blocked. If we timed out, the message is lost (which is fine) but no one gets blocked.

See also the discussion here: kubernetes/kubernetes#61926 (comment), where @derekwaynecarr mentions the jobListener lock (that's how I tracked this down.)

I also tried to summarize the whole situation (including the history of PRs) on #1780, so try to go through that one if it's still unclear to you...

Thanks!
Filipe

@filbranden
Copy link
Contributor Author

Ping @sjenning @derekwaynecarr

I'd like to have this one figured out so we can wrap up kubernetes/kubernetes#61926 update of the vendored libcontainer this week...

Cheers,
Filipe

@crosbymichael
Copy link
Member

crosbymichael commented Apr 24, 2018

LGTM

Approved with PullApprove

1 similar comment
@mrunalp
Copy link
Contributor

mrunalp commented Apr 24, 2018

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 871ba2e into opencontainers:master Apr 24, 2018
filbranden added a commit to filbranden/kubernetes that referenced this pull request Apr 24, 2018
PR opencontainers/runc#1754 works around an issue in manager.Apply(-1) that
makes Kubelet startup hang when using systemd cgroup driver (by adding a
timeout) and further PR opencontainers/runc#1772 fixes that bug by
checking the proper error status before waiting on the channel.

PR opencontainers/runc#1776 checks whether Delegate works in slices,
which keeps libcontainer systemd cgroup driver working on systemd v237+.

PR opencontainers/runc#1781 makes the channel buffered, so if we time
out waiting on the channel, the updater will not block trying to it
since there are no longer any consumers.
filbranden added a commit to filbranden/kubernetes that referenced this pull request Apr 25, 2018
PR opencontainers/runc#1754 works around an issue in manager.Apply(-1) that
makes Kubelet startup hang when using systemd cgroup driver (by adding a
timeout) and further PR opencontainers/runc#1772 fixes that bug by
checking the proper error status before waiting on the channel.

PR opencontainers/runc#1776 checks whether Delegate works in slices,
which keeps libcontainer systemd cgroup driver working on systemd v237+.

PR opencontainers/runc#1781 makes the channel buffered, so if we time
out waiting on the channel, the updater will not block trying to it
since there are no longer any consumers.
mrunalp added a commit to projectatomic/runc that referenced this pull request Jun 12, 2018
mrunalp added a commit to projectatomic/runc that referenced this pull request Jun 12, 2018
@filbranden filbranden deleted the systemd3 branch February 7, 2019 01:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants