Make channel for StartTransientUnit buffered #1781
Conversation
You need to add a Signed-off-by line to your commit.
So that, if a timeout happens and we decide to stop blocking on the operation, the writer will not block when it tries to report the result of the operation. This should address Issue opencontainers#1780 and is a follow-up to PR opencontainers#1683, PR opencontainers#1754 and PR opencontainers#1772. Signed-off-by: Filipe Brandenburger <filbranden@google.com>
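For reference, the fix boils down to giving the completion channel a one-element buffer before handing it to go-systemd's StartTransientUnit. A minimal sketch of that pattern (not the actual libcontainer code; the unit name and the nil property list are placeholders, and running it requires a live systemd and suitable permissions):

```go
package main

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

func main() {
	conn, err := systemdDbus.New()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// One-element buffer: if we stop waiting below, go-systemd's late
	// write into this channel still succeeds instead of blocking forever.
	statusChan := make(chan string, 1)

	// "demo.scope" and the nil property list are placeholders for whatever
	// transient unit the caller actually creates.
	if _, err := conn.StartTransientUnit("demo.scope", "replace", nil, statusChan); err != nil {
		panic(err)
	}

	select {
	case status := <-statusChan:
		fmt.Println("unit started with status:", status)
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for systemd")
	}
}
```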
Ah yes, sorry about that... I did know about it, I just keep forgetting it (it turns out right now this is the only project I'm contributing to that needs it...) Anyways, fixed now. Thanks!
Sorry, I'm not understanding what issue this is fixing. Can someone explain it to me in small words?
Hi @sjenning. If we time out waiting for systemd to reply through D-Bus, then no one will consume any message sent on the channel we passed to StartTransientUnit.

Later on, when the D-Bus reply is received, the code in go-systemd will try to write to that channel to indicate completion. But as no one is consuming it, it will block forever. The situation gets worse because the code in go-systemd is holding the jobListener lock while it does that, so further operations that need that lock also get stuck.

This fix here just makes the channel buffered with one slot, so that when go-systemd writes to it, it doesn't get blocked. If we timed out, the message is lost (which is fine) but no one gets blocked.

See also the discussion here: kubernetes/kubernetes#61926 (comment), where @derekwaynecarr mentions the jobListener lock (that's how I tracked this down.) I also tried to summarize the whole situation (including the history of PRs) on #1780, so try to go through that one if it's still unclear to you...

Thanks!
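To make the failure mode concrete, here is a small self-contained sketch of the pattern described above; the names (jobLock, fakeStartUnit, statusChan) are illustrative, not the real go-systemd identifiers:

```go
// Minimal sketch of the blocking pattern described above (names are
// illustrative, not the actual libcontainer/go-systemd identifiers).
package main

import (
	"fmt"
	"sync"
	"time"
)

var jobLock sync.Mutex // stands in for go-systemd's jobListener lock

// fakeStartUnit simulates go-systemd delivering the D-Bus result on ch.
func fakeStartUnit(ch chan string) {
	go func() {
		time.Sleep(2 * time.Second) // D-Bus reply arrives "late"
		jobLock.Lock()
		defer jobLock.Unlock()
		ch <- "done" // blocks forever if ch is unbuffered and nobody reads
	}()
}

func main() {
	// Buffered with one slot: the late write above succeeds even though
	// we gave up waiting, so the lock is released and nothing deadlocks.
	// With `make(chan string)` (unbuffered) the goroutine above would
	// block on the send while still holding jobLock.
	statusChan := make(chan string, 1)
	fakeStartUnit(statusChan)

	select {
	case s := <-statusChan:
		fmt.Println("got reply:", s)
	case <-time.After(1 * time.Second):
		fmt.Println("timed out waiting for reply")
	}

	time.Sleep(3 * time.Second) // give the goroutine time to finish cleanly
}
```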
Ping @sjenning @derekwaynecarr I'd like to have this one figured out so we can wrap up the kubernetes/kubernetes#61926 update of the vendored libcontainer this week... Cheers,
PR opencontainers/runc#1754 works around an issue in manager.Apply(-1) that makes Kubelet startup hang when using the systemd cgroup driver (by adding a timeout), and PR opencontainers/runc#1772 further fixes that bug by checking the proper error status before waiting on the channel. PR opencontainers/runc#1776 checks whether Delegate works in slices, which keeps the libcontainer systemd cgroup driver working on systemd v237+. PR opencontainers/runc#1781 makes the channel buffered, so if we time out waiting on the channel, the updater will not block trying to write to it, since there are no longer any consumers.
opencontainers/runc#1683 opencontainers/runc#1754 opencontainers/runc#1772 opencontainers/runc#1781 Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
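Taken together, the caller-side pattern those PRs converge on looks roughly like the sketch below (hypothetical names; startUnit stands in for go-systemd's StartTransientUnit, and the 1s timeout and "done" check mirror what the commit message describes):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// startUnit is a stand-in for go-systemd's Conn.StartTransientUnit: it reports
// the result of the asynchronous job on statusChan. Purely illustrative.
func startUnit(name string, statusChan chan<- string) (int, error) {
	go func() {
		time.Sleep(100 * time.Millisecond)
		statusChan <- "done"
	}()
	return 1, nil
}

// startTransientUnit shows the combined caller-side pattern: check the error
// first (#1772), only then wait on the (buffered, #1781) channel, and give up
// after a timeout (#1754).
func startTransientUnit(name string) error {
	statusChan := make(chan string, 1)
	if _, err := startUnit(name, statusChan); err != nil {
		// The call itself failed: nothing will ever be written to the
		// channel, so don't wait on it.
		return err
	}
	select {
	case s := <-statusChan:
		if s != "done" {
			return fmt.Errorf("error creating systemd unit %q: got %q", name, s)
		}
		return nil
	case <-time.After(time.Second):
		return errors.New("timeout while waiting for unit to start")
	}
}

func main() {
	fmt.Println(startTransientUnit("demo.scope"))
}
```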
So that, if a timeout happens and we decide to stop blocking on the operation, the writer will not block when it tries to report the result of the operation.
This should address Issue #1780 and is a follow-up to PR #1683, PR #1754 and PR #1772. Also relevant is kubernetes/kubernetes#61926 (which will likely need to be updated to include this one after it's merged.)
@derekwaynecarr @sjenning @vikaschoudhary16 @mrunalp
This might need some more testing... Do you have a reproducer that triggers the problem? It would need to trigger the timeout in libcontainer to cause the bug that will later cause the deadlock. If there's a way to make the cgroup unit creation take longer than 1s, that might be enough...
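One way to exercise the timeout path without a genuinely slow systemd might be a unit test that injects a delayed writer, along these lines (a hypothetical test, not something that exists in the tree; the channel and delay are made up):

```go
package systemd_test

import (
	"testing"
	"time"
)

// TestSlowReplyDoesNotBlockSender simulates a D-Bus reply that arrives after
// the 1s timeout and checks that the late send does not block, which is the
// scenario this PR is about.
func TestSlowReplyDoesNotBlockSender(t *testing.T) {
	statusChan := make(chan string, 1)
	sent := make(chan struct{})

	go func() {
		time.Sleep(1500 * time.Millisecond) // "unit creation" slower than 1s
		statusChan <- "done"                // must not block with a buffered channel
		close(sent)
	}()

	select {
	case <-statusChan:
		t.Fatal("expected to time out before the reply arrived")
	case <-time.After(time.Second):
		// Timed out as expected; now make sure the sender still completes.
	}

	select {
	case <-sent:
		// The late write went through: no deadlock.
	case <-time.After(2 * time.Second):
		t.Fatal("sender blocked on the channel after we timed out")
	}
}
```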
Thanks!
Filipe