Make channel for StartTransientUnit buffered #1781
Conversation
You need to add a Signed-off-by line to your commit.
So that, if a timeout happens and we decide to stop blocking on the operation, the writer will not block when it tries to report the result of the operation. This should address Issue opencontainers#1780 and is a follow-up to PR opencontainers#1683, PR opencontainers#1754 and PR opencontainers#1772. Signed-off-by: Filipe Brandenburger <filbranden@google.com>
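For reference, the fix boils down to giving the completion channel a one-element buffer before handing it to go-systemd's StartTransientUnit. A minimal sketch of that pattern (not the actual libcontainer code; the unit name and the nil property list are placeholders, and running it requires a live systemd and suitable permissions):

```go
package main

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

func main() {
	conn, err := systemdDbus.New()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// One-element buffer: if we stop waiting below, go-systemd's late
	// write into this channel still succeeds instead of blocking forever.
	statusChan := make(chan string, 1)

	// "demo.scope" and the nil property list are placeholders for whatever
	// transient unit the caller actually creates.
	if _, err := conn.StartTransientUnit("demo.scope", "replace", nil, statusChan); err != nil {
		panic(err)
	}

	select {
	case status := <-statusChan:
		fmt.Println("unit started with status:", status)
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for systemd")
	}
}
```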
Ah yes, sorry about that... I did know about it, I just keep forgetting it (it turns out right now this is the only project I'm contributing to that needs it...) Anyways, fixed now. Thanks!
Sorry, I'm not understanding what issue this is fixing. Can someone explain it to me in small words?
Hi @sjenning. If we time out waiting for systemd to reply through D-Bus, then no one will consume any message sent on the channel we passed to StartTransientUnit.

Later on, when the D-Bus reply is received, the code in go-systemd will try to write to that channel to indicate completion. But as no one is consuming it, it will block forever. The situation gets worse because the code in go-systemd is holding the jobListener lock while it does that, so further operations that need that lock also get stuck.

This fix here just makes the channel buffered with one slot, so that when go-systemd writes to it, it doesn't get blocked. If we timed out, the message is lost (which is fine) but no one gets blocked.

See also the discussion here: kubernetes/kubernetes#61926 (comment), where @derekwaynecarr mentions the jobListener lock (that's how I tracked this down.) I also tried to summarize the whole situation (including the history of PRs) on #1780, so try to go through that one if it's still unclear to you...

Thanks!
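To make the failure mode concrete, here is a small self-contained sketch of the pattern described above; the names (jobLock, fakeStartUnit, statusChan) are illustrative, not the real go-systemd identifiers:

```go
// Minimal sketch of the blocking pattern described above (names are
// illustrative, not the actual libcontainer/go-systemd identifiers).
package main

import (
	"fmt"
	"sync"
	"time"
)

var jobLock sync.Mutex // stands in for go-systemd's jobListener lock

// fakeStartUnit simulates go-systemd delivering the D-Bus result on ch.
func fakeStartUnit(ch chan string) {
	go func() {
		time.Sleep(2 * time.Second) // D-Bus reply arrives "late"
		jobLock.Lock()
		defer jobLock.Unlock()
		ch <- "done" // blocks forever if ch is unbuffered and nobody reads
	}()
}

func main() {
	// Buffered with one slot: the late write above succeeds even though
	// we gave up waiting, so the lock is released and nothing deadlocks.
	// With `make(chan string)` (unbuffered) the goroutine above would
	// block on the send while still holding jobLock.
	statusChan := make(chan string, 1)
	fakeStartUnit(statusChan)

	select {
	case s := <-statusChan:
		fmt.Println("got reply:", s)
	case <-time.After(1 * time.Second):
		fmt.Println("timed out waiting for reply")
	}

	time.Sleep(3 * time.Second) // give the goroutine time to finish cleanly
}
```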
Ping @sjenning @derekwaynecarr I'd like to have this one figured out so we can wrap up the kubernetes/kubernetes#61926 update of the vendored libcontainer this week... Cheers,
PR opencontainers/runc#1754 works around an issue in manager.Apply(-1) that makes Kubelet startup hang when using the systemd cgroup driver (by adding a timeout), and PR opencontainers/runc#1772 further fixes that bug by checking the proper error status before waiting on the channel. PR opencontainers/runc#1776 checks whether Delegate works in slices, which keeps the libcontainer systemd cgroup driver working on systemd v237+. PR opencontainers/runc#1781 makes the channel buffered, so if we time out waiting on the channel, the updater will not block trying to write to it, since there are no longer any consumers.
opencontainers/runc#1683 opencontainers/runc#1754 opencontainers/runc#1772 opencontainers/runc#1781 Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
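Taken together, the caller-side pattern those PRs converge on looks roughly like the sketch below (hypothetical names; startUnit stands in for go-systemd's StartTransientUnit, and the 1s timeout and "done" check mirror what the commit message describes):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// startUnit is a stand-in for go-systemd's Conn.StartTransientUnit: it reports
// the result of the asynchronous job on statusChan. Purely illustrative.
func startUnit(name string, statusChan chan<- string) (int, error) {
	go func() {
		time.Sleep(100 * time.Millisecond)
		statusChan <- "done"
	}()
	return 1, nil
}

// startTransientUnit shows the combined caller-side pattern: check the error
// first (#1772), only then wait on the (buffered, #1781) channel, and give up
// after a timeout (#1754).
func startTransientUnit(name string) error {
	statusChan := make(chan string, 1)
	if _, err := startUnit(name, statusChan); err != nil {
		// The call itself failed: nothing will ever be written to the
		// channel, so don't wait on it.
		return err
	}
	select {
	case s := <-statusChan:
		if s != "done" {
			return fmt.Errorf("error creating systemd unit %q: got %q", name, s)
		}
		return nil
	case <-time.After(time.Second):
		return errors.New("timeout while waiting for unit to start")
	}
}

func main() {
	fmt.Println(startTransientUnit("demo.scope"))
}
```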
So that, if a timeout happens and we decide to stop blocking on the operation, the writer will not block when it tries to report the result of the operation.
This should address Issue #1780 and is a follow-up to PR #1683, PR #1754 and PR #1772. Also relevant is kubernetes/kubernetes#61926 (which will likely need to be updated to include this one after it's merged.)
@derekwaynecarr @sjenning @vikaschoudhary16 @mrunalp
This might need some more testing... Do you have a reproducer that triggers the problem? It would need to trigger the timeout in libcontainer to cause the bug that will later cause the deadlock. If there's a way to make the cgroup unit creation take longer than 1s, that might be enough...
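One way to exercise the timeout path without a genuinely slow systemd might be a unit test that injects a delayed writer, along these lines (a hypothetical test, not something that exists in the tree; the channel and delay are made up):

```go
package systemd_test

import (
	"testing"
	"time"
)

// TestSlowReplyDoesNotBlockSender simulates a D-Bus reply that arrives after
// the 1s timeout and checks that the late send does not block, which is the
// scenario this PR is about.
func TestSlowReplyDoesNotBlockSender(t *testing.T) {
	statusChan := make(chan string, 1)
	sent := make(chan struct{})

	go func() {
		time.Sleep(1500 * time.Millisecond) // "unit creation" slower than 1s
		statusChan <- "done"                // must not block with a buffered channel
		close(sent)
	}()

	select {
	case <-statusChan:
		t.Fatal("expected to time out before the reply arrived")
	case <-time.After(time.Second):
		// Timed out as expected; now make sure the sender still completes.
	}

	select {
	case <-sent:
		// The late write went through: no deadlock.
	case <-time.After(2 * time.Second):
		t.Fatal("sender blocked on the channel after we timed out")
	}
}
```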
Thanks!
Filipe