runc delete: call systemd's reset-failed #3888

kolyshkin · 2023-06-05T19:57:27Z

runc delete is supposed to remove all the container's artefacts. If the systemd cgroup driver is used and the systemd unit has failed (e.g. oom-killed), systemd won't remove the unit (that is, unless the "CollectMode: inactive-or-failed" property is set).

Call reset-failed from manager.Destroy so the failed unit will be removed during "runc delete".

This fixes Issue A from #3780 (which, in its original form, can only be reproduced with RHEL/CentOS 9 systemd version < 252.14, i.e. before they've added redhat-plumbers/systemd-rhel9#149). A test case that works with any recent systemd version is also added.
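A minimal sketch of the idea, not runc's actual code: `unitConn`, `fakeSystemd` and `destroyUnit` are hypothetical names introduced for illustration, and the fake only models the behaviour described above (a failed unit stays loaded until reset-failed is called on it). The real driver talks to systemd over D-Bus.

```go
package main

import (
	"errors"
	"fmt"
)

// unitConn is a hypothetical stand-in for a systemd D-Bus connection.
type unitConn interface {
	StopUnit(name string) error
	ResetFailedUnit(name string) error
}

// fakeSystemd models only what matters here: a failed unit stays
// loaded until reset-failed is called on it.
type fakeSystemd struct {
	units map[string]string // unit name -> state ("active" or "failed")
}

func (f *fakeSystemd) StopUnit(name string) error {
	if f.units[name] == "active" {
		delete(f.units, name) // a cleanly stopped transient unit goes away
	}
	// A failed (or missing) unit is left as is.
	return nil
}

func (f *fakeSystemd) ResetFailedUnit(name string) error {
	if f.units[name] != "failed" {
		return errors.New("unit is not in failed state")
	}
	delete(f.units, name)
	return nil
}

// destroyUnit sketches the delete path: stop the unit, then reset its
// failed state so that systemd can remove it.
func destroyUnit(c unitConn, name string) error {
	if err := c.StopUnit(name); err != nil {
		return err
	}
	// Cleanup only: most units are not failed, so an error here is
	// expected and deliberately ignored.
	_ = c.ResetFailedUnit(name)
	return nil
}

func main() {
	sd := &fakeSystemd{units: map[string]string{"runc-test.scope": "failed"}}
	if err := destroyUnit(sd, "runc-test.scope"); err != nil {
		fmt.Println("destroy failed:", err)
		return
	}
	_, left := sd.units["runc-test.scope"]
	fmt.Println("unit still loaded:", left) // unit still loaded: false
}
```

Without the `ResetFailedUnit` call, the failed unit in this model would stay in the map, which mirrors the leftover unit seen after "runc delete".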

kolyshkin · 2023-06-07T01:44:33Z

Here's the bigger picture. Currently, cri-o (possibly containerd, too, I haven't checked) sets the CollectMode: inactive-or-failed systemd unit property. This was done for two [somewhat interconnected] reasons:

1. To prevent accumulating the leftover failed units.
2. To [indirectly] work around the runc bug wrt UnitExists (runc systemd cgroup driver logic is wrong #3780, [1.1] Fix systemd cgroup driver's Apply (and make CI green again) #3806).

Since the second reason (runc bug) is fixed in 1.1.6, the only remaining reason is the leftover failed units. These accumulate because runc delete fails to remove the failed systemd unit, which is what this PR fixes.

In retrospect, setting inactive-or-failed was not a good solution, because it makes it impossible to get the status of a failed systemd unit (such as "OOM-killed"). In the cri-o world, OOM detection is performed by conmon (or conmon-rs), which races with systemd as it removes the container's cgroup and unit. As a result, conmon sometimes fails to detect an OOM kill.

The solution to this race is to check the systemd unit status, but due to the above CollectMode: inactive-or-failed setting, the unit is removed, so there's no way to find out what happened (aside from reading systemd logs, which is ugly).
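To illustrate what checking the unit status could look like, here is a minimal sketch that interprets the output of `systemctl show -p Result UNIT`, which prints a line such as `Result=oom-kill`. The function names are hypothetical, and this is not how cri-o or conmon do it; it only works while the failed unit is still loaded.

```go
package main

import (
	"fmt"
	"strings"
)

// unitResult extracts the value of the Result property from
// "systemctl show" style output (lines of KEY=VALUE).
func unitResult(showOutput string) (string, bool) {
	for _, line := range strings.Split(showOutput, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "Result=") {
			return strings.TrimPrefix(line, "Result="), true
		}
	}
	return "", false
}

// wasOOMKilled reports whether the unit failed because of an OOM kill.
func wasOOMKilled(showOutput string) bool {
	res, ok := unitResult(showOutput)
	return ok && res == "oom-kill"
}

func main() {
	fmt.Println(wasOOMKilled("Result=oom-kill\n")) // true
	fmt.Println(wasOOMKilled("Result=timeout\n"))  // false
}
```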

kolyshkin · 2023-06-09T19:31:04Z

@opencontainers/runc-maintainers PTAL

kolyshkin · 2023-06-09T22:21:49Z

Going to add an integration test first.

kolyshkin · 2023-06-27T15:34:49Z

@opencontainers/runc-maintainers PTAL

kolyshkin · 2023-06-27T15:35:25Z

I feel like this one needs to be backported to 1.1

There is no such thing as linux.resources.memorySwap (the mem+swap is set as linux.resources.memory.swap). As it is not used in this test anyway, remove it. Fixes: 4929c05 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

Sometimes we call resetFailedUnit as a cleanup measure, and we don't care if it fails or not. So, move error reporting to its callers, and ignore error in cases we don't really expect it to succeed. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
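The pattern can be sketched as follows. This is an illustration, not the code from the commit: `resetFailed`, `cleanupBeforeStart` and `resetAndReport` are hypothetical names, and the `reset` argument stands in for the real systemd call.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotLoaded is a sample error used by the demo below.
var errNotLoaded = errors.New("unit not loaded")

// resetFailed only performs the operation and returns the error as is;
// it does no logging of its own.
func resetFailed(reset func(string) error, unit string) error {
	return reset(unit)
}

// cleanupBeforeStart is a best-effort caller: it does not expect the
// reset to succeed, so the error is deliberately dropped.
func cleanupBeforeStart(reset func(string) error, unit string) {
	_ = resetFailed(reset, unit)
}

// resetAndReport is a caller that does care about the outcome and
// adds context to the error.
func resetAndReport(reset func(string) error, unit string) error {
	if err := resetFailed(reset, unit); err != nil {
		return fmt.Errorf("unable to reset failed unit %s: %w", unit, err)
	}
	return nil
}

func main() {
	failing := func(string) error { return errNotLoaded }
	cleanupBeforeStart(failing, "demo.scope") // error ignored
	fmt.Println(resetAndReport(failing, "demo.scope"))
}
```

Keeping the helper silent lets each call site choose between ignoring, logging and returning the error.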

runc delete is supposed to remove all the container's artefacts. In case systemd cgroup driver is used, and the systemd unit has failed (e.g. oom-killed), systemd won't remove the unit (that is, unless the "CollectMode: inactive-or-failed" property is set). Call reset-failed from manager.Destroy so the failed unit will be removed during "runc delete". Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

The passing run (with the fix) looks like this:

----
delete.bats
 ✓ runc delete removes failed systemd unit [4556]
   runc spec (status=0):

   runc run -d --console-socket /tmp/bats-run-B08vu1/runc.lbQwU5/tty/sock test-failed-unit (status=0):

   Warning: The unit file, source configuration file or drop-ins of runc-cgroups-integration-test-12869.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
   × runc-cgroups-integration-test-12869.scope - libcontainer container integration-test-12869
        Loaded: loaded (/run/systemd/transient/runc-cgroups-integration-test-12869.scope; transient)
     Transient: yes
       Drop-In: /run/systemd/transient/runc-cgroups-integration-test-12869.scope.d
                └─50-DevicePolicy.conf, 50-DeviceAllow.conf
        Active: failed (Result: timeout) since Tue 2023-06-13 14:41:38 PDT; 751ms ago
      Duration: 2.144s
           CPU: 8ms

   Jun 13 14:41:34 kir-rhat systemd[1]: Started runc-cgroups-integration-test-12869.scope - libcontainer container integration-test-12869.
   Jun 13 14:41:37 kir-rhat systemd[1]: runc-cgroups-integration-test-12869.scope: Scope reached runtime time limit. Stopping.
   Jun 13 14:41:38 kir-rhat systemd[1]: runc-cgroups-integration-test-12869.scope: Stopping timed out. Killing.
   Jun 13 14:41:38 kir-rhat systemd[1]: runc-cgroups-integration-test-12869.scope: Killing process 1107438 (sleep) with signal SIGKILL.
   Jun 13 14:41:38 kir-rhat systemd[1]: runc-cgroups-integration-test-12869.scope: Failed with result 'timeout'.
   runc delete test-failed-unit (status=0):

   Unit runc-cgroups-integration-test-12869.scope could not be found.
----

Before the fix, the test was failing like this:

----
delete.bats
 ✗ runc delete removes failed systemd unit
   (in test file tests/integration/delete.bats, line 194)
     `run -4 systemctl status "$SD_UNIT_NAME"' failed, expected exit code 4, got 3
   ....
----

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

kolyshkin · 2023-06-28T16:30:33Z

@opencontainers/runc-maintainers PTAL (want to include the backport into runc v1.1.8). This is relatively easy to review.

hqhq

LGTM



According to the OCI runtime spec [1], runtime's delete is supposed to remove all the container's artefacts. If the systemd cgroup driver is used and the systemd unit has failed (e.g. oom-killed), systemd won't remove the unit (that is, unless the "CollectMode: inactive-or-failed" property is set).

Leaving a leftover failed unit is a violation of the runtime spec; in addition, a leftover unit results in the inability to start a container with the same systemd unit name (such an operation will fail with a "unit already exists" error).

Call reset-failed from systemd's cgroup manager destroy_cgroup call, so the failed unit will be removed (by systemd) after "crun delete".

This change is similar to the one in runc (see [2]). A (slightly modified) test case from runc added by the above change was used to check that the bug is fixed.

For the bigger picture, see [3] (issue A) and [4].

To test manually, systemd >= 244 is needed. Create a container config that runs "sleep 10" and has the following systemd annotations:

    org.systemd.property.RuntimeMaxUSec: "uint64 2000000"
    org.systemd.property.TimeoutStopUSec: "uint64 1000000"

Start a container using the --systemd-cgroup option. The container will be killed by systemd in 2 seconds, thus its systemd unit status will be "failed". Once it has failed, "systemctl status $UNIT_NAME" should have an exit code of 3 (meaning "unit is not active").

Now, run "crun delete $CTID" and repeat "systemctl status $UNIT_NAME". It should result in an exit code of 4 (meaning "no such unit").

[1] https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#delete
[2] opencontainers/runc#3888
[3] opencontainers/runc#3780
[4] cri-o/cri-o#7035

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
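The check in the manual test above boils down to interpreting the exit code of "systemctl status". A minimal sketch, with hypothetical function names, that maps the two codes mentioned in the text (plus 0 for an active unit, as documented in the systemctl man page) and runs the command via `os/exec`. It assumes `systemctl` is in PATH and is not part of crun or runc.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// describeStatus maps a "systemctl status" exit code to its meaning.
// Only the codes discussed above (and 0) are covered.
func describeStatus(code int) string {
	switch code {
	case 0:
		return "unit is active"
	case 3:
		return "unit is not active"
	case 4:
		return "no such unit"
	default:
		return "other"
	}
}

// statusCode runs "systemctl status UNIT" and returns its exit code.
func statusCode(unit string) (int, error) {
	err := exec.Command("systemctl", "status", unit).Run()
	var ee *exec.ExitError
	if errors.As(err, &ee) {
		return ee.ExitCode(), nil
	}
	if err != nil {
		return -1, err // e.g. systemctl could not be started
	}
	return 0, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: unitstatus UNIT")
		return
	}
	code, err := statusCode(os.Args[1])
	if err != nil {
		fmt.Println("cannot run systemctl:", err)
		return
	}
	fmt.Printf("exit code %d: %s\n", code, describeStatus(code))
}
```

Run it once after the unit has failed and once after the delete; with the fix, the second run should report "no such unit".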

kolyshkin added the area/systemd and backport/1.1-todo labels Jun 5, 2023

kolyshkin mentioned this pull request Jun 5, 2023

test: simplify and fix metrics oom test cri-o/cri-o#6973

Closed

kolyshkin force-pushed the reset-failed branch 2 times, most recently from f228e7b to 860d7fa June 7, 2023 00:48

kolyshkin marked this pull request as ready for review June 7, 2023 00:48

kolyshkin marked this pull request as draft June 9, 2023 22:21

kolyshkin force-pushed the reset-failed branch from 860d7fa to 51b09eb June 13, 2023 00:34

kolyshkin mentioned this pull request Jun 13, 2023

OOM detection is racy when using systemd cgroup driver cri-o/cri-o#7035

Closed

kolyshkin force-pushed the reset-failed branch 3 times, most recently from 6196fa1 to 6ef989c June 14, 2023 00:26

kolyshkin marked this pull request as ready for review June 14, 2023 01:28

kolyshkin added 5 commits June 28, 2023 09:28

tests/int/cgroups: remove useless/wrong setting

dacb3aa


tests/int: add/use "requires systemd_vNNN"

58a811f


kolyshkin force-pushed the reset-failed branch from 6ef989c to ad040b1 June 28, 2023 16:28

hqhq approved these changes Jun 29, 2023


lifubang reviewed Jul 3, 2023


libcontainer/cgroups/systemd/common.go

mrunalp approved these changes Jul 6, 2023


mrunalp merged commit 369ad5a into opencontainers:main Jul 6, 2023

kolyshkin mentioned this pull request Jul 8, 2023

[1.1] runc delete: call systemd's reset-failed #3932

Merged

kolyshkin added the backport/1.1-done label and removed the backport/1.1-todo label Jul 8, 2023

kolyshkin mentioned this pull request Sep 6, 2023

crun delete: call systemd's reset-failed containers/crun#1295

Merged
