Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runc delete: call systemd's reset-failed #3888

Merged
merged 5 commits into from
Jul 6, 2023

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Jun 5, 2023

runc delete is supposed to remove all the container's artefacts. In case systemd cgroup driver is used, and the systemd unit has failed (e.g. oom-killed), systemd won't remove the unit (that is, unless the "CollectMode: inactive-or-failed" property is set).

Call reset-failed from manager.Destroy so the failed unit will be removed during "runc delete".

This fixes Issue A from #3780 (which, in its original form, can only be reproduced with RHEL/CentOS 9 systemd version < 252.14, i.e. before they've added redhat-plumbers/systemd-rhel9#149). A test case that works with any recent systemd version is also added.

@kolyshkin kolyshkin added area/systemd backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 labels Jun 5, 2023
@kolyshkin kolyshkin force-pushed the reset-failed branch 2 times, most recently from f228e7b to 860d7fa Compare June 7, 2023 00:48
@kolyshkin kolyshkin marked this pull request as ready for review June 7, 2023 00:48
@kolyshkin
Copy link
Contributor Author

Here's a bigger picture. Currently, cri-o (possibly containerd, too, I haven't checked) sets CollectMode: inactive-or-failed systemd unit property. This was done for two [somewhat interconnected] reasons:

  1. To prevent accumulating the leftover failed units.
  2. To [indirectly] workaround the runc bug wrt UnitExists (runc systemd cgroup driver logic is wrong #3780, [1.1] Fix systemd cgroup driver's Apply (and make CI green again) #3806).

Since the second reason (runc bug) is fixed in 1.1.6, the only reason is the leftover failed units. In fact, it happens because runc delete fails to remove the failed systemd unit. This is what this PR fixes.

Retrospectively, setting inactive-or-failed was not a good solution, because it is not possible to get failed systemd unit status (such as "OOM-killed"). In cri-o world, OOM detection is performed by conmon (or conmon-rs), which is racing with systemd, which removes the container's cgroup and unit. As a result, sometimes conmon fails to detect OOM-kill.

The solution to this race is to check the systemd unit status, but due to the above CollectMode: inactive-or-failed setting, the unit is being removed so there's no way to find out what happens (aside from reading systemd logs, which is ugly).

@kolyshkin
Copy link
Contributor Author

@opencontainers/runc-maintainers PTAL

@kolyshkin kolyshkin marked this pull request as draft June 9, 2023 22:21
@kolyshkin
Copy link
Contributor Author

going to add an int test first

@kolyshkin
Copy link
Contributor Author

@opencontainers/runc-maintainers PTAL

@kolyshkin
Copy link
Contributor Author

I feel like his one needs to be backported to 1.1

There is no such thing as linux.resources.memorySwap (the mem+swap is
set as linux.resources.memory.swap).

As it is not used in this test anyway, remove it.

Fixes: 4929c05
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sometimes we call resetFailedUnit as a cleanup measure, and we don't
care if it fails or not. So, move error reporting to its callers, and
ignore error in cases we don't really expect it to succeed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
runc delete is supposed to remove all the container's artefacts.
In case systemd cgroup driver is used, and the systemd unit has failed
(e.g. oom-killed), systemd won't remove the unit (that is, unless the
"CollectMode: inactive-or-failed" property is set).

Call reset-failed from manager.Destroy so the failed unit will be
removed during "runc delete".

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The passing run (with the fix) looks like this:

----
delete.bats
 ✓ runc delete removes failed systemd unit [4556]
   runc spec (status=0):

   runc run -d --console-socket /tmp/bats-run-B08vu1/runc.lbQwU5/tty/sock test-failed-unit (status=0):

   Warning: The unit file, source configuration file or drop-ins of runc-cgroups-integration-test-12869.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
   × runc-cgroups-integration-test-12869.scope - libcontainer container integration-test-12869
        Loaded: loaded (/run/systemd/transient/runc-cgroups-integration-test-12869.scope; transient)
     Transient: yes
       Drop-In: /run/systemd/transient/runc-cgroups-integration-test-12869.scope.d
                └─50-DevicePolicy.conf, 50-DeviceAllow.conf
        Active: failed (Result: timeout) since Tue 2023-06-13 14:41:38 PDT; 751ms ago
      Duration: 2.144s
           CPU: 8ms

   Jun 13 14:41:34 kir-rhat systemd[1]: Started runc-cgroups-integration-test-12869.scope - libcontainer container integration-test-12869.
   Jun 13 14:41:37 kir-rhat systemd[1]: runc-cgroups-integration-test-12869.scope: Scope reached runtime time limit. Stopping.
   Jun 13 14:41:38 kir-rhat systemd[1]: runc-cgroups-integration-test-12869.scope: Stopping timed out. Killing.
   Jun 13 14:41:38 kir-rhat systemd[1]: runc-cgroups-integration-test-12869.scope: Killing process 1107438 (sleep) with signal SIGKILL.
   Jun 13 14:41:38 kir-rhat systemd[1]: runc-cgroups-integration-test-12869.scope: Failed with result 'timeout'.
   runc delete test-failed-unit (status=0):

   Unit runc-cgroups-integration-test-12869.scope could not be found.
----

Before the fix, the test was failing like this:

----
delete.bats
 ✗ runc delete removes failed systemd unit
   (in test file tests/integration/delete.bats, line 194)
     `run -4 systemctl status "$SD_UNIT_NAME"' failed, expected exit code 4, got 3
  ....
----

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

@opencontainers/runc-maintainers PTAL (want to include the backport into runc v1.1.8). This is relatively easy to review.

Copy link
Contributor

@hqhq hqhq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mrunalp mrunalp merged commit 369ad5a into opencontainers:main Jul 6, 2023
@kolyshkin kolyshkin added backport/1.1-done A PR in main branch which has been backported to release-1.1 and removed backport/1.1-todo A PR in main branch which needs to be backported to release-1.1 labels Jul 8, 2023
kolyshkin added a commit to kolyshkin/crun that referenced this pull request Sep 6, 2023
According to the OCI runtime spec, runtime's delete is supposed to
remove all the container's artefacts.

In case systemd cgroup driver is used, and the systemd unit has failed
(e.g. oom-killed), systemd won't remove the unit (that is, unless the
"CollectMode: inactive-or-failed" property is set).

Leaving a leftover failed unit is a violation of runtime spec; in
addition, a leftover unit result in inability to start a container with
the same systemd unit name (such operation will fail with "unit already
exists" error).

Call reset-failed from systemd's cgroup manager destroy_cgroup call,
so the failed unit will be removed (by systemd) after "crun delete".

This change is similar to the one in runc (see [1]). A (slightly
modified) test case from runc added by the above change was used to
check that the bug is fixed.

For bigger picture, see [2] (issue A) and [3].

To test manually, systemd >= 244 is needed. Create a container config
that runs "sleep 10" and has the following systemd annotations:

	org.systemd.property.RuntimeMaxUSec: "uint64 2000000"
	org.systemd.property.TimeoutStopUSec: "uint64 1000000"

Start a container using --systemd-cgroup option.

The container will be killed by systemd in 2 seconds, thus its systemd
unit status will be "failed". Once it has failed, the "systemctl status
$UNIT_NAME" should have exit code of 3 (meaning "unit is not active").

Now, run "crun delete $CTID" and repeat "systemctl status $UNIT_NAME".
It should result in exit code of 4 (meaning "no such unit").

[1] opencontainers/runc#3888
[2] opencontainers/runc#3780
[3] cri-o/cri-o#7035

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
kolyshkin added a commit to kolyshkin/crun that referenced this pull request Sep 6, 2023
According to the OCI runtime spec [1], runtime's delete is supposed to
remove all the container's artefacts.

In case systemd cgroup driver is used, and the systemd unit has failed
(e.g. oom-killed), systemd won't remove the unit (that is, unless the
"CollectMode: inactive-or-failed" property is set).

Leaving a leftover failed unit is a violation of runtime spec; in
addition, a leftover unit result in inability to start a container with
the same systemd unit name (such operation will fail with "unit already
exists" error).

Call reset-failed from systemd's cgroup manager destroy_cgroup call,
so the failed unit will be removed (by systemd) after "crun delete".

This change is similar to the one in runc (see [2]). A (slightly
modified) test case from runc added by the above change was used to
check that the bug is fixed.

For bigger picture, see [3] (issue A) and [4].

To test manually, systemd >= 244 is needed. Create a container config
that runs "sleep 10" and has the following systemd annotations:

	org.systemd.property.RuntimeMaxUSec: "uint64 2000000"
	org.systemd.property.TimeoutStopUSec: "uint64 1000000"

Start a container using --systemd-cgroup option.

The container will be killed by systemd in 2 seconds, thus its systemd
unit status will be "failed". Once it has failed, the "systemctl status
$UNIT_NAME" should have exit code of 3 (meaning "unit is not active").

Now, run "crun delete $CTID" and repeat "systemctl status $UNIT_NAME".
It should result in exit code of 4 (meaning "no such unit").

[1] https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#delete
[2] opencontainers/runc#3888
[3] opencontainers/runc#3780
[4] cri-o/cri-o#7035

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/systemd backport/1.1-done A PR in main branch which has been backported to release-1.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants