Fix runc kill and runc delete for containers with no init and no private PID namespace #4102
Conversation
Force-pushed 6bc2a31 to ec62f74
@lifubang PTAL
Too early; need to implement support for
70a96a0
to
5fa0188
Compare
# Start a few more processes.
for _ in 1 2 3 4 5; do
	__runc exec -d test_busybox sleep 1h
	[ "$status" -eq 0 ]
There was a bug in the original test case code -- since we're not using bats' `run`, there is no `$status` to check (essentially, this checks the status from the last `run` call, which is the `runc run ...` above).
If needed, I can separate this (and other fixes to the test case) to a separate commit, for clarity.
I think it's okay to keep it in the same commit. It's a no-op change anyway (the deleted check could never fail, for the reason you described).
# Can't mount real /proc when rootless + no pidns,
# so change it to a bind-mounted one from the host.
Added this to explain why this is needed, because I forgot.
Title edited: "runc kill for containers with no init and no private PID namespace" → "runc kill and runc delete for containers with no init and no private PID namespace"
I think we should drop the last patch. Otherwise, seems okay.
libcontainer/state_linux.go (Outdated)
// Normally, when a container init is gone, all other processes in its
// cgroup are killed by the kernel. This is not the case for a shared
// PID namespace container, which may have some leftover processes.
if !b.c.config.Namespaces.IsPrivate(configs.NEWPID) {
	// Get container pids on a best-effort basis.
	if pids, _ := b.c.cgroupManager.GetAllPids(); len(pids) > 0 {
		return fmt.Errorf("container is in stopped state (init is gone), but its cgroup still has %d processes; use runc delete -f or runc kill", len(pids))
	}
}
So, I understand the reasoning behind this -- by some definition the container is still "alive", because the pid1 is not special for non-pidns containers. And you cannot `runc delete` a container that is still alive; you need to `runc kill` it or use `runc delete -f`. So there is an argument that this is more consistent.
However, runc considers the container to be stopped because we bind the lifecycle of the container to the pid1 (as evidenced by `runc state` and the fact that this code was added for `stoppedState`). From that perspective, it's more consistent that any container which is "stopped" (according to `runc state`) should be able to be deleted with `runc delete` (no `-f`) except in weird circumstances.
IMHO, the latter is more consistent, primarily because (in my mind) killing the init process of a container means you've conceptually killed the container by definition (regardless of whether you have a separate pidns), and thus any remaining processes in the container are leftover garbage that `runc delete` should be happy to remove. This also has the benefit of not affecting compatibility with pre-1.2 runc. I agree the compatibility concern isn't enormous, given what an edge case host-pidns containers are, but the difference between this and the shared cgroup stuff is that the shared cgroup stuff was categorically broken and thus unusable -- this is a straightforward change of behaviour, not removal of a broken mis-feature.
I think we should drop this part.
What you mean is that we should drop this part and change to killing the leftover processes in the stopped container when using `runc delete` without `-f`?
Yes, I think `runc delete` on a container with a dead init process should work without `-f`, because the container is considered "stopped".
I can accept changing it back to the same as before.
But there is another issue, discussed in #4040 (comment) (I can accept fixing it in another PR; this PR is only to fix the regression introduced by #3825).
Maybe we have to retry, and still check whether there are some processes left in the container, even after we have tried our best to kill them. In that case, we should error out and not delete anything about this container.
Please see: 120dad4
Ah, I see. I wonder if @kolyshkin's suggestion in #4040 (comment) is workable. @kolyshkin, is there a reason you didn't go the route of making `!c.hasInit()` into `!c.hasProcesses()`, so that the container is still in the `running` state despite not having a pid1?
My issue with returning an error here is that it makes `stopped` containers act strangely in this one scenario. If we treated containers that contain any processes at all as `running`, then we could require people to use `runc kill`, which would solve the consistency issue.
I would prefer that over killing the processes in `runc delete` (assuming no other issues come up with making a container without the init process act as a "running" container -- I suspect we have a fair amount of code that makes assumptions about that behaviour that aren't ideal), but I would prefer either option over returning an error here.
I will change the last commit to kill the remaining processes rather than error out.
done
Force-pushed 0c14686 to 7e571fd
@lifubang I've changed the cgroup removal logic, hopefully fixing the related issues. If you have your own repro, can you try it on this branch?
libcontainer/state_linux.go (Outdated)
// to kill those processes here.
if !c.config.Namespaces.IsPrivate(configs.NEWPID) {
	_ = signalAllProcesses(c.cgroupManager, unix.SIGKILL)
}
err := c.cgroupManager.Destroy()
We said we should not ignore the error from destroy, meaning that we have an opportunity to destroy it again once we get an error. But if we ignore this cgroup destroy error, the state directory will be removed in the next several lines, so the container disappears from runc but leaves some cgroup paths or some processes of this container on the host, and we can't delete them from runc anymore.
You are right, but I did not modify this code here. I surely can.
Are you suggesting something like this?
if err := c.cgroupManager.Destroy(); err != nil {
	return fmt.Errorf("can't remove container cgroup: %w", err)
}
@lifubang added, PTAL
Are you suggesting something like this?
Yes, but I'm puzzled about why we ignored this error before. I will spend some time looking through the git log to find out the reason, if one exists.
Are you suggesting something like this?
Yes, but I'm puzzled about why we ignored this error before. I will spend some time looking through the git log to find out the reason, if one exists.
Apparently you don't read commit messages 😄 I already looked and wrote about it.
@Burning1020, PTAL: is there a chance to check whether this PR has fixed your issue #4040 or not?
OK. Will check, thanks!
I think #4047 has been fixed by this PR, but I can't confirm whether #4040 is fixed by this PR too. 🙏🏻 @Burning1020
LGTM. #4040 fixed, thanks.
LGTM
@cyphar @AkihiroSuda PTAL
The semantics of runType are slightly complicated, and the only place where we need to distinguish between Created and Running is refreshState. Replace runType with the simpler hasInit, simplifying its users (except refreshState, which now figures out on its own whether the container is Created or Running). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Let's use c.hasInit and c.isPaused where needed instead of c.currentStatus, for simplicity. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit f8ad20f made it impossible to kill leftover processes in a stopped container that does not have its own PID namespace. In other words, if a container init is gone, it is no longer possible to use `runc kill` to kill the leftover processes. Fix this by moving the check if container init exists to after the special case of handling the container without own PID namespace. While at it, fix the minor issue introduced by commit 9583b3d: if signalAllProcesses is used, there is no need to thaw the container (as freeze/thaw is either done in signalAllProcesses already, or not needed at all). Also, make signalAllProcesses return an error early if the container cgroup does not exist (as it relies on it to do its job). This way, the error message returned is more generic and easier to understand ("container not running" instead of "can't open file"). Finally, add a test case. Fixes: f8ad20f Fixes: 9583b3d Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit f8ad20f moved the kill logic from container destroy to container kill (which is the right thing to do). Alas, it broke the use case of doing "runc delete -f" for a container which does not have its own private PID namespace, when its init process is gone. In this case, some processes may still be running, and runc delete -f should kill them (the same way as "runc kill" does). It does not do that because the container status is "stopped" (as runc considers the container with no init process as stopped), and so we only call "destroy" (which was doing the killing before). The fix is easy: if --force is set, call killContainer no matter what. Add a test case, similar to the one in the previous commit. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The current code is only doing retries in RemovePaths, which is only used for cgroup v1 (cgroup v2 uses RemovePath, which makes no retries). Let's remove all retry logic and logging from RemovePaths, together with: - the os.Stat check in RemovePaths (its usage probably made sense before commit 19be8e5 but not after); - error/warning logging from RemovePaths (this was added by commit 19be8e5 in 2020, and so far we've seen no errors other than EBUSY, so reporting the actual error proved to be useless). Add the retry logic to rmdir, via a second bool argument, retry. Decrease the initial delay and increase the number of retries from the old implementation so it can take up to ~1 sec before returning EBUSY (was about 0.3 sec). Hopefully, as a result, we'll have fewer "failed to remove cgroup paths" errors. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If container.Destroy() has failed, runc destroy still returns 0, which is wrong and can result in other issues down the line. Let's always return the error from destroy in runc delete. For runc checkpoint and runc run, we still treat it as a warning. Co-authored-by: Zhang Tianyang <burning9699@gmail.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(For a container with no private PID namespace, that is). When runc delete (or container.Destroy) is called on a stopped container without private PID namespace and there are processes in its cgroup, kill those. Add a test case. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For some reason, container destroy operation removes container's state directory even if cgroup removal fails (and then still returns an error). It has been that way since commit 5c246d0, which added cgroup removal. This is problematic because once the container state dir is removed, we no longer know container's cgroup and thus can't remove it. Let's return the error early and fail if cgroup can't be removed. Same for other operations: do not proceed if we fail. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
LGTM, thanks for wrangling this one. On reflection, I agree continuing to treat containers without their init process as "stopped" is the simpler solution.
Commit d9f2a24 has a minor issue.
}
return destroy(p.c)
if p.c.hasInit() {
	return ErrPaused
When `t == Created`, `hasInit()` returns false:
- Before this commit: it returns ErrPaused.
- After this commit: `p.c.cgroupManager.Freeze(configs.Thawed)` is executed.
libcontainer/container_linux.go (Outdated)
switch status {
case Running, Created, Paused:
default:
	if !c.hasInit() {
This also seems to change the logic around the container status?
If you have some problems when using runc, feel free to open an issue.
This fixes a couple of regressions introduced in #3825, mostly related to
killing and destroying containers without their own private PID namespace,
and with the initial container process gone. In this scenario, the container
is considered stopped, but there might be some leftover processes in
the container cgroup.
Commit f8ad20f made it impossible to kill those leftover processes.
Fix this by moving the check if container init exists to after the
special case of handling the container without own PID namespace.
- `runc delete -f` is not killing leftover processes either, ignoring those.
- `runc delete` in runc 1.1 and earlier used to kill leftover processes. I think this is a bug, as `delete` (without `-f`) is not supposed to kill anything. With this PR, such an error is reported.
Fix the minor issue introduced by commit 9583b3d:
if signalAllProcesses is used, there is no need to thaw the
container (as freeze/thaw is either done in signalAllProcesses already,
or not needed at all).
Also, a few non-regressions:
- Make signalAllProcesses return an error early if the container cgroup does not exist (as it relies on it to do its job). This way, the error message returned is more generic and easier to understand ("container not running" instead of "can't open file").
- Make `runc delete` return with a non-zero exit code if the container is not removed.
- `runc delete`: do not remove container state if the container's cgroup can't be removed (as otherwise we'll lose the container cgroup information forever). Suggested by @lifubang.
Appropriate test cases were added to avoid future regressions.
Fixes #4047.
Fixes #4040.