Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

init.krun does not reap zombie processes #189

Closed
teohhanhui opened this issue May 29, 2024 · 17 comments · Fixed by #190
Closed

init.krun does not reap zombie processes #189

teohhanhui opened this issue May 29, 2024 · 17 comments · Fixed by #190

Comments

@teohhanhui
Copy link
Contributor

(Moved from https://github.com/slp/krun/issues/16)

@asahilina:

Zombie processes inside the VM never get reaped, I guess because init.krun doesn't do this. Although we obviously don't need a full-blown init system, it should at least do this.

Is there some documentation of what PID 1 is expected to do?

Relevant writeups:

@slp
Copy link
Contributor

slp commented May 29, 2024

This is a regression I've also noticed last week. I have a fix here, will create a PR later.

slp added a commit to slp/libkrun that referenced this issue May 30, 2024
Tell the kernel that we want to ignore SIGCHLD so it'll reap our
children for us to avoid leaving zombie objects.

Fixes containers#189

Signed-off-by: Sergio Lopez <slp@redhat.com>
@slp slp closed this as completed in #190 May 30, 2024
slp added a commit that referenced this issue May 30, 2024
Tell the kernel that we want to ignore SIGCHLD so it'll reap our
children for us to avoid leaving zombie objects.

Fixes #189

Signed-off-by: Sergio Lopez <slp@redhat.com>
@ericcurtin
Copy link

This issue came up in my feed. Never realized krun has it's own init system. FWIW systemd is as fast as written from scratch init systems if it's trimmed down like the one in this dracut module-setup.sh file:

https://gitlab.com/CentOS/automotive/rpms/dracut-automotive/-/blob/main/module-setup.sh

I've written an init system more than once, but when compared it to these mini systemd setups, it's typically just as fast and less maintenance to use systemd, with some custom binaries for specific tasks as systemd .service files.

Now the fat complicated systemd configuration from Fedora isn't so fast, but that's just because by default there's like a million features and .service units configured, when only around 5 units are actually needed.

@ericcurtin
Copy link

ericcurtin commented May 30, 2024

The above systemd is configured for initramfs, but just giving an example of how lean systemd can be.

I am glad I wrote an init system at least once for the learning experience though...

@ericcurtin
Copy link

I'm guessing its because we want a statically compiled init system for all OSes 🤔

@asahilina
Copy link
Contributor

Since krun shares the filesystem with the host OS, I don't think systemd is a good idea... I don't think anyone tests multiple instances of systemd that both think they are PID 1 but share a filesystem, that sounds like a recipe for trouble...

@ericcurtin
Copy link

ericcurtin commented Jun 2, 2024

Since krun shares the filesystem with the host OS, I don't think systemd is a good idea... I don't think anyone tests multiple instances of systemd that both think they are PID 1 but share a filesystem, that sounds like a recipe for trouble...

This is one of the reasons podman was created, multiple instances of systemd, sharing filesystems, etc. this use case is regularly tested and deployed.

I'm not saying this is why we should or shouldn't use systemd but saying nobody tests this is just not true.

systemd can be as simple as a binary that acts as a process manager that forks other processes, almost every feature in systemd is optional (and at run-time, just change around the systemd unit files to do what is desired).

@ericcurtin
Copy link

systemd-nspawn does this kinda thing also but I'm more familiar with podman.

@asahilina
Copy link
Contributor

I thought the whole point of containers was that they run with a different filesystem (root)? We run with the same filesystem root.

$ sudo systemd-nspawn -D /
Spawning container on root directory is not supported. Consider using --ephemeral, --volatile=yes or --volatile=state.

Evidently the systemd people don't think this is supposed to work.

@ericcurtin
Copy link

Yup and the podman equivalent is:

sudo podman run -ti --rootfs /:O /bin/bash

sudo podman run -ti --systemd=true --rootfs /:O /usr/lib/systemd/systemd

I do question the approach of sharing the whole root with both OSes, in the vast majority of VM/container solutions, it's pick and share what you need rather than share everything, even if the "share what you need" ends up being 80% of the hosts contents. I also think we'd re-implement less this way.

But if we want to try something unique, why not I guess :)

This is very loosely related to other conversations going on at the moment, again the angle is more towards ephemeral containers though:

https://gitlab.com/fedora/bootc/tracker/-/issues/4

@ericcurtin
Copy link

Like for example if we were desigining systemd to be run in a microVM, it's basically just ensuring you don't include certain directories and populating those with alternate configs, etc. this kind of thing.

@ericcurtin
Copy link

A bootc image for inside the microVM could be very well suited for this use-case eventually also, even on a non-booc OS.

@ericcurtin
Copy link

It also wouldn't be a bad idea to reduce the pressure on virtofs tbh.

@ericcurtin
Copy link

But don't get me wrong, I'd be happy to see this approach continue with virtiofs by default. It's novel and it could be interesting to see a project like this one try something completely different.

@ericcurtin
Copy link

ericcurtin commented Jun 2, 2024

This way of using PID1 for example, doesn't initialize selinux in the guest kernel, now for this use-case maybe some people don't care. But some people care deeply about having selinux on in all running kernel instances.

If we had a systemd binary in there, we could pick whether to initialize selinux or not.

@asahilina
Copy link
Contributor

sudo podman run -ti --systemd=true --rootfs /:O /usr/lib/systemd/systemd

That doesn't actually run systemd, it just sets up the environment to run systemd.

It also doesn't actually use the host filesystem, instead it sets up an overlayfs. If you try to actually use the host FS:

$ sudo podman run -ti --rootfs / /bin/bash
Error: OCI runtime error: crun: pivot_root: Device or resource busy

So it doesn't work. So nobody can be testing this, by definition. It really is a very, very different usecase to all the container stuff people do.

@ericcurtin
Copy link

ericcurtin commented Jun 2, 2024

Yeah I understand, it was the wording of this was open to interpretation:

I don't think anyone tests multiple instances of systemd that both think they are PID 1 but share a filesystem

I read this and instantly got surprised as people do this all the time, but then you clarified with filesystem root, just a misunderstanding.

It wouldn't make sense to have exactly the same root filesystem anyway using two systemd's, the guest one would ideally have a modified /usr, /etc, etc. to trim it down so it's minimised.

I assumed we were gonna do things more like the ChromeOS Linux environment approach, but I guess not, interesting to see where this goes :)

@ericcurtin
Copy link

ericcurtin commented Jun 2, 2024

sudo podman run -ti --systemd=true --rootfs /:O /usr/lib/systemd/systemd

That doesn't actually run systemd, it just sets up the environment to run systemd.

This does run systemd, it sets up systemd and runs it, the /use/lib/systemd/systemd part exec's systemd.

It also doesn't actually use the host filesystem, instead it sets up an overlayfs. If you try to actually use the host FS:

$ sudo podman run -ti --rootfs / /bin/bash
Error: OCI runtime error: crun: pivot_root: Device or resource busy

This is interesting, this is supposed to work, pivot_rooting to yourself is weird though, this should probably be logged as a bug, if anyone cares about this feature.

So it doesn't work. So nobody can be testing this, by definition. It really is a very, very different usecase to all the container stuff people do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants