Rebooting permanently stalls if finit shows [WARN] for an app when attempting to kill #227

Closed
hongkongkiwi opened this issue Feb 11, 2022 · 7 comments

@hongkongkiwi
Contributor

hongkongkiwi commented Feb 11, 2022

When rebooting, if a service has a "[WARN]" status the reboot never completes.

I was testing killing an app with kill <pid> and letting finit restart it. finit doesn't seem to pick up the correct PID when this happens; see bug #226.

When this happens, I guess finit gets "out of sync", so when running finit 6 to reboot, it stalls on the above app:

# finit 6
[FAIL] Saving sound settings
[FAIL] Saving random seed
[ OK ] Stopping System log daemon
[ OK ] Stopping Kernel log daemon
[ OK ] Stopping Chrony Time daemon
[WARN] Killing MyApp Media daemon

By stall, I mean it sits forever on the [WARN] line.

In normal cases finit 6 works totally fine as long as it can kill this app, but any "[WARN]" line seems to halt the rebooting process permanently (no matter how long I wait).

@hongkongkiwi hongkongkiwi changed the title Rebooting stalls if finit shows [WARN] for an app when attempting to kill Rebooting permanently stalls if finit shows [WARN] for an app when attempting to kill Feb 11, 2022
@hongkongkiwi
Contributor Author

Just to show this is not a fluke, for some reason dbus had the [WARN] status and the same situation happened:

# finit 6
[FAIL] Saving sound settings
[ OK ] Saving random seed
[FAIL] Stopping D-Bus message bus daemon
[WARN] Killing D-Bus message bus daemon

It hangs at this point forever.

@troglobit
Owner

troglobit commented Feb 12, 2022

Interesting, I'll have a look at this in detail and try to set up a testcase for it. We just had a PR for shutdown/kill so there might be a regression.

Just to make sure, which version of Finit are you running; the latest release, or a GIT version? (The PR I mentioned above is not released yet.)

@troglobit
Owner

Progress: so far I've only been able to reproduce the [WARN], but for me the system reboots fine. I'm starting to suspect it's not the stopping of services that's at fault, but rather something else. Could you try calling initctl debug before initctl reboot?

[ OK ] Stopping Web interface
[WARN] Killing Simple NTP daemon
[    9.157661] reboot: Restarting system

@troglobit
Owner

So, the fix for this issue in 7dc7f9a handles the "stall" in reboot. The actual root cause, which you hinted at, really seems to be #226. See that issue for an update on that as well.

@hongkongkiwi
Contributor Author

Oh, that's great. Sorry I didn't get a debug log earlier; we are doing some system porting and I had to switch (temporarily) to another project. I'm really glad you were able to find the cause of this. We are on an embedded platform, so having it not behave as expected when shutting down was quite challenging.

This was a little bit inconsistent for me to replicate, but I'll try the latest version. Thanks for the fix!

@troglobit
Owner

Yeah, I'm mostly on embedded systems as well, and reboot must always work. Hope it works better for you as well :)

@troglobit
Owner

Reopening, I just ran into this one myself trying to reboot and found the following:

...
finit[1]: service_kill():(null): Sending SIGKILL to process group 2577
finit[1]: Stopping pod:system[2577], sending SIGKILL ...
[WARN] Killing System container
...

After which everything just hung forever.

The interesting bit is the (null) above. It comes from an internal function that looks in /proc/2577/status for the actual process name. Here it could not find one, and the only way for that function to fail is if 2577 no longer exists!
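
To illustrate why a missing process shows up as (null), here is a minimal sketch of such a lookup. It is not Finit's actual code: the name proc_name() and its signature are made up for this example, and it only assumes the usual Linux layout where /proc/<pid>/status contains a "Name:" line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not Finit's real function: copy the "Name:" field
 * of /proc/<pid>/status into buf.  Returns buf on success, or NULL when
 * the file cannot be opened or has no "Name:" line.  On Linux, a failed
 * open typically means the PID no longer exists.
 */
static char *proc_name(pid_t pid, char *buf, size_t len)
{
    char path[64], line[256];
    char *result = NULL;
    FILE *fp;

    assert(buf != NULL && len > 0);

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    fp = fopen(path, "r");
    if (!fp)
        return NULL;    /* no such process, or /proc is not available */

    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, "Name:", 5) == 0) {
            char *name = line + 5;

            name += strspn(name, " \t");        /* skip the separator */
            name[strcspn(name, "\n")] = '\0';   /* strip the newline  */
            snprintf(buf, len, "%s", name);
            result = buf;
            break;
        }
    }
    fclose(fp);

    return result;
}
```

A caller that passes the NULL return straight to a "%s" format gets "(null)" from glibc, which matches the log line above.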

Analysis

In my use case, pod:system[2577] is a podman container which, as it turns out, starts conmon to monitor the container. However, the PID 2577 recorded in the container pidfile belonged to the container's init process, not conmon itself. Since conmon is a process monitor and sub-reaper, Finit never got any feedback to proceed, and the service_kill() function exited early, leaving Finit waiting forever ...
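
One possible way to guard against waiting on a PID that is already gone is sketched below. This is not the fix that went into Finit; pid_exists() and wait_gone() are hypothetical names, and the sketch only relies on the POSIX behavior that kill() with signal 0 performs the existence check without delivering a signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper: true if a process with this PID currently exists.
 * Note that an unreaped zombie still counts as existing.
 */
static bool pid_exists(pid_t pid)
{
    if (pid <= 0)
        return false;

    /* Signal 0 only runs the existence and permission checks.
     * EPERM means the process exists but we may not signal it. */
    return kill(pid, 0) == 0 || errno == EPERM;
}

/*
 * Hypothetical bounded wait: poll until the PID is gone or timeout_ms
 * has elapsed.  Returns true if the process is gone, false on timeout,
 * so the caller can move on with the shutdown either way.
 */
static bool wait_gone(pid_t pid, int timeout_ms)
{
    const struct timespec step = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int waited;

    for (waited = 0; pid_exists(pid); waited += 10) {
        if (waited >= timeout_ms)
            return false;
        nanosleep(&step, NULL);
    }

    return true;
}
```

With a check like this, a stale PID from a pidfile is treated as "already stopped" instead of something to wait on, and the timeout keeps a reboot from blocking indefinitely even when no exit notification ever arrives.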

@troglobit troglobit reopened this Jun 27, 2023
@troglobit troglobit added this to the 4.5 milestone Jun 27, 2023