Bug when service is crashed and restarted initctl shows wrong pid #226

Closed
hongkongkiwi opened this issue Feb 11, 2022 · 4 comments

@hongkongkiwi
Contributor

hongkongkiwi commented Feb 11, 2022

I have the following service:

service [12345789] name:myapp :media pid:!myapp:media /usr/sbin/myapp-media -P /run/myapp:media.pid -- Media daemon

Using initctl status myapp:media gives the correct result.

However, if I kill the process using the pid shown above (to simulate a crash), finit restarts the app, but the pid reported by initctl status is not updated afterwards.

The pid file itself is correct because my service manages it; it just seems like finit doesn't reread the file (or update its internal db) when restarting the service.

This is a problem for me because I'm using my workaround command from #225 to send signals. I would prefer to have initctl tell me the correct pid rather than read the pid file in my own script, because then the script has to know the pid file name.
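For reference, the fallback the reporter wants to avoid (reading the PID file from a script) could look roughly like the sketch below. It is only an illustration: the helper names `read_pid` and `is_alive` are made up for this example and are not part of finit or initctl.

```python
import os
from pathlib import Path


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if missing or malformed."""
    try:
        pid = int(Path(pid_file).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None
    return pid if pid > 0 else None


def is_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing, it only checks the target
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

With the service line above, a script would call `read_pid("/run/myapp:media.pid")`, which is exactly the hard-coded file name knowledge the reporter would rather not carry around.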

troglobit added a commit that referenced this issue Feb 12, 2022
We extend service.sh to emulate a well-behaved service that creates
and removes its own PID file.

The new test emulates a crash by sending SIGKILL to service.sh.  We then
verify that Finit restarts it, and eventually registers the new PID when
the service recreates the PID file.

Issue #226

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
@troglobit
Owner

I'm afraid this must be your kernel again. I just added a new test¹ for this particular case and I cannot reproduce the problem. Allow me to explain a little more about the monitoring in Finit: when a well-behaved service (A) starts up in the foreground, Finit knows its PID, but to be able to safely start any depending services (B and C) it waits for the service (A) to create its PID file. Finit reads all PID files created in /run on every inotify event from the kernel. If it finds the PID it is waiting for in the expected PID file, the pid condition of service (A) is asserted.

Hence, if inotify is not working properly that mechanism is broken. There may be unexpected behavior/artifacts in internal structs when this occurs, e.g. wrong PID shown etc.
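The check described above can be modelled with a small sketch. This is not Finit's code (Finit is written in C and is driven by inotify events); the function name `pid_condition_met` and its parameters are assumptions made for illustration, and the check is called by hand here instead of from an inotify callback.

```python
from pathlib import Path


def pid_condition_met(run_dir, pid_file_name, expected_pid):
    """Return True when run_dir/pid_file_name contains expected_pid."""
    path = Path(run_dir) / pid_file_name
    try:
        found = int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return False  # file not created yet, or only partially written
    return found == expected_pid
```

In this model the result is only as fresh as the last call. If the event that should trigger the check never arrives, nothing re-evaluates the file, which is the kind of stale state described above.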

__
¹ the first run failed because I forgot to add the testcase to EXTRA_DIST. Here's a link to the second run: https://github.com/troglobit/finit/runs/5167392996?check_suite_focus=true#step:7:612

troglobit added a commit that referenced this issue Mar 20, 2022
At startup (and reconf) of systems with lots of services there is a risk
of losing inotify events, e.g., PID file creation/delete events.  This
patch increases the receive buffer (doubles it).

On Linux the getsockopt() for SO_RCVBUF returns double the set size, due
to housekeeping in the kernel.  So we don't have to do any adjustments
when setting it.

Issue #226

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
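
The SO_RCVBUF behaviour mentioned in the commit message above can be observed on any socket. The sketch below reads the current size and writes the same value back, which on Linux should leave the buffer larger than before because the kernel doubles the value passed to setsockopt(). The helper name `double_rcvbuf` and the choice of an AF_UNIX datagram socket are assumptions for this example; the commit message does not say which socket Finit applies this to, and this is one reading of "no adjustments when setting it", not the actual patch.

```python
import socket


def double_rcvbuf(sock):
    """Write the reported SO_RCVBUF back unchanged; return (before, after)."""
    before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    # On Linux the kernel doubles the value given here (capped by
    # net.core.rmem_max), so no manual multiplication is done.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, before)
    after = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return before, after


if __name__ == "__main__":
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        print(double_rcvbuf(s))
```

The exact numbers depend on the kernel's rmem_default and rmem_max settings, and other operating systems do not double the value.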
@troglobit
Owner

So, I have to retract my previous statement ... I got a very similar report (privately) from a client. They had spotted a behavior just as you described, but with dnsmasq, when reconfiguring their system at runtime. Finit refused to restart the dying service, hanging on to its old PID.

I've been attempting to recreate this problem using the test case I mentioned previously, start-kill-service.sh. It's been really hard ... that is, until I increased the number of laps of the kill/restart sequence from 1000 to 100000! It turns out I get the same behavior anywhere from around lap 2000 to lap 74000; it has not been consistent at all.
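The shape of such a kill/restart loop, reduced to a standalone model, is sketched below. It does not involve Finit or start-kill-service.sh: the "service" is a sleeping Python child process, the script itself plays the supervisor, and the name `kill_restart_laps` is made up for this example.

```python
import signal
import subprocess
import sys

# Stand-in for a real service: a child that just sleeps.
CHILD = [sys.executable, "-c", "import time; time.sleep(60)"]


def kill_restart_laps(laps):
    """Kill and restart a child `laps` times; return the PIDs that were killed."""
    seen = []
    proc = subprocess.Popen(CHILD)
    try:
        for _ in range(laps):
            seen.append(proc.pid)
            proc.send_signal(signal.SIGKILL)  # simulate a crash
            proc.wait()                       # reap the child, as a supervisor would
            assert proc.returncode == -signal.SIGKILL
            proc = subprocess.Popen(CHILD)    # restart; proc.pid is now the new PID
    finally:
        proc.kill()
        proc.wait()
    return seen
```

A real test would additionally compare the PID reported by initctl status with the PID file after every lap, since that is where the mismatch shows up.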

I've had several theories over the last few weeks, but none really panned out until this morning, when I managed to enable logging in a reasonable way and found that Finit does indeed detect the PID crashing (so signals aren't lost), but it thinks the process is a forking service (sysv start script) and exits early, waiting for the daemon/script to create its PID file ...

Tweaking the classification of what is a forking service seems to be the solution. I've now rerun the test (100000 laps) twice without a problem! So I'll be adding some more tests to also verify forking services with this tweak, but it's looking very promising.

Thanks for reporting this, and sorry for my being so dismissive earlier!

@hongkongkiwi
Contributor Author

No problem, thank you so much for investigating this further. I too thought it might just have been some strange bug in earlier inotify implementations on my kernel.

I did think it was a bit strange, as I have used another inotify daemon implementation on my system and, with the tiny patch I mentioned in another thread, I'm able to get (I think) very reliable pid detection including delete, update, create etc., so that's why I was a bit confused.

troglobit added a commit that referenced this issue Mar 25, 2022
The slay script checks for common errors and gives some logs and status
of Finit when something goes wrong.  Helps detect issue #226 when the
start-kill-service.sh test runs at 100000 laps.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
troglobit added this to the 4.3 milestone Apr 8, 2022
troglobit self-assigned this Apr 8, 2022
@troglobit
Owner

This should now be fixed. Relevant major commits: ba77e4f, 20290a4, a39d958
