Service start / stop not working correctly #313
Hmm, I'll look into the start/stop/restarting issue. Haven't seen anything like that myself yet, but I'll get back to you.

Re: PID files. By default, Finit expects services to maintain their own PID files. It monitors /run for new files (with some heuristics applied) to sniff new PID files and match them to the PID it recently forked off. PID files are actually the primary synchronization/readiness mechanism, which you've probably already seen traces of. E.g., if service B depends on A you can set a condition like this:
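(The config snippet from this comment didn't survive extraction; a minimal sketch of such a dependency in Finit .conf syntax, with hypothetical daemons serviceA and serviceB:)

```
# Sketch only: B declares a <pid/A> condition, so Finit starts (and
# restarts) B only once service A's pid/ condition is asserted.
service [2345] name:A /usr/sbin/serviceA -- Service A
service [2345] name:B <pid/A> /usr/sbin/serviceB -- Service B, waits for A
```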
Only trouble is that service A must reassert (touch) its PID file after the following:
Most daemons today don't rewrite/touch their PID file on SIGHUP, so PID condition A will then remain in "flux" and B will not be restarted properly. I don't know if this could be your problem, since you don't mention how your services are started by Finit, or if you use any conditions for instance. But I thought it worth mentioning since I write such crappy docs.

Finit has support for creating PID files on its own, as well as figuring out stuff from forking daemons (as was the practice with SysV init scripts). Since most daemons don't behave properly, a few other standards for "readiness notification" were invented. For example, in systemd the application is expected to write READY=1 to the UNIX socket named in the NOTIFY_SOCKET environment variable.
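(For concreteness, a generic sketch of that readiness handshake as a daemon might implement it; this illustrates the systemd-style notification protocol only, it is not Finit or project code, and the function name is made up:)

```c
/*
 * Minimal sketch of systemd-style readiness notification: the daemon
 * sends "READY=1" to the AF_UNIX datagram socket named by the
 * NOTIFY_SOCKET environment variable.  Illustration only.
 */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int notify_ready(void)
{
	struct sockaddr_un sun = { .sun_family = AF_UNIX };
	const char *path = getenv("NOTIFY_SOCKET");
	const char *msg  = "READY=1";
	socklen_t len;
	ssize_t rc;
	int sd;

	/* Not started with readiness notification enabled */
	if (!path || strlen(path) >= sizeof(sun.sun_path))
		return -1;

	len = offsetof(struct sockaddr_un, sun_path) + strlen(path);
	memcpy(sun.sun_path, path, strlen(path));
	if (sun.sun_path[0] == '@')	/* abstract socket namespace */
		sun.sun_path[0] = '\0';

	sd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
	if (sd == -1)
		return -1;

	rc = sendto(sd, msg, strlen(msg), 0, (struct sockaddr *)&sun, len);
	close(sd);

	return rc == (ssize_t)strlen(msg) ? 0 : -1;
}
```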
Sorry, cannot reproduce. I'm running Finit master (on myLinux) with the following config for my service:
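(The actual config from this comment wasn't preserved in the copy above; presumably something along these lines, with the name and path assumed:)

```
# Assumed minimal service stanza, not the original snippet from the comment
service [2345] name:serv /usr/sbin/serv -- Test service serv
```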
When I stop it I get the following trace. Worth noting is the log message
When I start it up again I can see Finit registering my wish to enable it:
So ... unfortunately I cannot reproduce your issue, or even speculate further on what it could be. Sorry :-/
Ah, one thing that might be interesting for you, don't know if you've noticed it yourself, but in your restart case:
Your application exits with exit code 1; that's why it's constantly restarting.
Sorry, I should have included my service file. It was really simple so I just brushed over it.
I didn't like that I saw what looks like a "pid file found" message as part of the stopping, so I deleted the pidfile.so plugin from the disk. Now starting/stopping all of my services is reliable. I am not sure if removing this plugin will break my service<->service dependencies.

For reference, here is a good vs bad daemon log. The service files are identical. It seems the good service does a

EDIT: Looking at the code, I wonder if finit thinks these are forking services? Is that automatically detected? Either way, none of these daemons are forking, just a simple mainloop with while(true).

Good:
Bad:
As for the restarting bug, I think you are correct and I just missed the fact that the daemon was crashing due to me stopping some other services. Whoops.

To add to the fire, I didn't mention yesterday that I also see services get stuck in a "running" state (both with and without the pidfile plugin). I did a bit more work and I believe I have found the reason. If myservice is crashing and I call `initctl stop myservice`, the service shows as "running" with a PID of 0.
Calling stop is basically a `svc_stop()` followed by a `service_step()` (per `stop()` in api.c).
How to recreate the state:
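(The original list of steps was lost in extraction; a rough reconstruction from the surrounding description, with the service name assumed:)

```sh
# Assumed reconstruction of the reproduction steps described above:
# 1. myservice keeps crashing, so Finit arms its restart/retry timer.
# 2. While that retry timer is still pending, stop the service:
initctl stop myservice
# 3. When the timer later fires, the service is left in the bogus state:
#    status reports "running" with pid 0 even though nothing is running.
initctl status myservice
```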
So it looks like doing a service stop also needs to cancel the retry timer.
Some more debugging later and sure enough ... No idea how that happens. I'll keep searching.
So these daemons take some time to boot as they rely on certain things. I guess this causes finit to detect them as forking. I am happy with that, no harm done, except for how it interacts with stopping. With my very limited knowledge, I think one of these fixes the issue (probably missing a bunch of edge cases I don't know about yet):

```diff
 void service_forked(svc_t *svc)
 {
 	/* Stop forking_retry() timer */
 	service_timeout_cancel(svc);

-	/* Not crashing, restore RUNNING state */
-	svc_unblock(svc);
-	svc_set_state(svc, SVC_RUNNING_STATE);
+	if (!svc_is_stopping(svc)) {
+		/* Not crashing, restore RUNNING state */
+		svc_unblock(svc);
+		svc_set_state(svc, SVC_RUNNING_STATE);
+	}
 }
```

Although I guess the mask check earlier should be different; a closed pid file probably shouldn't trigger a `svc_started()`:

```diff
@@ -81,29 +81,37 @@ static void pidfile_update_conds(char *dir, char *name, uint32_t mask)
 	dbg("Found svc %s for %s with pid %d", svc->name, fn, svc->pid);
 	mkcond(svc, cond, sizeof(cond));

-	if (mask & (IN_CLOSE_WRITE | IN_ATTRIB | IN_MODIFY | IN_MOVED_TO)) {
+	if (mask & (IN_ATTRIB | IN_MODIFY | IN_MOVED_TO)) {
 		svc_started(svc);
 		if (!svc_has_pidfile(svc)) {
 			dbg("Setting %s PID file to %s", svc->name, fn);
 			pid_file_set(svc, fn, 1);
 		}
+
```

Add some sort of dedicated IN_CLOSE_WRITE check later? Perhaps it's just part of the IN_DELETE check?
To fix the restarting/stop race condition I threw in this quick fix:

```diff
--- a/user/init/src/api.c
+++ b/user/init/src/api.c
@@ -63,6 +63,7 @@ static int stop(svc_t *svc, void *user_data)
 	if (!svc)
 		return 1;

+	service_timeout_cancel(svc);
 	svc_stop(svc);
 	service_step(svc);
```

Although I wouldn't be surprised if you prefer this cancel logic to be inside the state machine instead. Let us know if I can help further and if you want me to make a pull request, or if you are happy to smash out some (probably much better) fixes.
This is very interesting, I guess the results above are with latest Finit master? Anyway, I'll be going over this all today (CET) to see if I can reproduce, thank you! If you have any other input on the smallest possible setup to reproduce, I'd be most grateful. I try to encode problems like these in test cases; many of them use the stupid test/common/serv.c daemon, which I'll use two of with different configurations.

Some quick comments/replies to your questions above:
Yes. f6b5b02
From my debug prints, it seems like:
Is this an issue? All of our daemons just close the pid file and leave it alone. I think init should clean these up IIRC.
Cheers. You can just credit me directly. I should probably update my GitHub profile since I don't work at Cohda Wireless anymore 😅
Great, thanks!
Ah right, PID file detection is overloaded with the legacy forking option ... ouch. OK, I'll have a look at this; it gives me a better idea also of what to focus on when reproducing.
What do you mean, clean up PID files? If an application creates them it's their responsibility, or their start script (SysV init).
OK 😃
Not following here. But just FYI, there are no more verbosity flags.
Status update: have reproduced both of the issues above.

Just to clarify, that pidfile.so sets the PID file is just the pidfile auto-discovery mechanism. That we want to keep, since it's used by the process synchronization with conditions.

I'll start working on a regression test now to trigger these bugs and then commit fixes for them. Your restart callback cancellation being one of them.
First set of svc_t properties not possible to debug otherwise. See issue #313 for a background.

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
When a service crashes, Finit launches a restart timer that periodically will try to restart the service. If a user calls `initctl stop foo`, Finit must cancel this timer callback, otherwise we may end up with a weird state where Finit thinks the service is running, but pid: 0, i.e., not started.

Note: this fix has been expanded upon from the proof of concept submitted by Jack Newman.

Bug #313

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
When a service starts up, the pidfile.so plugin automatically detects any new pidfiles that show up in the system. If it finds a matching service which does not have a custom configured pidfile, it auto-assigns the detected pidfile to the service. However, when the user later calls `initctl reload` to update the running configuration set, Finit service_register() mistakenly will set the service to forking since it now suddenly has a pidfile ...

The new-style way of configuring a forking service is to set the type:forking option. The old compat way was to set `pid:!/path/to/pidfile`, so in service_register() we must be very careful not to interpret the registered pidfile as configured by the user -- i.e., check the pid option.

Bug #313

Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
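(For readers less familiar with Finit configuration, the two spellings mentioned in the commit message look roughly like this in a .conf file; the names and paths here are illustrative only:)

```
# New style: declare the service as forking explicitly
service [2345] name:mydaemon type:forking /usr/sbin/mydaemon -- My daemon

# Old compat style: the '!' in the pid: option implied a forking service
service [2345] name:mydaemon pid:!/run/mydaemon.pid /usr/sbin/mydaemon -- My daemon
```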
There, regression test and fixes in place. Hope I nailed all your various issues down ... leaving this issue open until you've had the time to verify. (Don't mind the failed CI run for ❌ 3b601d0, just me having messed up the chroot for tests in Azure. Works now as intended.)
Perfect! Looks like this issue is solved. Manually starting/stopping/crashing is all good and I can't fault it. 🎉🎉 Thank you!

Not sure if I should make a new ticket, but I found the same race condition with crashing and conditions.
I can't seem to get this same race condition using the ... If I do a ... The log:
Great to hear! :-) Yes, please create a new issue for this. It's super hard to track different problems in the same issue. I agree with your suggestion in 2. above, and it'll likely solve the problem.
Awesome, thank you!
When attempting to stop or restart services, they can get into wonky states where finit doesn't seem to correctly stop or start them again.
Some side information:
All of our daemons run in the foreground
Our daemons create their own PID file as they start. The documentation seems to indicate finit should create/destroy PID files for the services automatically, but if I make the daemon write to a different PID file, finit seems to automatically detect this, as displayed by the pid path in `initctl status myservice`. I attempted to use the forking mode and to manually select the PID file, but it does not seem to fix the issue below.

Most of the services that seem to have issues are using wrapper scripts:
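(The wrapper-script snippet from the report was lost; a hedged sketch of what such a setup typically looks like in Finit terms, with all names and paths made up for illustration:)

```
# Assumed example only -- not the original service file from the report.
# The wrapper script execs the real daemon, which runs in the foreground
# and writes its own PID file under /run.
service [2345] name:myservice /usr/bin/myservice-wrapper.sh -- My service (via wrapper)
```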
Sometimes certain services seem to have trouble starting and stopping.
The other failure mode is that after an application has crashed, or I have changed runlevels (which has the same not-stopping/starting bug), it seems like I can't get certain services to start again:
Running the application manually works perfectly. Watching ps while finit attempts to restart shows that nothing ever gets started.
This happens on both v4.3 with the restart fix patch and master. I can't remember if I saw this on the pure v4.3 branch.