Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Service restart delay not working #310

Closed
JackNewman12 opened this issue Nov 1, 2022 · 11 comments
Closed

Service restart delay not working #310

JackNewman12 opened this issue Nov 1, 2022 · 11 comments

Comments

@JackNewman12
Copy link
Contributor

When a service is crashing I find that finit seems to restart the app instantly.
This means all 10 retries can end up being used in a few milliseconds and ends up disabling the service.

Interestingly I can still see the event timer is running as X seconds later finit will print the "Successfully restarted" line.

The conf file is very simple:

service /sbin/MYSERVICE

Sorry I didn't have timestamps on the log, hopefully my comments are clear enough. Tomorrow I should get a chance to recapture this with timestamps.

Thanks!

finit: sigchld_cb():Collected child 1225
finit: service_monitor():collected MYSERVICE(1225), normal exit: 0, signaled: 1, exit code: 0
finit: service_step():           MYSERVICE(   0):  running  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/restart
finit: cond_set_oneshot_noupdate():service/MYSERVICE/restart => /var/run/finit/cond/service/MYSERVICE/restart
finit: cond_update():service/MYSERVICE/restart
finit: service_step():delayed restart of MYSERVICE
finit: service_step():           MYSERVICE(   0): ->  restart
finit: service_step():           MYSERVICE(   0):  restart disabled/clean   cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: service_retry():MYSERVICE crashed, trying to start it again, attempt 5
finit: Service MYSERVICE[1225] died, restarting (5/10)
finit: service_step():           MYSERVICE(   0):   halted  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: service_step():           MYSERVICE(   0): ->    ready
finit: service_step():           MYSERVICE(   0):    ready  enabled/clean   cond:on

...app ends up restarting instantly ...

finit: service_start():Starting /sbin/MYSERVICE
finit: Starting MYSERVICE[1628]
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: service_step():           MYSERVICE(1628): ->  running
finit: service_step():           MYSERVICE(1628):  running  enabled/clean   cond:on


... Real delay of about 2 seconds ...


finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: Successfully restarted crashing service MYSERVICE.


... waiting some time and manually killing app again ...

finit: sigchld_cb():Collected child 1628
finit: service_monitor():collected MYSERVICE(1628), normal exit: 0, signaled: 1, exit code: 0
finit: service_step():           MYSERVICE(   0):  running  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/restart
finit: cond_set_oneshot_noupdate():service/MYSERVICE/restart => /var/run/finit/cond/service/MYSERVICE/restart
finit: cond_update():service/MYSERVICE/restart
finit: service_step():delayed restart of MYSERVICE
finit: service_step():           MYSERVICE(   0): ->  restart
finit: service_step():           MYSERVICE(   0):  restart disabled/clean   cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: service_retry():MYSERVICE crashed, trying to start it again, attempt 6
finit: Service MYSERVICE[1628] died, restarting (6/10)
finit: service_step():           MYSERVICE(   0):   halted  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: service_step():           MYSERVICE(   0): ->    ready
finit: service_step():           MYSERVICE(   0):    ready  enabled/clean   cond:on

... app restarts instantly ...

finit: service_start():Starting /sbin/MYSERVICE
finit: Starting MYSERVICE[1881]
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: service_step():           MYSERVICE(1881): ->  running
finit: service_step():           MYSERVICE(1881):  running  enabled/clean   cond:on


... manually killing the app again in less than 5 seconds ...


finit: sigchld_cb():Collected child 1881
finit: service_monitor():collected MYSERVICE(1881), normal exit: 0, signaled: 1, exit code: 0
finit: service_step():           MYSERVICE(   0):  running  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/restart
finit: cond_set_oneshot_noupdate():service/MYSERVICE/restart => /var/run/finit/cond/service/MYSERVICE/restart
finit: cond_update():service/MYSERVICE/restart
finit: service_step():delayed restart of MYSERVICE
finit: service_step():           MYSERVICE(   0): ->  restart
finit: service_step():           MYSERVICE(   0):  restart disabled/clean   cond:on
finit: sm_step():state: running, runlevel: 2, newlevel: -1, teardown: 0, reload: 0
finit: service_retry():MYSERVICE crashed, trying to start it again, attempt 7
finit: Service MYSERVICE[1881] died, restarting (7/10)
finit: service_step():           MYSERVICE(   0):   halted  enabled/clean   cond:on
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: service_step():           MYSERVICE(   0): ->    ready
finit: service_step():           MYSERVICE(   0):    ready  enabled/clean   cond:on
finit: service_start():Starting /sbin/MYSERVICE

... app restarts instantly ...

finit: Starting MYSERVICE[1889]
finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: service_step():           MYSERVICE(1889): ->  running
finit: service_step():           MYSERVICE(1889):  running  enabled/clean   cond:on

... Real delay of about 5 seconds ...

finit: cond_clear():service/MYSERVICE/
finit: cond_clear_noupdate():service/MYSERVICE/
finit: cond_set_path():/var/run/finit/cond/service/MYSERVICE/ <= 0
finit: cond_set_oneshot():service/MYSERVICE/running
finit: cond_set_oneshot_noupdate():service/MYSERVICE/running => /var/run/finit/cond/service/MYSERVICE/running
finit: cond_update():service/MYSERVICE/running
finit: Successfully restarted crashing service MYSERVICE.
@troglobit
Copy link
Owner

Interesting! I've been troubleshooting a very similar issue.

What version of Finit are you using, or are you on the bleeding edge as myself?

@JackNewman12
Copy link
Contributor Author

The latest release v4.3.

I must note that I had to create the /var/run/finit/cond/usr folder myself to get usr conditions working. I wonder if something similar is happening with service folder. I'll have a peek today.

Any thoughts as to where I should be looking or extra debug to enable?

@troglobit
Copy link
Owner

OK. There are changes related to this coming up in v4.4 that I hope to release soon.

The usr conditions folder is created by the initctl tool, those conditions aren't really supposed to be managed by anything else.

Before we rush ahead with going into the code. Maybe you can tell me a little bit more about MYSERVICE and what you expect Finit to do, and also what are you doing from the outside? Is it a forking daemon, is it expected to crash (I see the signaled: 1 flag going up), or are you in some of the cases above killing the daemon to test it or Finit's restart mechanism?

The particular code in question starts here:

finit/src/service.c

Lines 1970 to 1981 in f483a8a

/*
* Restart directly after the first crash, except for forking services
* which we need to wait for the forked-off child to create its pid
* file. In both cases, after that, retry after 2 sec
*/
if (!svc->respawn)
_d("delayed restart of %s", svc_ident(svc, NULL, 0));
if (svc_is_forking(svc))
service_timeout_after(svc, 2000, service_retry);
else
service_timeout_after(svc, 1, service_retry);
break;

@JackNewman12
Copy link
Contributor Author

JackNewman12 commented Nov 2, 2022

The usr conditions folder is created by the initctl tool, those conditions aren't really supposed to be managed by anything else.

Yes, although I would expect it to create that folder when required instead of failing. Adding the -create flag to the tool does not fix this. Either way adding a mkdir to a startup script is an easy fix.

Maybe you can tell me a little bit more about MYSERVICE

This bug actually applies to all of our daemons. I've been attempting to switch our (very old) init system over the finit to give us some more powerful features. In this case you can imagine a simple daemon that just sits there and does some Tx/Rx data from Ethernet/Serial/Etc. No forking. It was only when I noticed an application was crashing (i.e. the network was down and we didn't handle it) all 10 retries would happen instantly.

and what you expect Finit to do.

For 99% of cases if the daemon is not running, restart it (with a small delay). Nothing too fancy going on, although with finit I expect we will add some extra conditions like network up/down, service to service dependency, etc.

is it expected to crash (I see the signaled: 1 flag going up), or are you in some of the cases above killing the daemon to test it or Finit's restart mechanism?

In the example above I was just killing the daemon manually using kill MYSERVICE to test Finit's restart mechanism.

I didn't have much time today but it seems to do what I would expect. A service_timeout(service_retry) gets set for X seconds in future but the bottom of the service_step() loop causes it to keep jumping into the next state running-> restart -> halted -> ready.
I would have expected once this callback gets set that it returns early so that the callback ends up doing the work.

finit/src/service.c

Lines 2063 to 2067 in f483a8a

if (svc->state != old_state) {
_d("%20s(%4d): -> %8s", svc_ident(svc, NULL, 0), svc->pid, svc_status(svc));
changed++;
goto restart;
}

@troglobit
Copy link
Owner

Thanks for the quick follow-up, was not expecting that! Very much appreciate some background and where-you're-at status 😃

Be aware that Finit use a lot of fairly modern kernel features sometimes people run into issues because of having an older kernel or missing some kernel config options.

OK, I'll look into the usr condition thing as a separate issue if I can reproduce it.

The initial delay, like you saw, very short. But then it should work as you say ... the service_retry() callback should handle the delay. Unless my mind's slipping and I've forgotten something obvious. I'll have a look at it during the day to see what I can find out.

@JackNewman12
Copy link
Contributor Author

Be aware that Finit use a lot of fairly modern kernel features sometimes people run into issues because of having an older kernel or missing some kernel config options.

We are using 5.4 IIRC. Is there a list of these kernel features I need to enable? We run a pretty minimal build.
I don't see any errors from finit at boot, except missing support for cgroupv2.

Thanks for the support!

@troglobit
Copy link
Owner

That remains to be documented unfortunately, but all the new eventfd/signalfd/etc. and DEVTMPFS ... the best I can do meanwhile is to give you this: https://github.com/troglobit/myLinux/blob/main/board/amd64/linux_defconfig

Finit can take care of bootstrapping a pretty bare rootfs using, e.g., the bootmisc.so plugin (many plugins started out as optional but are now mandatory, this is one such). There are some more pointers in myLinux if you're curious.

and https://github.com/troglobit/myLinux/blob/main/board/common/busybox_defconfig, if you're on an embedded system with BusyBox. Notice BusyBox must not be built with CONFIG_MODPROBE_SMALL.

@troglobit
Copy link
Owner

I believe I have reproduced this issue now. There seems to be at least two cooperating bugs at play here. I'll try to get to the bottom of this over the weekend.

@troglobit troglobit changed the title Service restart timeout not working Service restart delay not working Nov 6, 2022
@troglobit
Copy link
Owner

There, finally found the root cause(s) of this! Thank you for taking the time to report it.

@JackNewman12
Copy link
Contributor Author

Thanks for fixing the restart issue! I managed to test it today and it is working.

If I kill an application it will restart instantly, however if I kill it again within a few seconds it will get the expected delay.
Waiting for 5 seconds and killing the app will have it restart instantly again.

@troglobit
Copy link
Owner

Thank you for taking the time to test again! :)

Yeah, this is by design. We don't want to penalize a single crash, which could be attributed to startup issues. The intention is to increment the delay if the service crashes continuously -- so we track the number of crashes (per some period of time).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants