BUG: flaky exit-code when using need-app and lazy-apps flags together(+FIX) #2640
Comments
@xrmx can you please help push this fix?
With which version of uwsgi have you replicated this?
@xrmx
xrmx added a commit that referenced this issue Jun 1, 2024: Fix for Flaky Exit Code When Using 'need-app' and 'lazy-apps' Flags Together Fix #2640
xrmx pushed a commit to xrmx/uwsgi that referenced this issue Jun 1, 2024: …ing need-app and lazy-apps Fix unbit#2640
Description:
I created a Python application using Flask and uWSGI and it is managed by Supervisord.
Sometimes, the application crashes on initialization, because of a disconnection from another service.
To let Supervisord know that the application has crashed, I use the flag `need-app: true`.
So, when the application crashes on initialization, I expect uWSGI to exit with exit code 22, and Supervisord to restart the uWSGI application until it starts successfully.
In order to achieve that I use this flag of Supervisord:
(which by default tells Supervisord to restart the application whenever the exit code is not 0)
This should work well, and actually does work most of the time.
But the issue is that sometimes uWSGI exits with exit code 0, even though it crashed on an exception and the `need-app` flag is set to `true`.
This behavior is flaky: the application can exit with 0 immediately, or after 10, 100, 200, ... successive restarts.
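To see why an exit code of 0 matters here, the restart decision can be modeled in a few lines of Python. This is only a sketch, assuming that 0 is the sole exit code the process manager treats as expected (as described above); `should_restart` is a hypothetical helper and not part of Supervisord.

```python
# Hypothetical model of the restart decision described above; this is not
# Supervisord's actual implementation.
UWSGI_NEED_APP_EXIT_CODE = 22  # exit code expected from uWSGI when need-app fails


def should_restart(exit_code, expected_codes=(0,)):
    """Return True when the exit code is not one of the expected codes."""
    return exit_code not in expected_codes


if __name__ == "__main__":
    for code in (UWSGI_NEED_APP_EXIT_CODE, 0):
        action = "restart" if should_restart(code) else "leave stopped"
        print(f"exit code {code}: {action}")
```

Under this model, a crash that surfaces as exit code 0 is indistinguishable from a clean shutdown, so the application is never restarted.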
Root Cause Analysis:
I dug into the code a little and I think I found the cause of this issue:
I found two patches that address this area in the code:
From 2014:
Issue: #622
PR for fix: f70c070
This is the code that should handle the issue, in `core/uwsgi.c`, in the `uwsgi_init_all_apps()` function:
From 2016:
Issue: #1397
PR for fix: 65a8d67
This is the code that should handle the issue, in `core/master.c`, in the `master_loop()` function:
We can see that both patches kill the process by calling the `kill_them_all()` function. One calls `kill_them_all()` directly with 0, and the other sends SIGINT (2), which is handled in the code and calls `kill_them_all()`, in `core/master.c`, in the `master_loop()` function:
As we can see in the 2016 issue (#1397), the complaint is that the process exits with code 0 instead of 22.
That is what the 2014 PR (f70c070) caused; after the newer 2016 PR (65a8d67) was merged, it was fixed and the process now exits with code 22.
BUT the two PRs live together in the code.
And I think they race each other to kill the process (that's what causes the flakiness).
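The suspected race can be illustrated with a toy model in Python. This is a sketch of the hypothesis only, not uWSGI's control flow: it assumes that whichever kill path runs first decides the exit code, and the names and the probability used below are made up for illustration.

```python
import random

# Toy model of the hypothesis above, not uWSGI's actual control flow.
# Assumption: whichever kill path runs first decides the exit code.
SIGINT_PATH = "sigint_2014"       # 2014 patch: SIGINT handler calls kill_them_all()
MASTER_LOOP_PATH = "master_2016"  # 2016 patch: master_loop() calls kill_them_all() with 0

EXIT_CODE_BY_FIRST_PATH = {SIGINT_PATH: 0, MASTER_LOOP_PATH: 22}


def exit_code(first_path):
    """Exit code of the instance, given which kill path wins the race."""
    return EXIT_CODE_BY_FIRST_PATH[first_path]


def attempts_until_zero(rng, sigint_wins_probability, max_attempts=10_000):
    """Count restarts until the SIGINT path wins once (exit code 0)."""
    for attempt in range(1, max_attempts + 1):
        if rng.random() < sigint_wins_probability:
            winner = SIGINT_PATH
        else:
            winner = MASTER_LOOP_PATH
        if exit_code(winner) == 0:
            return attempt
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    # The probability is invented; the real timing depends on the system.
    print(attempts_until_zero(rng, sigint_wins_probability=0.01))
```

If the SIGINT path wins only rarely, the number of successful restarts before the first exit code 0 varies widely from run to run, which matches the flakiness described above.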
Reproduce:
`main.py` (this should crash on initialization):
Run it with uwsgi, using these flags:
lazy-apps = true
need-app = true
Here's my full `uwsgi.ini`:
The script (`test.sh`):
Add logging to the `kill_them_all()` function in `core/uwsgi.c` in order to debug the calls:
Add logging to the `uwsgi_init_all_apps()` function in `core/uwsgi.c` to debug when this code gets executed:
Compile uWSGI.
Run the script `./test.sh`
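A restart loop of this kind can be sketched in Python as follows. This is not the original `test.sh`: the function name, the one-second delay, and the uwsgi command line in the comment are assumptions chosen for illustration.

```python
import subprocess
import time

EXPECTED_EXIT_CODE = 22  # exit code expected when need-app fails


def run_until_unexpected(command, expected=EXPECTED_EXIT_CODE,
                         max_attempts=None, delay=0.0):
    """Re-run command until it exits with a code other than expected.

    Returns (attempts, last_exit_code). Stops early after max_attempts
    runs when that limit is given.
    """
    attempts = 0
    while True:
        attempts += 1
        code = subprocess.run(command).returncode
        print(f"attempt {attempts}: exit code {code}")
        if code != expected:
            return attempts, code
        if max_attempts is not None and attempts >= max_attempts:
            return attempts, code
        time.sleep(delay)


# Hypothetical invocation (adjust the binary and ini paths to your setup):
# run_until_unexpected(["uwsgi", "--ini", "uwsgi.ini"], delay=1.0)
```

The loop stops at the first exit code other than 22, so the attempt number it reports is the restart on which the flaky exit code appeared.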
First run of the script:
In this run, the process exited with code 0 on the first attempt. We can see that `kill_them_all()` was called once, on receiving SIGINT (2) from the 2014 PR code.
Second run of the script:
Here we got to the 4th attempt. We can see that in the attempt before exiting with 0, it actually exited with 22 (as expected), and in that case there are two calls to `kill_them_all()`: one with 0 (2016 patch) and one with 2 (SIGINT from the 2014 patch).
So, the results show that there's a race condition between the two patches over killing the process.
As we can see, when the first patch (2014) kills the process, we exit with code 0, which is not the desired behavior. When the second patch (2016) kills it, we get the desired behavior: the process exits with code 22, so process managers like Supervisord can restart it in case of failure.
Solution:
Let's remove the old patch from 2014 (here I commented it out), compile, and see what happens:
I re-ran the script after compiling uWSGI with the fix, and the process exit code stayed stable at 22 for all exits, as expected.
Here we see 83965 attempts, one attempt per second, so for almost 24 hours of successive restarts of uWSGI the exit code stayed 22 for every exit: