Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: flaky exit-code when using need-app and lazy-apps flags together(+FIX) #2640

Closed
hananto1 opened this issue May 11, 2024 · 3 comments · Fixed by #2641
Closed

BUG: flaky exit-code when using need-app and lazy-apps flags together(+FIX) #2640

hananto1 opened this issue May 11, 2024 · 3 comments · Fixed by #2641

Comments

@hananto1
Copy link

hananto1 commented May 11, 2024

Description:

I created a Python application using Flask and uWSGI and it is managed by Supervisord.
Sometimes, the application crashes on initialization, because of a disconnection from another service.
In order to let Supervisord know that the application has crashed, I use the flag need-app: true.
So, when the application crashes on initialization, I expect from uWSGI to exit with exit-code 22, and from Supervisord to restart the uWSGI application until it successfully starts.
In order to achieve that I use this flag of Supervisord:

autorestart: unexpected

(Which by default tells Supervidord to restart the application whenever the exit-code is not 0)

This should work well, and actually does work most of the time.

But, the issue is that sometimes uWSGI exits with exit-code 0, even though it crashed on exception and need-app flag is set to true.
This behavior is flaky, the application can exit with 0 imidiately, or after 10,100,200,... successive restarts.

Root Cause Analysis:

I digged a little bit into the code and I think that I found the cause for this issue:

I found two patches that address this area in the code:

From 2014:
Issue: #622
PR for fix: f70c070

This is the code that should handle the issue on core/uwsgi.c in uwsgi_init_all_apps() function:

	// no app initialized and virtualhosting enabled
	if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
		if (uwsgi.need_app) {
			if (!uwsgi.lazy)
				uwsgi_log("*** no app loaded. GAME OVER ***\n");
			if (uwsgi.lazy_apps) {
				if (uwsgi.master_process) {
					if (kill(uwsgi.workers[0].pid, SIGINT)) {
						uwsgi_error("kill()");
					}
				}
			}
			exit(UWSGI_FAILED_APP_CODE);
		}
		else {
			uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
		}
	}

From 2016:
Issue: #1397
PR for fix: 65a8d67

This is the code that should handle the issue on core/master.c in master_loop() function:

// first check failed app loading in need-app mode
if (WIFEXITED(waitpid_status) && WEXITSTATUS(waitpid_status) == UWSGI_FAILED_APP_CODE) {
	if (uwsgi.lazy_apps && uwsgi.need_app) {
		uwsgi_log("OOPS ! failed loading app in worker %d (pid %d)\n", thewid, (int) diedpid);
		uwsgi_log_verbose("need-app requested, destroying the instance...\n");
		uwsgi.status.dying_for_need_app = 1;
		kill_them_all(0);
		continue;
	}
	else {
		uwsgi_log("OOPS ! failed loading app in worker %d (pid %d) :( trying again...\n", thewid, (int) diedpid);
	}
}

We can see that both patches are killing the process, by calling kill_them_all() function. One directly calls kill_them_all() with 0, and the the other by sending SIGINT (2), which is handled in the code and calls kill_them_all() in core/master.c in master_loop() function:

uwsgi_unix_signal(SIGINT, kill_them_all);

As we can see in the issue of 2016 (#1397), the complaint is that the process exits with code 0 instead of 22.
Which is what the 2014 PR (f70c070) caused, but after the newer 2016 PR (65a8d67) was merged it was fixed and now exits with code 22.
BUT, the two PRs live together in the code.
And I think that they are having a race-condition over killing of the process (thats what's causing the flakiness)

Reproduce:

  1. Create a simple python code that throws an exception in main.py (thus should crash on initialization):
raise Exception()

Run it with uwsgi, using these flags:
lazy-apps = true
need-app = true

Here's my full uwsgi.ini:

[uwsgi-app]
auto-procname = true
chdir = /home/app
module = main:wsgi_app
vacuum  = true
die-on-term = true
ignore-sigpipe = true
ignore-write-errors  = true
disable-write-exception = true
buffer-size = 65535
enable-threads = true
lazy-apps = true
worker-reload-mercy = 14400
need-app  = true
socket = /tmp/openresty-uwsgi.sock
socket-timeout = 600
safe-pidfile = /tmp/uwsgi.pid
chmod-socket = 666
  1. Create a script to run the uWSGI application in a loop and print its exit code until it exits with code 0 ( test.sh):
#!/bin/bash
MAX_RETRIES=100000
RETRY_INTERVAL=1
EXIT_CODE=0
for (( i=1; i<=$MAX_RETRIES; i++ )); do
    echo "Attempt $i/$MAX_RETRIES:"
    uwsgi --safe-pidfile /tmp/uwsgi.pid --ini /etc/uwsgi.ini:uwsgi-app --stats 127.0.0.1:1717 --stats-http
    EXIT_CODE=$?

    if [ $EXIT_CODE -eq 0 ]; then
        echo "Process exited successfully"
        break
    else
        echo "Process exited with non-zero exit code: $EXIT_CODE"

        if [ $i -lt $MAX_RETRIES ]; then
            echo "Retrying in $RETRY_INTERVAL seconds..."
            sleep $RETRY_INTERVAL
        else
            echo "Max retries reached. Exiting..."
            break
        fi
    fi
done

exit $EXIT_CODE
  1. Add a print to kill_them_all() function in core/uwsgi.c in-order to debug the calls:
// brutally destroy
void kill_them_all(int signum) {
        uwsgi_log("kill_them_all called %d \n", signum);

        if (uwsgi_instance_is_dying) return;

        uwsgi.status.brutally_destroying = 1;

        // unsubscribe if needed

        uwsgi_unsubscribe_all();

        uwsgi_log("SIGINT/SIGTERM received...killing workers...\n");

        int i;

        for (i = 1; i <= uwsgi.numproc; i++) {
                if (uwsgi.workers[i].pid > 0) {
                        uwsgi_curse(i, SIGINT);
                }
        }

        for (i = 0; i < uwsgi.mules_cnt; i++) {
                if (uwsgi.mules[i].pid > 0) {
                        uwsgi_curse_mule(i, SIGINT);
                }
        }

        uwsgi_destroy_processes();
}
  1. Also add a print to the uwsgi_init_all_apps() function in core/uwsgi.c to debug when this code gets executed:
// no app initialized and virtualhosting enabled
if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
        if (uwsgi.need_app) {
                if (!uwsgi.lazy)
                        uwsgi_log("*** no app loaded. GAME OVER ***\n");

                if (uwsgi.lazy_apps) {
                        if (uwsgi.master_process) {
                                uwsgi_log("Trying to kill twice\n");
                                if (kill(uwsgi.workers[0].pid, SIGINT)) {
                                      uwsgi_error("kill()");
                                }
                        }
                }
                exit(UWSGI_FAILED_APP_CODE);
        }
        else {
                uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
        }
}
  1. Compile uWSGI.

  2. Run the script ./test.sh

First run of the script:

Attempt 1/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49148)
spawned uWSGI worker 1 (pid: 49149, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
  File "/home/app/./main.py", line 1, in <module>
    raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2 
SIGINT/SIGTERM received...killing workers...
worker 1 buried after 1 seconds
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited successfully

In this run, on the first attempt this process exited with code 0, we can see that kill_them_all() called once by receiving SIGINT (2) by the 2014 PR code.

Second run of the script:

Attempt 3/100000:
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49341)
spawned uWSGI worker 1 (pid: 49342, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
  File "/home/app/./main.py", line 1, in <module>
    raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2 
SIGINT/SIGTERM received...killing workers...
OOPS ! failed loading app in worker 1 (pid 49342)
Tue May  7 16:05:33 2024 - need-app requested, destroying the instance...
kill_them_all called 0 
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited with non-zero exit code: 22
Retrying in 1 seconds...

Attempt 4/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 49344)
spawned uWSGI worker 1 (pid: 49345, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on 127.0.0.1:1717 fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
  File "/home/app/./main.py", line 1, in <module>
    raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
Trying to kill twice
kill_them_all called 2 
SIGINT/SIGTERM received...killing workers...
worker 1 buried after 1 seconds
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited successfully

Here we got to the 4th attempt, we can see that in the attempt before exiting with 0, it actually exited with 22 (as expected), and in that case we can see two calls for kill_them_all(), one with 0 (2016 patch) and one with 2 (SIGINT from the 2014 patch).
So, the results show that there ’s some race condition between the two patches over killing the process.
As we can see, when we kill by the first patch (2014) we exit with code 0 which is not the desired behavior, and when we kill by the second (2016) patch we get the desired behavior, the process exit with code 22, so other process managers like Supervisord could restart it in case of failure.

Solution:

Let’s remove the old patch from 2014 (here I commented it), compile, and see what happens:

// no app initialized and virtualhosting enabled
if (uwsgi_apps_cnt == 0 && uwsgi.numproc > 0 && !uwsgi.command_mode) {
        if (uwsgi.need_app) {
                if (!uwsgi.lazy)
                        uwsgi_log("*** no app loaded. GAME OVER ***\n");
                // if (uwsgi.lazy_apps) {
                        // if (uwsgi.master_process) {
                                uwsgi_log("Trying to kill twice\n");
                                //if (kill(uwsgi.workers[0].pid, SIGINT)) {
                                //      uwsgi_error("kill()");
                                //}
                        // }
                // }
                exit(UWSGI_FAILED_APP_CODE);
        }
        else {
                uwsgi_log("*** no app loaded. going in full dynamic mode ***\n");
        }
}

I re-ran the script after compiling uWSGI with the fix, and the process exit-code stayed stable on 22 for all exits as expected.
Here we see 83965 attempts, an attempt every one sec, so for almost 24 hours of successive restats of uWSGI the exit-code stayed 22 for every exit:

Attempt 83965/100000:
...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 61423)
spawned uWSGI worker 1 (pid: 61424, cores: 70)
writing pidfile to /tmp/uwsgi.pid
*** Stats server enabled on [127.0.0.1:1717](http://127.0.0.1:1717/) fd: 9 ***
writing pidfile to /tmp/uwsgi.pid
Traceback (most recent call last):
  File "/home/app/./main.py", line 1, in <module>
    raise Exception()
Exception
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. GAME OVER ***
OOPS ! failed loading app in worker 1 (pid 61424)
Thu May  9 17:26:04 2024 - need-app requested, destroying the instance...
kill_them_all called 0
SIGINT/SIGTERM received...killing workers...
goodbye to uWSGI.
VACUUM: safe pidfile removed.
VACUUM: unix socket /tmp/openresty-uwsgi2.sock removed.
Process exited with non-zero exit code: 22
@hananto1
Copy link
Author

@xrmx can you please help push this fix?

@xrmx
Copy link
Collaborator

xrmx commented May 19, 2024

With which version of uwsgi have you replicated this?

@hananto1
Copy link
Author

hananto1 commented May 19, 2024

@xrmx
I initially encountered this with version 2.0.22.
But when I compiled uWSGI locally and tested it, I pulled the latest code from the "master" branch (1-2 weeks ago).

@xrmx xrmx closed this as completed in #2641 Jun 1, 2024
xrmx added a commit that referenced this issue Jun 1, 2024
Fix for Flaky Exit Code When Using 'need-app' and 'lazy-apps' Flags Together

Fix #2640
xrmx pushed a commit to xrmx/uwsgi that referenced this issue Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants