Queue workers dying due to detecting lost DB connection, when DB is fine. #27053
Comments
Thanks for the report. Will this fix your issue: #27054?
Thanks for the quick response and fix. Unfortunately, it doesn't seem to work. My failure has that exact output in it, so it would continue to cause the worker to be killed. (Reference below of my error)
The fix I think you are aiming for is to get a more specific error string from AWS so that my failure wouldn't be detected by it. However, I've never seen that error message from Amazon Aurora, so I can't contribute to that investigation. The best bet would require some refactoring: you could check for a PDOException/QueryException and know it's connected with the database before comparing the exception text. Or use this massive string from the original PR, #25289, but I don't know if that would work.
I'm not sure I understand the fixes here. The current way of checking the exception message will catch all the failures of getaddrinfo with that message, even if it isn't from Eloquent. No matter how we change the message we're looking for, we'll still be looking for a low-level message that can appear in many code paths. In this case the exception isn't related to the database connection, but to a "userland" call to fopen(). It seems reasonable, as @iBotPeaches suggests, to have DetectsLostConnections check the exception type to verify that it's actually related to Eloquent.
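A minimal sketch of that type check, in plain PHP so it runs without the framework. The function name `causedByLostDbConnection`, the trimmed-down message list, and the sample exception messages are all assumptions made for illustration; this is not the framework's actual code. It relies on the fact that Laravel's `QueryException` extends `PDOException`, so a single `instanceof` test covers both.

```php
<?php

// Hypothetical helper: only treat an exception as a lost DB connection
// when it actually comes from the database layer.
function causedByLostDbConnection(Throwable $e): bool
{
    // Reject anything that is not a PDOException (or a subclass of it)
    // before looking at the message text at all.
    if (! $e instanceof PDOException) {
        return false;
    }

    // Trimmed-down, illustrative list; the real list is longer.
    $needles = [
        'server has gone away',
        'Lost connection',
        'Name or service not known',
    ];

    foreach ($needles as $needle) {
        if (strpos($e->getMessage(), $needle) !== false) {
            return true;
        }
    }

    return false;
}

// Illustrative messages (assumed, not copied from the report).
$webhookFailure = new ErrorException(
    'fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known'
);
$dbFailure = new PDOException(
    'SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known'
);
$dbOtherError = new PDOException('SQLSTATE[23000]: Integrity constraint violation');

var_dump(causedByLostDbConnection($webhookFailure)); // bool(false)
var_dump(causedByLostDbConnection($dbFailure));      // bool(true)
var_dump(causedByLostDbConnection($dbOtherError));   // bool(false)
```

With the type check in front, the userland `fopen()` failure no longer counts as a lost connection even though its message contains the same text. A fuller version might also walk `getPrevious()` in case the database exception has been wrapped.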
Catching up here. Unfortunately the merged fixes don't seem to work. The string (while longer) is still the exact same string that failures from The options I see to fix.
Description:
Consider a queue system built around webhooks. We cannot guarantee that the receivers' hosts are online and working all the time. We simply send a request and hope for a 200; otherwise we retry with exponential backoff. When we encounter a domain that doesn't exist or is not responding, we get
This causes our worker to stop working. This hasn't always been the case, so it was time to dig.
The key thing to notice above is "Name or service not known", which was added on Aug 22, 2018:
framework/src/Illuminate/Database/DetectsLostConnections.php
Line 37 in b12feab
Workers look for those "lost connection" database strings and stop when they find one: https://github.com/laravel/framework/blob/5.7/src/Illuminate/Queue/Worker.php#L297. However, in this case the string for an invalid domain is the same string as for a lost database connection. This is causing a large number of worker restarts, because anything detected as a DB connection failure kills the worker.
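The collision can be sketched in plain PHP. This is a simplified stand-in for message-based detection, not the framework's code: the function name `looksLikeLostConnection`, the shortened message list, and both sample exception messages are assumptions made for illustration.

```php
<?php

// Illustrative, shortened list of "lost connection" substrings.
const LOST_CONNECTION_MESSAGES = [
    'server has gone away',
    'Lost connection',
    'Name or service not known',
];

// Hypothetical helper that inspects only the message text,
// regardless of where the exception came from.
function looksLikeLostConnection(Throwable $e): bool
{
    foreach (LOST_CONNECTION_MESSAGES as $needle) {
        if (strpos($e->getMessage(), $needle) !== false) {
            return true;
        }
    }

    return false;
}

// A failed webhook call to a domain that does not resolve (assumed message).
$webhookFailure = new ErrorException(
    'fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

// A database host that does not resolve (assumed message).
$dbFailure = new PDOException(
    'SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

var_dump(looksLikeLostConnection($webhookFailure)); // bool(true), a false positive
var_dump(looksLikeLostConnection($dbFailure));      // bool(true)
```

Both exceptions match because the check never asks which layer raised them, so an unreachable webhook receiver is indistinguishable from an unreachable database host.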
I think the best course of action here is to revert this commit (f4b7494), but I'm open to suggestions, so I opted for a bug report instead of making a PR to revert that commit.
Steps To Reproduce: