Queue workers dying due to detecting lost DB connection, when DB is fine. #27053

iBotPeaches · 2019-01-03T19:28:48Z

Laravel Version: 5.7.19
PHP Version: 7.2
Database Driver & Version: MariaDB 10.2

Description:

So consider a queue system that works around webhooks. We cannot control whether the receivers' hosts are online and working all the time. We simply send a request and hope for a 200; otherwise we retry with exponential backoff. When we encounter a domain that doesn't exist or isn't responding, we get:

production.ERROR: Error creating resource: [message] fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known

This causes our worker to stop working. This hasn't always been the case, so it was time to dig.

The key thing to notice above is "Name or service not known". That string was added to the detection of lost connections on Aug 22, 2018:

framework/src/Illuminate/Database/DetectsLostConnections.php, line 37 in b12feab

'Name or service not known',

Workers look for those "lost connection" database strings and stop when they find one - https://github.com/laravel/framework/blob/5.7/src/Illuminate/Queue/Worker.php#L297. However, in this case the error string for an unresolvable domain is the same string used to detect a dropped database connection. This is causing a large number of worker restarts, because the worker treats each failed webhook lookup as a lost DB connection and shuts down.
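The false positive can be reproduced without Laravel. The following is a minimal sketch, assuming the detection amounts to a substring search over the exception message; `looksLikeLostConnection` and its one-entry needle list are hypothetical stand-ins for illustration, not the framework's code.

```php
<?php

// Hypothetical stand-in for the framework's check: a plain substring
// search over the exception message, with no look at the exception type.
function looksLikeLostConnection(Throwable $e): bool
{
    $needles = ['Name or service not known']; // the entry quoted above

    foreach ($needles as $needle) {
        if (strpos($e->getMessage(), $needle) !== false) {
            return true;
        }
    }

    return false;
}

// A webhook failure, using the message from the log line above.
$webhookFailure = new RuntimeException(
    'fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

// An unrelated failure, for contrast.
$otherFailure = new RuntimeException('HTTP 500 from receiver');

var_dump(looksLikeLostConnection($webhookFailure)); // bool(true)
var_dump(looksLikeLostConnection($otherFailure));   // bool(false)
```

The webhook failure never touched the database, yet it is classified as a lost connection because only the message text is inspected.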

I think the best course of action here is to revert this commit - f4b7494, but I'm open to suggestions, so I opted for a bug report instead of making a PR to revert that commit.

Steps To Reproduce:

1. Create a queue worker.
2. Make a web request to a fake domain.
3. Watch the worker stop.


GrahamCampbell · 2019-01-03T19:46:55Z

Thanks for the report. Will this fix your issue: #27054?

iBotPeaches · 2019-01-03T19:57:51Z

Thanks for the quick response and fix, but unfortunately it doesn't seem to work.

My failure has that exact output in it, so it would continue to cause the worker to be killed. (My error is below for reference.)

php_network_getaddresses: getaddrinfo failed: Name or service not known

The fix I think you are aiming for is to get a more specific error string from AWS so that my failure wouldn't get detected by it. However, I've never seen that error message from Amazon Aurora, so I can't contribute to that investigation.

The best bet would require some refactoring: you could check for a PDOException/QueryException, so you know the failure is connected with the database before comparing the exception text.

Or use this massive string from the original PR - #25289, but I don't know if that would work.

SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known

GrahamCampbell · 2019-01-03T23:05:30Z

5459ac1

sisve · 2019-01-04T14:46:13Z

I'm not sure I understand the fixes here. The current way of checking the exception message will catch all the failures of getaddrinfo with that message, even if it isn't from Eloquent. No matter how we change the message we're looking for, we'll still be looking for a low-level message that can appear in many code paths. In this case the exception isn't related to the database connection, but to a "userland" call to fopen().

It seems reasonable, as @iBotPeaches suggests, to have DetectsLostConnections check the exception type to verify that it's actually related to Eloquent.
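A type check of this kind could look roughly like the sketch below. It assumes the PDO extension is loaded and that database failures surface as a `PDOException` (Laravel's `QueryException` extends `PDOException`, so an `instanceof` check covers both). The helper name `isDatabaseLostConnection` is hypothetical, and this is not the change that was merged.

```php
<?php

// Hypothetical helper: only treat the message as a lost connection when
// the exception, or one it wraps, is a PDOException.
function isDatabaseLostConnection(Throwable $e): bool
{
    $fromDatabase = false;

    // Walk the chain of previous exceptions looking for a database error.
    for ($current = $e; $current !== null; $current = $current->getPrevious()) {
        if ($current instanceof PDOException) {
            $fromDatabase = true;
            break;
        }
    }

    return $fromDatabase
        && strpos($e->getMessage(), 'Name or service not known') !== false;
}

$dbFailure = new PDOException(
    'SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

$webhookFailure = new ErrorException(
    'fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

var_dump(isDatabaseLostConnection($dbFailure));      // bool(true)
var_dump(isDatabaseLostConnection($webhookFailure)); // bool(false)
```

With the type gate in place, the wording of the needle matters much less, since a userland `fopen()` failure is rejected before any string comparison happens.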

iBotPeaches · 2019-01-04T15:12:38Z

Catching up here. Unfortunately the merged fixes don't seem to work. The string (while longer) is still the exact same string that failures from fopen() provide.

The options I see to fix this:

1. Make the error string - SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known, so it's specific to the database and not random fopen errors.
2. Adapt the DetectsLostConnections check to also check the exception type, looking for a QueryException. If all exceptions that happen at the DB level trigger that, we should be good.
3. Completely remove the php_network_getaddresses: getaddrinfo failed: Name or service not known string from DetectsLostConnections. In my years of hosting at AWS, I've never once seen that error when communicating with my RDS db.
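To see why lengthening the string alone does not help unless it includes the SQLSTATE prefix, the three candidate strings from this thread can be compared against the two error messages quoted in it. This is only an illustrative sketch; the array keys (`short`, `longer`, `sqlstate`, `fopen`, `database`) are labels invented for the example.

```php
<?php

// Candidate strings discussed in the thread.
$needles = [
    'short'    => 'Name or service not known',
    'longer'   => 'php_network_getaddresses: getaddrinfo failed: Name or service not known',
    'sqlstate' => 'SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known',
];

// Error messages quoted in the thread.
$messages = [
    'fopen'    => 'fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known',
    'database' => 'SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known',
];

// Record which needle matches which message.
$matches = [];
foreach ($needles as $needleName => $needle) {
    foreach ($messages as $source => $message) {
        $matches[$needleName][$source] = strpos($message, $needle) !== false;
    }
}

foreach ($matches as $needleName => $row) {
    printf(
        "%-8s fopen=%s database=%s\n",
        $needleName,
        var_export($row['fopen'], true),
        var_export($row['database'], true)
    );
}
```

Only the `sqlstate` needle leaves the fopen() message unmatched while still matching the database message; the `short` and `longer` needles match both.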

GrahamCampbell mentioned this issue Jan 3, 2019

[5.7] Stricter error message in place of "Name or service not known" #27054

Merged

taylorotwell closed this as completed in #27054 Jan 3, 2019

iBotPeaches mentioned this issue Feb 4, 2019

[5.7] Revert "Handle AWS Connection Lost (#25295)" #27418

Merged

IanBrison mentioned this issue Apr 25, 2020

[7.x] Add PDOException's try again as a lost connection message #32544

Merged
