Queue workers dying due to detecting lost DB connection, when DB is fine. #27053

Closed
iBotPeaches opened this issue Jan 3, 2019 · 5 comments

Comments

@iBotPeaches
Contributor

  • Laravel Version: 5.7.19
  • PHP Version: 7.2
  • Database Driver & Version: MariaDB 10.2

Description:

Consider a queue system built around webhooks. We cannot guarantee that the receivers' hosts are online and working all the time. We simply send a request and hope for a 200; otherwise we retry with exponential backoff. When we encounter a domain that doesn't exist or isn't responding, we get

production.ERROR: Error creating resource: [message] fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known

This causes our worker to stop working. This hasn't always been the case, so it was time to dig.

The key thing to notice above is "Name or service not known", a string that was added to the detection of lost connections on Aug 22, 2018.

Workers look for those "lost connection" database strings in exception messages and stop when they find one - https://github.com/laravel/framework/blob/5.7/src/Illuminate/Queue/Worker.php#L297. However, in this case the error string for an invalid domain is the same string that is treated as a lost database connection. This is causing a large number of worker restarts, because a detected DB connection failure kills the worker.
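To illustrate the collision, here is a minimal sketch rather than the framework's actual code. `looksLikeLostConnection` is a hypothetical stand-in that does a plain substring match on the exception message, and its needle list is trimmed to two illustrative entries. The two exception messages are taken from this thread.

```php
<?php

// Hypothetical, simplified stand-in for a message-based lost-connection check.
// Assumption: the real check also matches substrings of the exception message.
function looksLikeLostConnection(Throwable $e): bool
{
    $needles = [
        'server has gone away',
        'php_network_getaddresses: getaddrinfo failed: Name or service not known',
    ];

    foreach ($needles as $needle) {
        if (strpos($e->getMessage(), $needle) !== false) {
            return true;
        }
    }

    return false;
}

// A webhook call to a non-existent domain (nothing to do with the database).
$webhookFailure = new ErrorException(
    'fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

// A genuine failure to resolve the database host.
$databaseFailure = new PDOException(
    'SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

var_dump(looksLikeLostConnection($webhookFailure));
var_dump(looksLikeLostConnection($databaseFailure));
```

Both calls report a match, because the check only looks at the message text and never at where the exception came from.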

I think the best course of action here is to revert this commit - f4b7494, but I'm open to suggestions, so I opted for a bug report instead of making a PR to revert that commit.

Steps To Reproduce:

  1. Create queue worker
  2. Make a web request to a fake domain.
  3. Watch worker stop
@GrahamCampbell
Member

Thanks for the report. Will this fix your issue: #27054?

@iBotPeaches
Contributor Author

Thanks for the quick response and fix, but unfortunately it doesn't seem to work.

My failure has that exact output in it, so it would continue to cause the worker to be killed. (My error is reproduced below for reference.)

php_network_getaddresses: getaddrinfo failed: Name or service not known

I think the fix you are aiming for is to match a more specific error string from AWS, so that my failure wouldn't be detected by it. However, I've never seen that error message from Amazon Aurora, so I can't contribute to that investigation.

The best bet would require some refactoring: you could check for a PDOException/QueryException first, so you know the error is connected with the database before comparing the exception text.

Or match this longer string from the original PR - #25289, but I don't know if that would work.

SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known

@GrahamCampbell
Member

5459ac1

@sisve
Contributor

sisve commented Jan 4, 2019

I'm not sure I understand the fixes here. The current way of checking the exception message will catch all getaddrinfo failures with that message, even if they don't come from Eloquent. No matter how we change the message we're looking for, we'll still be looking for a low-level message that can appear in many code paths. In this case the exception isn't related to the database connection, but to a "userland" call to fopen().

It seems reasonable, as @iBotPeaches suggests, to have DetectsLostConnections check the exception type to verify that it's actually related to Eloquent.

@iBotPeaches
Contributor Author

Catching up here. Unfortunately the merged fixes don't seem to work. The string (while longer) is still the exact same string that failures from fopen() provide.

The options I see to fix this:

  1. Make the error string SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known, so it's specific to the database and doesn't match random fopen errors.

  2. Adapt the DetectsLostConnections check to also check the exception type, looking for a query exception. If every exception that happens at the DB level is of that type, we should be good.

  3. Completely remove the php_network_getaddresses: getaddrinfo failed: Name or service not known string from DetectsLostConnections. In my years of hosting at AWS, I've never once seen that error when communicating with my RDS db.
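As a rough sketch of option 2, and not a proposed patch, the check could refuse to match unless the exception comes from the database layer. `causedByLostDatabaseConnection` is a hypothetical name, and the assumption here is that database-level failures surface as `PDOException` (Laravel's `QueryException` extends it). A real change would also need to keep the rest of the existing message list.

```php
<?php

// Hypothetical type-gated variant of the lost-connection check.
// Assumption: only PDOException (and subclasses) can signal a lost DB connection.
function causedByLostDatabaseConnection(Throwable $e): bool
{
    // Ignore anything that did not originate in the database layer,
    // such as an ErrorException raised by a userland fopen() call.
    if (! $e instanceof PDOException) {
        return false;
    }

    $needle = 'php_network_getaddresses: getaddrinfo failed: Name or service not known';

    return strpos($e->getMessage(), $needle) !== false;
}

$webhookFailure = new ErrorException(
    'fopen(): php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

$databaseFailure = new PDOException(
    'SQLSTATE[HY000] [2002] php_network_getaddresses: getaddrinfo failed: Name or service not known'
);

var_dump(causedByLostDatabaseConnection($webhookFailure));
var_dump(causedByLostDatabaseConnection($databaseFailure));
```

With the type gate in place, the fopen() failure no longer counts as a lost connection, while the PDO failure with the same text still does.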

3 participants