-
Notifications
You must be signed in to change notification settings - Fork 11.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[5.6] The new blocking pop feature for redis queues can lose jobs in case of worker failures #22939
Comments
So what is the solution? |
Just throwing ideas out there, can we use MULTI and EXEC? |
Seems like our only option would be to go full Lua script with it. |
Beside documentation, as I mentioned here, the code to fix will need to use |
Please take a look at https://redis.io/commands/rpoplpush#pattern-reliable-queue |
Ah, hmm, we may need to back this out then if we can't make it reliable. Any thoughts? If we need to use |
I personally feel like data loss on a queue is going to be critical in basically 99% of applications. |
If we change the ordering of the queues we should probably change both the blocking and the non-blocking order. I don't think that we should require people to drain their queues just to try out the new features. I don't see a huge problem with changing the order of the lists between 5.5 and 5.6. Sure, I can make up some scenarios where it will cause problems, but those involve a mix of Laravel versions pushing and popping to the queue, and not enough workers to ever drain it. And that is a totally different problem anyway. I think it's enough to document that we change the order of the redis queue in the upgrade guide. It's an implementation detail and only affects people if they have jobs in the queue when upgrading. |
Helper scripts can be provided to do the migration and reverse the orders. |
Looks like I'll just need to document this for 5.6 release I guess. |
Added this warning for now: @halaei do you intend to work on fixing the potential data loss issue or should others try to take a look at it? |
I work on it and send a PR in 2 days. |
An update;
All these three factors, together, becomes a growing problem of reported lost/paused jobs in Horizon. On an unrelated note, if we're using blocking_for=0 (or any high value), how does that work with the SIGTERM handling to cleanly shut down a worker? I imagine that the signal is sent, but it isn't processed until the BLPOP returns. However, if this takes too long time, then supervisor would have issued a SIGKILL to forcefully shut down the worker.
There are several questions about the current implementation, but there is a new PR that uses BRPOPLPUSH at #23057 that may fix this. I'm not knowledgeable enough about BRPOPLPUSH to evaluate that approach, do we have any takers that can review that PR and add some input? |
I wrote a Redis module to fix this issue and add some other values. Please check lqrm and its php driver. Here is an intro digest:
Will be great if you give it a try and/or send feedback - it might still have some bugs though. |
There is a pretty simple solution that I think will work. It's explained in the redis docs here. Basically when we push a job we also push a notification key:
The worker does a blocking pop on the notification key. Once the notification key returns an element we know a job is ready. Once a job is ready we use the existing Lua script and get all of the same atomicity guarantees.
In this example There is a chance the worker might crash between popping the notification element and popping the job onto the reserved queue. That is ok - once the worker restarts it will attempt to pop the job again and nothing important is lost. The only downside I can think of with this approach is it will take some care when upgrading. If you update the workers before the producers they will be blocking waiting for a notification key that is never set. The worker will eventually work the job once |
This will be fixed in 5.8: #27336 Thanks for everyone's help with this! |
Description:
The new redis blocking pop functionality introduced in #22284 has issues if there are problems between the BLPOP and the ZADD. This means that any failures in the worker at this point will lose the jobs that have been popped from the queue. These jobs will not be retried by another worker since they were never pushed to the reserved queue.
The problematic code is in RedisQueue::blockingPop.
framework/src/Illuminate/Queue/RedisQueue.php
Lines 225 to 241 in 63b6108
The text was updated successfully, but these errors were encountered: