[5.6] blocking pop from redis queues #22284
Is there no worry about race conditions here? I noticed you aren't using Lua scripts.
No race condition. If I recall, before using Lua scripts we had race conditions due to the […]. According to the BLPOP docs, there is an interesting behavior when having multiple workers:
Therefore, the wait duration on the blocking pop is inversely proportional to the traffic of the queue. This can have interesting applications in the future.
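The multi-worker behavior the BLPOP docs describe (when several clients block on the same key, the client that has been waiting longest is served first) can be modeled with a small illustrative Python sketch. This is a toy in-memory model, not Redis or Laravel code; the class and names are invented for the demonstration.

```python
from collections import deque

class FakeBlockingQueue:
    """Toy model of Redis BLPOP fairness: clients blocked on the
    same key are served strictly in the order they began waiting."""

    def __init__(self):
        self.items = deque()    # pending jobs
        self.waiters = deque()  # clients blocked on BLPOP, FIFO

    def blpop(self, client):
        # If a job is already queued, serve it immediately;
        # otherwise the client joins the back of the waiter queue.
        if self.items:
            return self.items.popleft()
        self.waiters.append(client)
        return None  # client is now blocked

    def lpush(self, job):
        # A newly pushed job wakes the longest-waiting client,
        # not a random one.
        if self.waiters:
            client = self.waiters.popleft()
            return (client, job)
        self.items.append(job)
        return None

q = FakeBlockingQueue()
q.blpop("worker-1")      # blocks first
q.blpop("worker-2")      # blocks second
print(q.lpush("job-A"))  # ('worker-1', 'job-A')
print(q.lpush("job-B"))  # ('worker-2', 'job-B')
```

Under this model, the busier the queue, the shorter any worker actually blocks, which is the inverse-proportionality noted above.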
See laravel/framework#22284. I actually find Horizon without `blpop` to have really bad latency in processing new jobs when idle. Using blocking pop reduces latency significantly. Personally, I think it should default to using `blpop`. Without it, you need to wait until the next worker ends its cooldown cycle for the job to get picked up. With a blocking pop, Redis instead serves a job to a worker immediately as it enters the queue, if there's an idle worker waiting for it. This _really_ speeds things up. This option is in the 5.6 branch, so I guess this'll need to wait until 5.6 is released, but I'd love it if this was also backported to 5.5 (I don't think it would break BC?)
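The latency difference described above (sleep-and-poll cooldown vs. being woken immediately by a blocking pop) can be sketched with a self-contained Python simulation. This is not Horizon's actual worker loop; the function names and timings are invented purely to illustrate the idea.

```python
import queue
import threading
import time

def polling_worker(q, interval, result):
    # Sleep-and-poll loop: check the queue, then sleep a full
    # cooldown cycle before checking again (like a non-blocking pop).
    while True:
        try:
            enqueued_at = q.get_nowait()
            result["latency"] = time.monotonic() - enqueued_at
            return
        except queue.Empty:
            time.sleep(interval)

def blocking_worker(q, result):
    # Blocking pop: the worker is woken as soon as a job is
    # pushed, with no cooldown cycle (like BLPOP).
    enqueued_at = q.get(timeout=5)
    result["latency"] = time.monotonic() - enqueued_at

for name, target, extra in [
    ("polling", polling_worker, (0.5,)),
    ("blocking", blocking_worker, ()),
]:
    q, result = queue.Queue(), {}
    t = threading.Thread(target=target, args=(q, *extra, result))
    t.start()
    time.sleep(0.1)          # the worker sits idle for a moment...
    q.put(time.monotonic())  # ...then a job arrives, stamped with its enqueue time
    t.join()
    print(f"{name}: picked up after {result['latency']:.3f}s")
```

The polling worker only notices the job when its next cooldown cycle ends, while the blocking worker reports a near-zero pickup latency.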
I'm looking at the code for this and wondering: what happens if something crashes between the call to BLPOP and the following ZADD? Will jobs be lost if that happens?
@sisve Yes, the job will get lost. That was a reason for me not to entirely remove the current non-blocking algorithm from the code.
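The crash window being discussed can be made concrete with an in-memory Python sketch (purely illustrative; the names are invented, and Python lists stand in for the Redis list and sorted set). A non-atomic pop-then-reserve loses the job if the worker dies between the two steps, whereas reserving in the same step as the removal (which is what a Lua script or a BRPOPLPUSH-style backup list gives you on real Redis) leaves the job recoverable.

```python
from collections import deque

class WorkerCrashed(Exception):
    pass

def pop_then_reserve(jobs, reserved, crash_between=False):
    # Non-atomic: step 1 removes the job (like BLPOP)...
    job = jobs.popleft()
    if crash_between:
        # ...and a crash here leaves the job in neither structure.
        raise WorkerCrashed
    # ...step 2 records it as reserved (like the following ZADD).
    reserved.append(job)
    return job

def pop_and_reserve(jobs, reserved):
    # Reserve-first ordering: the job is copied to the reserved
    # structure before it is removed, so there is no moment where
    # it exists nowhere and a crash cannot lose it.
    reserved.append(jobs[0])
    return jobs.popleft()

jobs, reserved = deque(["job-1"]), []
try:
    pop_then_reserve(jobs, reserved, crash_between=True)
except WorkerCrashed:
    pass
print(jobs, reserved)  # deque([]) [] -> the job is gone

jobs, reserved = deque(["job-1"]), []
pop_and_reserve(jobs, reserved)
print(jobs, reserved)  # deque([]) ['job-1'] -> recoverable from 'reserved'
```

On real Redis the reserve-first variant would at worst leave a duplicate to reap, never a silently lost job, which is the trade-off the Lua-based non-blocking path makes.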
I assume the implementation of the Laravel Redis queue is based on […]
The problem is now that this new feature of dropping jobs isn't documented anywhere. And that your personal use-case is now a new feature in a widely used framework. And this new feature wasn't clear when merged.
The append-only-file feature has a […]
For sure, the feature can be documented. Please also note that the default configuration does not use this feature at all. So now we can do either of the following:
I go with 1.
I'm for 2) => revert this before a release happens. Then: provide a new PR which doesn't fall short on this problem, which is IMHO an absolute deal breaker.
Your 'personal use-cases' shouldn't be forced on others. I also vote for option 2: revert this change and create a new, non-breaking PR.
I don't feel quoting 'personal use-cases' and using 'be forced' in your sentence to make a point is fair without considering the following. First, I don't know why someone would send a PR to any package unless they personally have a use-case for it. Second, caring about a personal use-case does not imply a lack of generality. A personal use-case can be a use-case for others as well. The decision about the generality of the use-case is not mine: I just send PRs, and the core developers accept them if they feel they add value to their software. In this specific case, dealing with real-time requests that should be handled as soon as possible and dropped if delayed or failed, with no retry, is general enough. Third, the current 5.5 Redis driver was written by me, fixing issues that existed in previous versions, including the one that is your concern now: jobs could get lost using the previous drivers. So it is not the first time I have sent a PR for my personal use-case. Lastly, please note that I already mentioned the drawback of the proposed optional algorithm when I first sent PR #22284, before it was accepted. I also made sure that the change is not mandatory, and that by default you are not using this new algorithm.
The problem arises not because we do weird things, but because these things aren't documented anywhere. They are not even part of the PR that introduced the feature, but of another PR that was closed due to inactivity. We should at least document the possibility of dropping jobs somewhere.

My all-seeing (and always bitter) eye forecasts a huge backlash when people have followed Twitter advice and the Laravel documentation to set up their Redis configuration, and then lose jobs. It would be even worse if the developer didn't immediately detect this, and it took a few weeks for someone to notice that jobs just "disappear". That can become a huge hassle to debug, and produce lots of bad will. At least a warning box in the documentation would give us some way of saying "hey, we told you it wasn't ready for production use".

I've opened a new issue (#22939) to bring the problem into the light and hopefully discuss a solution. It could either be a rewrite of the logic, or just documenting the behavior. The important part is that we do not silently accept this functionality.

We have many parts of the framework that have weird issues. That's often the case in software development. Like migrations that have kinks with enum columns. Or migrations and SQLite. These things could probably be solved with code, but they are at least documented in red warning boxes in the documentation.

I believe that a minimum required change is a documentation change. I believe the discussion in this PR has shown that there's at least a need to look into the problem. I am hoping that we can discuss the solution in the new issue at #22939.
So, what is the solution?
I get what you’re saying, but just because you don’t care about lost jobs doesn’t mean others don’t. That’s a bad assumption to make. I’m with sisve that the bare minimum should be to document the behaviour so there are no surprises. That would be a headache to track down for the majority of users. If the changes can be implemented without dropping jobs, isn’t that ideal?
I think this might be causing "lost" jobs for my application (which uses Horizon), in that the jobs are taken from the queue but never reserved, so they're never executed. I didn't see this issue until after writing this comment in laravel/horizon, but given the issue only impacts 5.6 and not 5.5, and there were changes made to the […]

I've read through this issue but I'm not very familiar with Redis: what's the fix for this? I see in another issue about this there's mention of:

```php
public function connect(array $config)
{
    return new RedisQueue(
        $this->redis, $config['queue'],
        Arr::get($config, 'connection', $this->connection),
        Arr::get($config, 'retry_after', 60),
        Arr::get($config, 'block_for', 0)
    );
}
```

I deployed 5.6 to production and I'm losing quite a lot of jobs. Fortunately the jobs are non-critical and can be run multiple times without side effects, so it's no big deal that I have to keep adding them to the queue until they eventually run, but if anyone can help me with the steps to resolve this that would be great :)
If you are losing "quite a lot of jobs", I think that has nothing to do with the blocking-pop feature being unreliable in maybe 0.00001% of cases. But having said that, I think there is a problem in the Horizon configuration, as you mentioned. I also think commit dbad055, which allows blocking forever, is not a good idea. It is in contradiction with job timeouts, and probably the Redis connection timeout.
@halaei You're making up numbers to prove your point, and I am getting tired of your arguments in this matter. Yes, the commit that allows blocking_for=0 may be odd. This becomes a problem where Horizon defaults to 0 for all upgraded Horizon installations that have the old config file where blocking_for is missing. That is (probably) a lot of installations. You've argued that you do not care about lost jobs for your use-cases. You've shown that you do not know how to configure Redis for reliability. We get that. But don't tell us that our problems are improbable 0.00001% cases.
@sisve |
Reopen #22040