Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5.6] blocking pop from redis queues #22284

Merged
merged 1 commit into from
Dec 4, 2017

Conversation

halaei
Copy link
Contributor

@halaei halaei commented Dec 2, 2017

Reopen #22040

@taylorotwell
Copy link
Member

Is there no worry about race conditions here? I noticed you aren't using Lua scripts.

@halaei
Copy link
Contributor Author

halaei commented Dec 4, 2017

No race condition. If I recall, before using lua scripts, we had race conditions due to the multi-watch-exec transactions for migrating jobs. Then we replaced it with lua script. That part of lua replacement is still being used in this PR.

According to blpop docs there is an interesting behavior when having multiple workers:

If multiple clients [workers] are blocked for the same key [queue], the first client to be served is the one that was waiting for more time (the first that blocked for the key).

Therefore, the wait duration on the blocking pop is inversely proportional to the traffic of the queue. This can have interesting applications in the future.
I have used blpop in fairly high traffic queues (like 100 jobs a second on peak). Using this you may even write a console based web server, with the benefit of removing framework bootstrap time.

@taylorotwell taylorotwell merged commit 63b6108 into laravel:master Dec 4, 2017
francislavoie added a commit to francislavoie/horizon that referenced this pull request Jan 17, 2018
See laravel/framework#22284

I actually find Horizon without `blpop` to have really bad latency in processing new jobs if idle. Using blocking pop reduces latency significantly. Personally, I think it should default to using `blpop`. Without it, you need to wait until the next worker ends its cooldown cycle for the job to get picked up. If it's doing a blocking pop, redis will instead serve a job to a worker immediately as it enters the queue if there's an idle worker waiting for them. This _really_ speeds things up.

This option is in the 5.6 branch, so I guess this'll need to wait until 5.6 is released, but I'd love it if this was also backported to 5.5 (I don't think it should break BC?)
@sisve
Copy link
Contributor

sisve commented Jan 26, 2018

I'm looking at the code for this and wondering; what happens if something crashes between the call to BLPOP and the following ZADD? Will jobs be lost if that happens?

@halaei
Copy link
Contributor Author

halaei commented Jan 26, 2018

@sisve Yes the job will get lost. That was a reason for me not to entirely remove the current non-blocking algorithm from the code.
For my personal use-cases, I don't care a lot about lost jobs in Redis. After all, the redis itself never guarantee 100% durability. Even with enabling append only logs, there is chance of jobs getting lost if the server crashes before saving into disk.
That said, it is still possible to improve the reliability of our code by doing the following:

  1. Reverse the direction of redis lists - Push to left and pop from right. Currently, we are pushing to the right and popping from the left.
  2. Use the brpoplpush command to have a more reliable queue as described here.

I assume the implementation of Laravel redis queue based on brpoplpush will be a bit more complicated.

@sisve
Copy link
Contributor

sisve commented Jan 27, 2018

For my personal use-cases, I don't care a lot about lost jobs in Redis.

The problem is now that this new feature of dropping jobs isn't documented anywhere. And that your personal use-case is now a new feature in a widely used framework. And this new feature wasn't clear when merged.

After all, the redis itself never guarantee 100% durability. Even with enabling append only logs, there is chance of jobs getting lost if the server crashes before saving into disk.

The append-only-file feature has a appendfsync=always setting that will fsync all the writes, basically countering your example of losing data. That single setting protects us from both process failure (redis crashes) and system failure (machine crashes) since every write is written to disk. So redis can be configured to be durable, it's just that you're (probably) using a default config that only calls fsync every second, not after every write.

@halaei
Copy link
Contributor Author

halaei commented Jan 27, 2018

For sure the feature can be documented. Please also note that the default configuration does not uses this feature at all.
About appendfsync=always, I was not sure if it means that Redis blocks and return the result/acknowledgement of operation after safely saving the log. It seems that it actually does so, which means you are right.

So now we can do either of the following:

  1. Ignore the fact that the new algorithm does not guarantee durability and use it only when data loss is not critical, or when you have recovery mechanisms implemented.
  2. Send a PR to remove the new feature.
  3. Send a PR to fix this via brpoplpush.

I go with 1.

@mfn
Copy link
Contributor

mfn commented Jan 28, 2018

I'm for 2) => revert this before a release happens

Then: provide a new PR which doens't fall short on this problem which is IMHO an absolute deal breaker.

@bkilshaw
Copy link

bkilshaw commented Jan 29, 2018

For my personal use-cases, I don't care a lot about lost jobs in Redis.

Your 'personal use-cases' shouldn't be forced on others.

I also vote for option 2, revert this change and create a new, non-breaking PR.

@halaei
Copy link
Contributor Author

halaei commented Jan 29, 2018

I don't feel quoting 'personal use-cases' and using 'be forced' in you sentence to bring a point is fair without considering the followings:

First, I don't know why someone should sends PR to any package unless they personally have a use-case for it.

Second, caring about a personal use-case does not mean the lack of generality. A personal use-case can be a use-case for others as well. Decision about the generality of the use-case is not mine. I just send PRs and the core developers accept it if they fill it adds value to their software. In this specific case, dealing with real-time requests that should be handled as soon as possible and dropped if delayed or failed with no retry is general enough.

Third, the current 5.5 redis driver is written by me fixing issues existed in previous versions including the one that is your concern now - jobs may get lost using the previous drivers. So it is not the first time I send PR for my personal use-case.

Lastly, please note that I have already mentioned the drawback of the proposed optional algorithm when I first send the PR #22284 before it gets accepted. I also made sure that the change will not be mandatory and by default you are not using this new algorithm.

@sisve
Copy link
Contributor

sisve commented Jan 29, 2018

The problem arises not because we do weird things, but these things aren't documented anywhere. They are not even part of of the PR that introduced the feature, but in another PR that was closed due to inactivity. We should at least document the possibility of dropping jobs somewhere.

My all-seeing (and always bitter) eye forecasts huge backlash when people have followed twitter advice and Laravel documentation to setup their Redis configuration, and then loses jobs. It would be even worse if the developer didn't immediately detect this, but if it takes a few weeks for someone to notice that jobs just "disappear". That can become a huge hassle to debug, and produce lots of bad-will. At least a warning box in the documentation would give us some way of saying "hey, we told you it wasn't ready for production use".

I've opened a new issue (#22939) to bring the problem into light and hopefully discuss a solution. It could either be a rewrite of the logic, or just documenting the behavior. The important part is that we do not silently accept this functionality.

We have many parts of the framework that has weird issues. That's often the case in software development. Like migrations that has kinks with enum columns. Or migrations and sqlite. These things could probably be solved with code, but they are at least documented in red warning boxes in the documentation.

I believe that a minimum required change is a documentation change. I believe the discussion in this PR has shown that there's at least a need to look into the problem. I am hoping that we can discuss the solution in the new issue at #22939.

@taylorotwell
Copy link
Member

So, what is the solution?

@bkilshaw
Copy link

I get what you’re saying but just because you don’t care about lost jobs doesn’t mean others don’t. That’s a bad assumption to make. I’m with sisve that the bare minimum should be to document the behaviour so there are no surprises. That would be a headache to track down for the majority of users.

If the changes can be implemented without dropping jobs, isn’t that ideal?

@shrink
Copy link
Contributor

shrink commented Feb 10, 2018

I think this might be causing "lost" jobs for my application (which uses Horizon) in that the jobs are taken from the queue but never reserved so they're never executed. I didn't see this issue until after writing this comment in laravel/horizon but given the issue only impacts 5.6 and not 5.5, and there were changes made to the retrieveNextJob method which is used in pop which horizon makes use of... could this be the cause?

I've read through this issue but I'm not very familiar with redis: what's the fix for this? I see in another issue about this there's mention of block_for being problematic, and it looks like Horizon uses that, albeit with a value of 0:

    public function connect(array $config)
    {
        return new RedisQueue(
            $this->redis, $config['queue'],
            Arr::get($config, 'connection', $this->connection),
            Arr::get($config, 'retry_after', 60),
            Arr::get($config, 'block_for', 0)
        );
    }

I deployed 5.6 to production and I'm losing quite a lot of jobs, fortunately the jobs are non-critical and can be ran multiple times without side effects so it's no big deal that I have to keep adding them to the queue until they eventually run, but if anyone can help me with the steps to resolve this that would be great :)

@halaei
Copy link
Contributor Author

halaei commented Feb 10, 2018

If you are loosing "quite a lot of jobs" I think that has nothing to do with blocking pop feature being unreliable in maybe 0.00001% of the time. But having said that, I think there is a problem in horizon configuration as you mentioned. I also think commit dbad055 that allows for blocking for ever is not a good idea. It is in contradiction with job timeouts, and probably Redis connection timeout.
@taylorotwell

@sisve
Copy link
Contributor

sisve commented Feb 10, 2018

@halaei You're making up numbers to prove your point, and I am getting tired of your arguments in this matter.

Yes, the commit that allows blocking_for=0 may be odd. This becomes a problem where Horizon defaults to 0 for all upgraded Horizon installations that has the old config file where blocking_for is missing. That is (probably) a lot of installations.

You've argued that you do not care about lost jobs for your use-cases. You've shown that you do not know how to configure redis for reliability. We get that. But don't tell us that our problems are improbable 0.00001% cases.

@halaei
Copy link
Contributor Author

halaei commented Feb 10, 2018

@sisve
Read https://github.com/laravel/framework/blob/5.6/CODE_OF_CONDUCT.md
You just return my commend to myself without knowing the context. @citricsquid has a issue that are not at all related to the reliability of blpop. I don't want to prove any fucking point and I am not following this anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants