Laravel 5. PDOException. QUEUE_DRIVER=database 1213 Deadlock #7046
Comments
If you're running 10 workers, then the database driver isn't right for you. I'd suggest using a driver actually designed for queuing, like beanstalkd. The database driver is there for people who want a simple, quick queuing setup for a low number of jobs. |
@GrahamCampbell I've got deadlocks with just 5 workers. Is that too much for the db driver too? |
I get deadlocks even with 3 workers running for different queues each :-/ |
You can only use 1 worker if you want to use the database driver. If you need anything more, you should probably check out a real solution that's designed for queuing, like beanstalkd, or something that supports first in, first out, like redis. Both are fine choices for queue drivers. |
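For anyone taking that advice, here is a minimal sketch of what switching the driver involves. The values are illustrative only, and the option names vary slightly across 5.x releases:

```php
// config/queue.php (excerpt) - illustrative, adjust to your setup
'connections' => [
    'redis' => [
        'driver'      => 'redis',
        'connection'  => 'default', // Redis connection from config/database.php
        'queue'       => 'default',
        'retry_after' => 90,        // called 'expire' in older 5.x releases
    ],
],

// .env
// QUEUE_DRIVER=redis   (QUEUE_CONNECTION in newer Laravel versions)
```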
@GrahamCampbell Maybe you just need to put a process ID in the jobs table, for example, and have the artisan worker read its value to avoid deadlocking? |
Hey @GrahamCampbell and @taylorotwell , I am getting the same error here. Up until recently I've been using the Beanstalkd driver. ( works like a charm really ) but I would love to take advantage of this new database driver... it's so much cleaner to manage jobs in my opinion, and easier to display what's 'in the queue' to the users if you wanted to do that. First off, I have extended the database driver to also put a 'key' on the queue jobs while pushing them in... and this has been super helpful for me. Does anyone have an idea on how we could use this in a production environment that absolutely needs multiple queue runners? Think of scaling out on this one, it has to support any number of runners that we need to keep our services running smoothly. |
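A rough sketch of that kind of extension, assuming this Laravel version exposes the buildDatabaseRecord hook on DatabaseQueue. The 'key' column and the value stored in it are invented for illustration, and registering the subclass needs a custom connector that is omitted here:

```php
use Illuminate\Queue\DatabaseQueue;

// Hypothetical subclass that stamps each queued job with an extra 'key' column
// (added to the jobs table by a separate migration, not shown).
class KeyedDatabaseQueue extends DatabaseQueue
{
    protected function buildDatabaseRecord($payload, $queue, $availableAt, $attempts = 0)
    {
        $record = parent::buildDatabaseRecord($payload, $queue, $availableAt, $attempts);

        $decoded = json_decode($payload, true);
        $record['key'] = $decoded['displayName'] ?? null; // whatever "key" makes sense for you

        return $record;
    }
}
```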
Took a few days of debugging before coming across this post; may be worth mentioning in the documentation to only use one worker with the database queue? Would have definitely saved me a lot of time. |
Also getting deadlocks with multiple database workers. Seems strange that the documentation suggests creating 8 processes within the supervisor config - yet it's recommended to only use one (or am I missing the point here?). |
Same here! Also CPU usage hits 100%. |
I ran into this as well (the documentation should be fixed). It seems that using |
You shouldn't be limited to 1 worker with database driver as @GrahamCampbell said. Don't know why he said that. Anyways, hopefully we can figure this out. I've seen it before myself when running quite a few database workers. |
The issue is in the query that migrates jobs that have been reserved too long to being available. We have a bunch of workers trying to execute that query simultaneously. We could simply ignore problems with that query and assume another worker is already doing it. I'm curious if even when you get deadlocks your queues keep processing as normal? |
@taylorotwell Databases aren't built for this sort of application, which is why we'll get these sorts of issues at scale. I think you're totally correct that the queue will still work just fine, even with these errors. They'll just get progressively worse the more workers you add, and eventually will just totally lock up. Say if you pushed a million jobs to the queue and had 1000 db workers, |
@GrahamCampbell is correct in that DBs (specifically, for me, MySQL) do not handle task queues well, due to the nature of row/table locks. When running multiple workers, ignoring the deadlock errors would help. Additionally, it would be useful to reduce the frequency at which expired reservations are checked, in a similar fashion to PHP & Laravel's session garbage collector. At the very least, the behavior should be documented. |
I agree that there should probably be some sort of “lottery” or a different way to migrate the reserved jobs that need to be made available again. |
Would definitely be interested in hacking on ideas with people if anyone is interested. A few thoughts come to mind. When using the DB driver, definitely set the … We need to rethink how jobs are migrated entirely, most likely. I would be curious to just comment out that code entirely and see if people still have problems with the DB locking up, so that we can isolate that as the actual problem. Can anyone try this? I have pushed code to the 5.2 branch (dev stability) to just swallow those deadlock exceptions. If we get an exception on the job migration query, we'll just assume another queue listener is handling it and move along. |
Here is a rough draft of what I'm thinking about changing the migration of old jobs code to:
Thoughts? I think this will avoid any deadlock problems with this query, and in addition I think we should have a small lottery so we don't need to run this on every iteration of the queue. Something like maybe a 10% chance of running it on any given iteration? |
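To make the idea concrete, something along these lines. This is a sketch only, not the code that actually shipped; the column names follow the jobs table shown later in the thread, and the 10% odds are just the number floated above:

```php
use Illuminate\Database\QueryException;

// Occasionally release jobs whose reservation has expired, and treat a deadlock
// on that query as "another worker is already handling it".
protected function migrateExpiredJobs($queue)
{
    if (random_int(1, 10) !== 1) {  // ~10% lottery per iteration
        return;
    }

    try {
        $expired = time() - $this->retryAfter;  // $this->expire in older releases

        $this->database->table($this->table)
            ->where('queue', $queue)
            ->whereNotNull('reserved_at')
            ->where('reserved_at', '<=', $expired)
            ->update([
                'reserved_at' => null,
                'attempts'    => $this->database->raw('attempts + 1'),
            ]);
    } catch (QueryException $e) {
        // Deadlock (MySQL error 1213): assume another listener won the race and move on.
    }
}
```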
So far I have processed 3,500 jobs with 8 workers running at the same time and have had no deadlock issues with this logic. |
I've pushed my logic to the 5.2 branch if anyone wants to use "dev" stability and try it out. Would really appreciate feedback! :) |
Currently running just short of 400k jobs with 11 daemon workers using the database driver. 40k complete so far and only 3 exceptions thrown with the message: |
Those all completed and only ever received 4 of the exceptions that i mentioned above. |
Nice. Do you have the full stack trace on those exceptions? How many workers were you running? |
Running 11 workers |
Interesting. So it was actually on deleting a failed job. Do you know why the jobs failed? Were you randomly causing jobs to fail just to test it? |
Duplicate primary key entry when inserting; genuine error in the job. |
I'm having the same issue.
I do run on a cluster with 3 MySQL servers, if that has anything to do with it. Dan |
@daniel-farina does the workaround I posted solve your problem? |
@micmania1 I tried your workaround and it seems to work, thanks! |
@micmania1 Did you make the changes in the vendor folder? |
@micmania1 if that works I would love a pull request to fix deadlock issues. I have to process 23K jobs. |
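The workaround itself isn't quoted in this thread, but a later comment describes it as a transaction retry around the reserved-job delete. A sketch of that idea follows; the retry count and the deadlock check are arbitrary choices here, not the original code:

```php
// Retry deleteReserved() a few times when MySQL reports a deadlock (error 1213).
public function deleteReserved($queue, $id)
{
    $attempts = 0;

    while (true) {
        try {
            $this->database->transaction(function () use ($id) {
                if ($this->database->table($this->table)->lockForUpdate()->find($id)) {
                    $this->database->table($this->table)->where('id', $id)->delete();
                }
            });

            return;
        } catch (\Illuminate\Database\QueryException $e) {
            $deadlock = isset($e->errorInfo[1]) && (int) $e->errorInfo[1] === 1213;

            if (! $deadlock || ++$attempts >= 5) {
                throw $e; // give up after a few tries, or on any non-deadlock error
            }
        }
    }
}
```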
Hi guys, After running
So apparently the delete transaction got the lock on the index. I am wondering why the select itself is needed there. So I changed this:

```php
if ($this->database->table($this->table)->lockForUpdate()->find($id)) {
    $this->database->table($this->table)->where('id', $id)->delete();
}
```

to this:

```php
$this->database->table($this->table)->where('id', $id)->delete();
```

and I am testing it on the dev system. UPDATE: I am getting a different kind of deadlock now:
So two concurrent job selections are locked. One tries to update the selected job while the other one blocks it.
Deadlock still happens obviously but workers are able to recover. |
For those following this issue: @ph4r05 is a fantastic provider of debugging information. This is the first time someone has checked where the deadlock actually occurs. In his case it's a gap lock on the jobs_queue_reserved_at_index index. This index was removed from the default 5.5 jobs table in acd933f. So, if anyone still has deadlock issues, check whether your jobs table still has that index. I should probably say something like "make sure you know your database before randomly removing indexes" or "take backups before drinking" or something. This is an internet forum, I'm talking about changes in important system tables. You know the deal. ;) |
Thanks to @sisve for the discussion. I didn't notice the index was removed, as I started the project on an older Laravel and the database migrations did not remove the index during the Laravel upgrade (as far as I am aware) - this may be the same case for others in this discussion. After removing the index there are no more deadlocks. @taylorotwell @20TRIES on the discussion about lock contention in DB workers - for research and testing purposes I reimplemented the pessimistic locking of the database queue as optimistic locking. Take a look at my commit if interested: ph4r05@60fcddd There are no explicit transactions involved and no DB-level locking. DB workers compete for jobs on their … I don't think this is production-ready code (don't use without proper testing) and some benchmarking of pessimistic vs optimistic locking would be nice (I can get to that in a few days). I can submit a PR later (once we have benchmarks) if you deem it interesting. |
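For readers who don't want to dig through the commit, the core of an optimistic-locking claim looks roughly like this. It is a sketch under the assumption of an extra version column on the jobs table; ph4r05's actual implementation may differ:

```php
// A worker "wins" a job only if its conditional UPDATE changes a row;
// zero affected rows means another worker claimed it first, so the caller retries.
public function claimNextJob($queue)
{
    $job = $this->database->table($this->table)
        ->where('queue', $queue)
        ->whereNull('reserved_at')
        ->where('available_at', '<=', time())
        ->orderBy('id')
        ->first();

    if (! $job) {
        return null;
    }

    $claimed = $this->database->table($this->table)
        ->where('id', $job->id)
        ->where('version', $job->version)   // nobody touched it since we read it
        ->update([
            'reserved_at' => time(),
            'attempts'    => $job->attempts + 1,
            'version'     => $job->version + 1,
        ]);

    return $claimed ? $job : null;
}
```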
This is only a theory, but this is what I think is causing the deadlock on delete.
This would mean that a job could get processed twice. I am running this with the assumption of a transaction retry (i.e. the workaround I provided). Whilst the delete retry solved my problem locally, when we began testing in what will be our prod environment we ran into further issues. We're using a Galera cluster with a master-master DB setup. Locking doesn't work the same way, and the default queue simply can't be run concurrently with that setup. I've solved this in our project by separating the queue into multiple streams, where each worker is responsible for a single stream. When a job is saved it will pick one of the available streams at random. You could potentially do this by using different queues too (one per environment), but I never tried that. This introduces other potential issues (mostly operational), but it essentially avoids locking/conflicts completely, which is our biggest concern, and it provided good speed and some scalability. If you do have the luxury of moving to a different queue system, you should do that. |
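A sketch of that stream idea; the stream count and queue names are invented, and each worker is then pinned to exactly one stream:

```php
// Spread jobs across N "streams" so concurrent workers never fight over the same rows.
const STREAM_COUNT = 4;

function dispatchToRandomStream($job)
{
    $stream = random_int(1, STREAM_COUNT);

    dispatch($job)->onQueue("stream-{$stream}");
}

// One worker per stream, e.g.:
//   php artisan queue:work database --queue=stream-1
//   php artisan queue:work database --queue=stream-2
//   ...
```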
There was a bug in the database queue transaction handling (no rollback) which caused the freezing of the workers. Take a look at pull request #22394 (maybe check if you have the fix in your version). The second worker should not pick the job as it is being reserved. If a job is being processed twice, reconsider tuning the retryAfter parameter. I am currently using the queue with the PR merged and it works perfectly for me. |
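For reference, the retryAfter value mentioned above maps to retry_after in the database connection block of config/queue.php. The rule of thumb is to keep it comfortably larger than your longest-running job so the expiry sweep never re-releases a job that is still being processed:

```php
// config/queue.php (excerpt)
'database' => [
    'driver'      => 'database',
    'table'       => 'jobs',
    'queue'       => 'default',
    'retry_after' => 90, // seconds; must exceed the longest job runtime
],
```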
@micmania1 Use |
@sisve @micmania1 I have a new example of deadlocks on the jobs table. With #22394 and #22433 merged, the DatabaseQueue workers will recover from the deadlock (no freeze), but some jobs will get processed more than once. Here is why and how: I created a benchmark project for Laravel queueing, for testing various job processing modifications. The project is a fresh Laravel 5.5 (created 1 day ago with composer). I generate a synthetic load and measure correctness and performance. The scenario:
```sql
CREATE TABLE `jobs` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `queue` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `payload` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
  `attempts` tinyint(3) unsigned NOT NULL,
  `reserved_at` int(10) unsigned DEFAULT NULL,
  `available_at` int(10) unsigned NOT NULL,
  `created_at` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `jobs_queue_index` (`queue`)
) ENGINE=InnoDB AUTO_INCREMENT=353855 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
```

Deadlock:
Both transactions are locking the same record. The problem is that the delete transaction is the one that gets rolled back. The job was already processed but not removed from the database, so it will be picked up again after it expires. This can happen multiple times. It would be preferable for transaction 1 (the select) to be rolled back instead of the delete, but I am not aware that this is possible to set. The funny thing is that the following hack solves the problem:

```php
public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        $this->database->table($this->table)->where('id', $id)->delete();
    }, 5);
}
```

The worker really insists on deleting that job. With the magic constant 5 I had no deadlocks anymore. Another way out is adding a new flag field, delete_mark:

```php
public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        $this->database->table($this->table)->where('id', $id)->update(['delete_mark' => 1]);
    });

    $this->database->table($this->table)->where('id', $id)->delete();
}
```

If a worker picks a job with delete_mark set, it can just skip it. Another solution is optimistic locking, which I am testing right now - no deadlocks. I will publish a blog post with benchmarks soon, but the performance is not very shiny with the first implementation (slower than pessimistic, 100 jobs per second). Another solution is removing the index altogether (slow, 90 jobs per second). If … Just for comparison, … |
I'm looking at the innodb status you've provided, but I do not see the actual culprit.
Your status does not tell us who has already acquired a conflicting lock on jobs_queue_index. These locks are not gap locks, so it's not the same problem as with the jobs_queue_reserved_at_index. Do you perhaps have an open connection from your desktop where you've accidentally started a long-running transaction and caused some locks? |
@sisve thanks for your time! If you want to take a look, here is the complete report: https://github.com/ph4r05/laravel-queueing-benchmark/blob/master/.github/deadlock07.txt I think the index is held by select queries.

```sql
set global innodb_deadlock_detect=0;
set global innodb_lock_wait_timeout=1000;
set innodb_lock_wait_timeout=1000;
```
|
We have two different transactions going on.
I'll be guessing more than usual from here on. It seems like the select that fetches the jobs issues locks on both the jobs_queue_index and the primary key (guessing that it does it in that order), while the select that exists in deleteReserved only locks the primary key. This allows the fetch-select to grab half (the index) of the locks it requires and block waiting for the primary key lock, while the deletion transaction has grabbed the primary key lock and is now blocking on the index lock. So, while I am still guessing wildly, can we trick the database into issuing index locks? Can we modify the locking in deleteReserved so it's based on the queue index too? All this is extremely dependent on the table structure. Any other indexes would probably also add to the chaos and madness. Mmm, madness... |
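If the suggestion above is about constraining the delete on the queue column as well, a sketch of that experiment might look like this. Whether it actually changes InnoDB's lock acquisition order is untested, and getQueue() is assumed to resolve the default queue name as elsewhere in the driver:

```php
// Sketch: make the delete touch jobs_queue_index the same way the fetching
// select does, in the hope of acquiring locks in a consistent order.
public function deleteReserved($queue, $id)
{
    $this->database->table($this->table)
        ->where('queue', $this->getQueue($queue))
        ->where('id', $id)
        ->delete();
}
```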
I made some benchmarks on Amazon c4.large of the Laravel queueing methods (redis, beanstalkd, database): I also implemented an optimistic locking strategy for the database driver (no deadlocks); blog post + repo:
Hey guys, I've managed to test running 500,000 jobs with 15 workers without any deadlocks with the PR (#31287). Thoughts please :) |
I'm experiencing similar deadlock behavior with Laravel 8 + MySQL 8 with job batching. From what I can tell, it seems the DatabaseBatchRepository could benefit from the same fix applied in @themsaid's PR (#31287), due to |
Laravel 8 with MySQL 5.7. 100 jobs with 8 workers and I still run into the same error. However, I noticed that the job/worker ratio is crucial here. When I had 10 jobs on average with 8 workers, I had more deadlock errors. Thought it might help. |
Hi everyone, in my project I have two servers with Laravel 10.48.17, MariaDB, and supervisor, and there is a problem: deadlocks on deleting jobs happen only on the second server. The only difference between their queue configurations is the number of workers. I already experimented with the number of workers, but that didn't work; maybe I should test with only 1 worker, or upgrade the database CPU and memory) |
I use QUEUE_DRIVER=database
When I run more than 10 workers, I get the following error: