
Laravel 5. PDOException. QUEUE_DRIVER=database 1213 Deadlock #7046

Closed
easmith opened this issue Jan 19, 2015 · 72 comments

@easmith

easmith commented Jan 19, 2015

I use QUEUE_DRIVER=database
When I run more than 10 workers, I get the following error:

'PDOException' with message 'SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction
[2015-01-19 16:45:00] production.ERROR: exception 'PDOException' with message 'SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction' in /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php:380
Stack trace:
#0 /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php(380): PDOStatement->execute(Array)
#1 /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php(606): Illuminate\Database\Connection->Illuminate\Database\{closure}(Object(Illuminate\Database\MySqlConnection), 'update `jobs` s...', Array)
#2 /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php(570): Illuminate\Database\Connection->runQueryCallback('update `jobs` s...', Array, Object(Closure))
#3 /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php(383): Illuminate\Database\Connection->run('update `jobs` s...', Array, Object(Closure))
#4 /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php(328): Illuminate\Database\Connection->affectingStatement('update `jobs` s...', Array)
#5 /home/www/vendor/laravel/framework/src/Illuminate/Database/Query/Builder.php(1747): Illuminate\Database\Connection->update('update `jobs` s...', Array)
#6 /home/www/vendor/laravel/framework/src/Illuminate/Queue/DatabaseQueue.php(181): Illuminate\Database\Query\Builder->update(Array)
#7 /home/www/vendor/laravel/framework/src/Illuminate/Queue/DatabaseQueue.php(146): Illuminate\Queue\DatabaseQueue->releaseJobsThatHaveBeenReservedTooLong('queuename1')
#8 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(180): Illuminate\Queue\DatabaseQueue->pop('queuename1')
#9 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(150): Illuminate\Queue\Worker->getNextJob(Object(Illuminate\Queue\DatabaseQueue), 'queuename1')
#10 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(113): Illuminate\Queue\Worker->pop(NULL, 'queuename1', '0', '3', '3')
#11 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(85): Illuminate\Queue\Worker->runNextJobForDaemon(NULL, 'queuename1', '0', '3', '3')
#12 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(101): Illuminate\Queue\Worker->daemon(NULL, 'queuename1', '0', '256', '3', '3')
#13 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(67): Illuminate\Queue\Console\WorkCommand->runWorker(NULL, 'queuename1', '0', '256', true)
#14 [internal function]: Illuminate\Queue\Console\WorkCommand->fire()
#15 /home/www/vendor/laravel/framework/src/Illuminate/Container/Container.php(523): call_user_func_array(Array, Array)
#16 /home/www/vendor/laravel/framework/src/Illuminate/Console/Command.php(114): Illuminate\Container\Container->call(Array)
#17 /home/www/vendor/symfony/console/Symfony/Component/Console/Command/Command.php(253): Illuminate\Console\Command->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#18 /home/www/vendor/laravel/framework/src/Illuminate/Console/Command.php(100): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#19 /home/www/vendor/symfony/console/Symfony/Component/Console/Application.php(874): Illuminate\Console\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#20 /home/www/vendor/symfony/console/Symfony/Component/Console/Application.php(195): Symfony\Component\Console\Application->doRunCommand(Object(Illuminate\Queue\Console\WorkCommand), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#21 /home/www/vendor/symfony/console/Symfony/Component/Console/Application.php(126): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#22 /home/www/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(91): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#23 /home/www/artisan(34): Illuminate\Foundation\Console\Kernel->handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#24 {main}

Next exception 'Illuminate\Database\QueryException' with message 'SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction (SQL: update `jobs` set `reserved` = 0, `reserved_at` = , `attempts` = attempts + 1 where `queue` = queuename1 and `reserved` = 1 and `reserved_at` <= 1421664300)' in /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php:614
Stack trace:
#0 /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php(570): Illuminate\Database\Connection->runQueryCallback('update `jobs` s...', Array, Object(Closure))
#1 /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php(383): Illuminate\Database\Connection->run('update `jobs` s...', Array, Object(Closure))
#2 /home/www/vendor/laravel/framework/src/Illuminate/Database/Connection.php(328): Illuminate\Database\Connection->affectingStatement('update `jobs` s...', Array)
#3 /home/www/vendor/laravel/framework/src/Illuminate/Database/Query/Builder.php(1747): Illuminate\Database\Connection->update('update `jobs` s...', Array)
#4 /home/www/vendor/laravel/framework/src/Illuminate/Queue/DatabaseQueue.php(181): Illuminate\Database\Query\Builder->update(Array)
#5 /home/www/vendor/laravel/framework/src/Illuminate/Queue/DatabaseQueue.php(146): Illuminate\Queue\DatabaseQueue->releaseJobsThatHaveBeenReservedTooLong('queuename1')
#6 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(180): Illuminate\Queue\DatabaseQueue->pop('queuename1')
#7 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(150): Illuminate\Queue\Worker->getNextJob(Object(Illuminate\Queue\DatabaseQueue), 'queuename1')
#8 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(113): Illuminate\Queue\Worker->pop(NULL, 'queuename1', '0', '3', '3')
#9 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(85): Illuminate\Queue\Worker->runNextJobForDaemon(NULL, 'queuename1', '0', '3', '3')
#10 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(101): Illuminate\Queue\Worker->daemon(NULL, 'queuename1', '0', '256', '3', '3')
#11 /home/www/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(67): Illuminate\Queue\Console\WorkCommand->runWorker(NULL, 'queuename1', '0', '256', true)
#12 [internal function]: Illuminate\Queue\Console\WorkCommand->fire()
#13 /home/www/vendor/laravel/framework/src/Illuminate/Container/Container.php(523): call_user_func_array(Array, Array)
#14 /home/www/vendor/laravel/framework/src/Illuminate/Console/Command.php(114): Illuminate\Container\Container->call(Array)
#15 /home/www/vendor/symfony/console/Symfony/Component/Console/Command/Command.php(253): Illuminate\Console\Command->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#16 /home/www/vendor/laravel/framework/src/Illuminate/Console/Command.php(100): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#17 /home/www/vendor/symfony/console/Symfony/Component/Console/Application.php(874): Illuminate\Console\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#18 /home/www/vendor/symfony/console/Symfony/Component/Console/Application.php(195): Symfony\Component\Console\Application->doRunCommand(Object(Illuminate\Queue\Console\WorkCommand), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#19 /home/www/vendor/symfony/console/Symfony/Component/Console/Application.php(126): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#20 /home/www/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(91): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#21 /home/www/artisan(34): Illuminate\Foundation\Console\Kernel->handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#22 {main}  
@easmith easmith changed the title Laravel 5. PDOException. QUEUE_DRIVER=database PDO::MYSQL_ATTR_USE_BUFFERED_QUERY Laravel 5. PDOException. QUEUE_DRIVER=database Jan 19, 2015
@easmith easmith changed the title Laravel 5. PDOException. QUEUE_DRIVER=database Laravel 5. PDOException. QUEUE_DRIVER=database 1213 Deadlock Jan 19, 2015
@GrahamCampbell
Member

If you're running 10 workers, then the database driver isn't right for you. I'd suggest using a driver actually designed for queuing, like beanstalkd. The database driver is there for people who want a simple, quick queuing setup for a low number of jobs.

@alexglue

@GrahamCampbell I've got a deadlock with just 5 workers. Is that too much for the db driver too?

@bequadrat

I get deadlocks even with 3 workers running for different queues each :-/

@GrahamCampbell
Member

I get deadlocks even with 3 workers running for different queues each :-/

You can only use 1 worker if you want to use the database driver. If you need anything more, you should probably check out a real solution that's designed for queuing, like beanstalkd, or something that supports first in, first out, like redis. Both are fine choices for queue drivers.

@alexglue

@GrahamCampbell Maybe you just need to put a process ID in the jobs table, for example, and have artisan:worker read its value to avoid deadlocking?

@ryanhungate

Hey @GrahamCampbell and @taylorotwell, I am getting the same error here. Up until recently I've been using the Beanstalkd driver (works like a charm, really), but I would love to take advantage of this new database driver... it's so much cleaner to manage jobs in my opinion, and easier to display what's 'in the queue' to users if you wanted to do that.

First off, I have extended the database driver to also put a 'key' on the queue jobs while pushing them in... and this has been super helpful for me. Does anyone have an idea of how we could use this in a production environment that absolutely needs multiple queue runners? Think of scaling out on this one: it has to support any number of runners that we need to keep our services running smoothly.

@20TRIES
Contributor

20TRIES commented Apr 7, 2016

Took a few days of attempted debugging before coming across this post; it may be worth mentioning in the documentation to only use one worker with the database queue? It would definitely have saved me a lot of time.

@bbthorntz

Also getting deadlocks with multiple database workers. Seems strange that the documentation suggests creating 8 processes within the supervisor config - yet it's recommended to only use one (or am I missing the point here?).

@hayesrobin

Same here! Also, CPU usage hits 100%.

@rabbitfang

I ran into this as well (the documentation should be fixed). It seems that using queue:listen instead of queue:work --daemon removes the issues caused by the deadlocks. They still occur, but they don't cause any problems for me, as the listener automatically triggers a retry. I am running 8 queue listeners.

@taylorotwell taylorotwell reopened this May 17, 2016
@taylorotwell
Member

You shouldn't be limited to 1 worker with the database driver, despite what @GrahamCampbell said. I don't know why he said that.

Anyways, hopefully we can figure this out. I've seen it before myself when running quite a few database workers.

@taylorotwell
Member

The issue is in the query that migrates jobs that have been reserved too long back to being available. We have a bunch of workers trying to execute that query simultaneously. We could simply ignore problems with that query and assume another worker is already doing it. I'm curious whether, even when you get deadlocks, your queues keep processing as normal?

@GrahamCampbell
Member

@taylorotwell Databases aren't built for this sort of application, which is why we'll get these sorts of issues at scale. I think you're totally correct that the queue will still work just fine, even with these errors, though. They'll just get progressively worse the more workers you add, and eventually it will just totally lock up, say if you pushed a million jobs to the queue and had 1000 DB workers.

@rabbitfang

@GrahamCampbell is correct in that DBs (specifically for me, MySQL) do not handle task queues well, due to the nature of row/table locks. When running queue:listen (backed by supervisord), Laravel was able to continue processing the queue, even with the deadlocks. When using queue:work --daemon, the entire DB locked up; stopping the workers was not enough to clear the deadlocks in the DB, and a full MySQL restart was needed. I'm not sure why queue:work --daemon locked up the DB, but it might have had to do with the speed at which it was able to process tasks.

Ignoring the deadlock errors would help. Additionally, it would be useful to reduce the frequency at which expired reservations are checked, in a similar fashion to PHP and Laravel's session garbage collector. At the very least, the behavior should be documented.

@taylorotwell
Member

I agree that there should probably be some sort of "lottery" or a different way to migrate the reserved jobs that need to be made available again.


@taylorotwell
Member

Would definitely be interested in hacking on ideas with people if anyone is up for it. A few thoughts come to mind. When using the DB driver, definitely set the --sleep option to something reasonable so it's not just pinging your database constantly, especially if you don't have a ton of jobs.

We need to rethink how jobs are migrated entirely most likely. I would be curious to just comment out that code entirely and see if people still have problems with the DB locking up so that we can isolate that as the actual problem. Can anyone try this?

I have pushed code to the 5.2 branch (dev stability) to just swallow those deadlock exceptions. If we get an exception on the job migration query, we'll just assume another queue listener is handling it and move along.
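In essence it is something like the following sketch (simplified, not the exact commit; the query body is abbreviated):

protected function releaseJobsThatHaveBeenReservedTooLong($queue)
{
    try {
        // Release jobs whose reservation has expired and bump their attempts.
        $expired = time() - $this->expire;

        $this->database->table($this->table)
            ->where('queue', $this->getQueue($queue))
            ->where('reserved', 1)
            ->where('reserved_at', '<=', $expired)
            ->update([
                'reserved' => 0,
                'reserved_at' => null,
                'attempts' => new \Illuminate\Database\Query\Expression('attempts + 1'),
            ]);
    } catch (\Illuminate\Database\QueryException $e) {
        // A deadlock (SQLSTATE 40001 / MySQL error 1213) here just means
        // another worker is already migrating the expired jobs, so skip it.
    }
}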

@taylorotwell
Member

Here is a rough draft of what I'm thinking of changing the old-job migration code to:

            $this->database->beginTransaction();

            $ready = $this->database->table($this->table)
                        ->lockForUpdate()
                        ->where('queue', $this->getQueue($queue))
                        ->where('reserved', 1)
                        ->where('reserved_at', '<=', $expired)
                        ->get();

            $ids = \Illuminate\Support\Collection::make($ready)->map(function ($row) {
                return $row->id;
            })->all();

            $this->database->table($this->table)
                ->whereIn('id', $ids)
                ->update([
                    'reserved' => 0,
                    'reserved_at' => null,
                    'attempts' => new Expression('attempts + 1'),
                ]);

            $this->database->commit();

Thoughts? I think this will avoid any deadlock problems with this query, and in addition I think we should have a small lottery so we don't need to run this each time the queue is polled. Something like maybe a 10% chance of running it on any given iteration?
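A sketch of the lottery part, with a hypothetical helper and the odds hard-coded for illustration:

// Only run the expired-job migration on roughly 10% of pop() calls,
// similar to the session garbage-collection lottery.
protected function migrationLotteryHits()
{
    return mt_rand(1, 100) <= 10;
}

public function pop($queue = null)
{
    if ($this->migrationLotteryHits()) {
        $this->releaseJobsThatHaveBeenReservedTooLong($queue);
    }

    // ... then select and reserve the next available job as usual ...
}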

@taylorotwell
Member

So far I have processed 3,500 jobs with 8 workers running at the same time and have had no deadlock issues with this logic.

@taylorotwell
Member

I've pushed my logic to the 5.2 branch if anyone wants to use "dev" stability and try it out. Would really appreciate feedback! :)

@20TRIES
Contributor

20TRIES commented May 18, 2016

Currently running just short of 400k jobs with 11 daemon workers using the database driver. 40k complete so far and only 3 exceptions thrown with the message:

SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction (SQL: delete from jobs where id = 30990)

@20TRIES
Contributor

20TRIES commented May 18, 2016

Those all completed, and I only ever received 4 of the exceptions that I mentioned above.

@taylorotwell
Member

Nice. Do you have the full stack trace on those exceptions? How many workers were you running?


@20TRIES
Contributor

20TRIES commented May 18, 2016

Failed Queue Job: App\Jobs\CreateOrder SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction (SQL: delete from jobs where id = 30990)
vendor/laravel/framework/src/Illuminate/Support/Facades/Facade.php:224 Illuminate\Support\Facades\Facade::__callStatic
app/Listeners/ReportFailedQueueJob.php:61 App\Listeners\ReportFailedQueueJob::handle
[internal] call_user_func_array
vendor/laravel/framework/src/Illuminate/Events/Dispatcher.php:347 Illuminate\Events\Dispatcher::Illuminate\Events{closure}
[internal] call_user_func_array
vendor/laravel/framework/src/Illuminate/Events/Dispatcher.php:221 Illuminate\Events\Dispatcher::fire
app/Libraries/CustomLaravelComponents/Queue/Worker.php:100 App\Libraries\CustomLaravelComponents\Queue\Worker::raiseFailedJobEvent
app/Libraries/CustomLaravelComponents/Queue/Worker.php:82 App\Libraries\CustomLaravelComponents\Queue\Worker::logFailedJob
app/Libraries/CustomLaravelComponents/Queue/Worker.php:36 App\Libraries\CustomLaravelComponents\Queue\Worker::process
vendor/laravel/framework/src/Illuminate/Queue/Worker.php:155 Illuminate\Queue\Worker::pop
vendor/laravel/framework/src/Illuminate/Queue/Worker.php:111 Illuminate\Queue\Worker::runNextJobForDaemon
vendor/laravel/framework/src/Illuminate/Queue/Worker.php:85 Illuminate\Queue\Worker::daemon
vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php:103 Illuminate\Queue\Console\WorkCommand::runWorker
vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php:71 Illuminate\Queue\Console\WorkCommand::fire
[internal] call_user_func_array
vendor/laravel/framework/src/Illuminate/Container/Container.php:507 Illuminate\Container\Container::call
vendor/laravel/framework/src/Illuminate/Console/Command.php:169 Illuminate\Console\Command::execute
vendor/symfony/console/Command/Command.php:256 Symfony\Component\Console\Command\Command::run
vendor/laravel/framework/src/Illuminate/Console/Command.php:155 Illuminate\Console\Command::run
vendor/symfony/console/Application.php:791 Symfony\Component\Console\Application::doRunCommand
vendor/symfony/console/Application.php:186 Symfony\Component\Console\Application::doRun
vendor/symfony/console/Application.php:117 Symfony\Component\Console\Application::run
vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php:107 Illuminate\Foundation\Console\Kernel::handle
artisan:35 [main]

@20TRIES
Contributor

20TRIES commented May 18, 2016

Running 11 workers

@taylorotwell
Member

Interesting. So it was actually on deleting a failed job. Do you know why the jobs failed? Were you randomly causing jobs to fail just to test it?

@20TRIES
Contributor

20TRIES commented May 18, 2016

Duplicate primary key entry when inserting; genuine error in the job.

PDOException·SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '1525452' for key 'order_addresses_order_id_unique'

@daniel-farina

I'm having the same issue.

PDOException: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction in /home/forge/default/releases/20171113162416/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOConnection.php:77

I do run on a cluster with 3 MySQL servers, if that has anything to do with it.

Dan

@micmania1
Contributor

@daniel-farina does the workaround I posted solve your problem?

@ph4r05
Contributor

ph4r05 commented Nov 27, 2017

@micmania1 I tried your workaround and it seems to work, thanks!
Edit: not entirely, see below...

@vingrad

vingrad commented Dec 6, 2017

@micmania1 Did you make the changes in the vendor folder?

@victorboissiere

@micmania1 if that works I would love a pull request to fix deadlock issues. I have to process 23K jobs.

@ph4r05
Contributor

ph4r05 commented Dec 11, 2017

Hi guys,
I've recently got another deadlock on the system, even with this new fix.

After running innotop to inspect the deadlock I got this:

_____________________________________________________________________________________________ Deadlock Transactions _____________________________________________________________________________________________
ID  Timestring           User      Host       Victim  Time   Undo  LStrcts  Query Text                                                                                                                           
 4  2017-12-09 09:54:02  keychest  localhost  Yes     00:00     0        3  select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` as
 5  2017-12-09 09:54:02  keychest  localhost  No      00:00     1        3  delete from `jobs` where `id` = ?                                                                                                    

______________________________________ Deadlock Locks _______________________________________
ID  Waiting  Mode  DB        Table  Index                         Special          Ins Intent
 4        1  X     keychest  jobs   PRIMARY                       rec but not gap           0
 5        0  X     keychest  jobs   PRIMARY                       rec but not gap           0
 5        1  X     keychest  jobs   jobs_queue_reserved_at_index  rec but not gap           0

So apparently the delete transaction got the lock on the PRIMARY index and later tried to acquire a lock on another index, jobs_queue_reserved_at_index, maybe due to the lockForUpdate() on the previous query.

I am wondering why the select itself is needed there. So I changed this:

if ($this->database->table($this->table)->lockForUpdate()->find($id)) {
    $this->database->table($this->table)->where('id', $id)->delete();
}

To this:

$this->database->table($this->table)->where('id', $id)->delete();

And testing it on the dev system.

UPDATE: I am getting a different kind of deadlock now:

______________________________________________________________________________________________________ Deadlock Transactions ______________________________________________________________________________________________________
ID    Timestring           User      Host       Victim  Time   Undo  LStrcts  Query Text                                                                                                                                           
8767  2017-12-11 18:08:03  keychest  localhost  No      00:00     1        5  update `jobs` set `reserved_at` = ?, `attempts` = ? where `id` = ?                                                                                   
8768  2017-12-11 18:08:03  keychest  localhost  Yes     00:00     0        4  select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for up

_______________________________________ Deadlock Locks ________________________________________
ID    Waiting  Mode  DB        Table  Index                         Special          Ins Intent
8767        0  X     keychest  jobs   PRIMARY                                                 0
8767        1  X     keychest  jobs   jobs_queue_reserved_at_index  gap before rec            1
8768        1  X     keychest  jobs   PRIMARY                       rec but not gap           0

So two concurrent job selections are locked: one tries to update the selected job while the other one blocks it.

DatabaseQueue.php uses manual transaction management (beginTransaction(), commit()). I refactored DatabaseQueue::pop so it uses closure-based transactions with retry:
$this->database->transaction(closure). It does a correct rollback() when an exception occurs (e.g. the database detects a deadlock).

Deadlocks still happen, obviously, but the workers are able to recover.
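Roughly, the refactored pop() looks like this (a sketch only; it assumes the driver's internal helpers like getNextAvailableJob/markJobAsReserved, which differ slightly between framework versions):

public function pop($queue = null)
{
    $queue = $this->getQueue($queue);

    // The closure-based transaction rolls back automatically when an
    // exception is thrown (e.g. a detected deadlock) and can retry the
    // closure a given number of times, so the worker never gets stuck
    // inside an open transaction.
    return $this->database->transaction(function () use ($queue) {
        if ($job = $this->getNextAvailableJob($queue)) {
            $job = $this->markJobAsReserved($job);

            // ... wrap $job in a DatabaseJob instance as the stock driver
            // does (constructor arguments vary between versions) ...
        }
    }, 3);
}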

@sisve
Contributor

sisve commented Dec 12, 2017

For those following this issue: @ph4r05 is a fantastic provider of debugging information. This is the first time someone has checked where the deadlock actually occurs. In his case it's a gap lock on the jobs_queue_reserved_at_index index. This index was removed from the default 5.5 jobs table in acd933f.

So, if anyone still has deadlock issues, check your SHOW ENGINE INNODB STATUS output, check whether it is a gap lock on that index, and if so, try removing the index.
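If you're on an upgraded install that still carries the old index, dropping it is a one-line migration (the index name below is the one reported in this thread; double-check yours with SHOW INDEX FROM jobs first):

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

class DropJobsQueueReservedAtIndex extends Migration
{
    public function up()
    {
        Schema::table('jobs', function (Blueprint $table) {
            // Drop the composite index created by the pre-5.5 queue:table migration.
            $table->dropIndex('jobs_queue_reserved_at_index');
        });
    }

    public function down()
    {
        Schema::table('jobs', function (Blueprint $table) {
            // Recreate it with the same name if you need to roll back.
            $table->index(['queue', 'reserved_at'], 'jobs_queue_reserved_at_index');
        });
    }
}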

I should probably say something like "make sure you know your database before randomly removing indexes" or "take backups before drinking" or something. This is an internet forum, I'm talking about changes in important system tables. You know the deal. ;)

ph4r05 added a commit to ph4r05/framework that referenced this issue Dec 12, 2017
ph4r05 added a commit to ph4r05/framework that referenced this issue Dec 12, 2017
@ph4r05
Contributor

ph4r05 commented Dec 12, 2017

Thanks to @sisve for the discussion. I didn't notice the index had been removed, as I started the project on an older Laravel and the database migrations did not remove the index during the Laravel upgrade (as far as I am aware) - this may be the same case for others in this discussion. After removing the index there are no more deadlocks.


@taylorotwell @20TRIES regarding the discussion about lock contention in DB workers: for research and testing purposes I reimplemented the database queue's pessimistic locking as optimistic locking. Take a look at my commit if interested: ph4r05@60fcddd

There are no explicit transactions involved and no DB-level locking. DB workers compete for jobs via a version column, using the affected-rows information from the update call.
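The gist of it, as a sketch (using a hypothetical version column on the jobs table; not the exact code from the commit):

// Read a candidate job without any locks, then try to claim it by bumping
// its version. Only the worker whose UPDATE affects a row wins the job.
$job = $this->database->table($this->table)
    ->where('queue', $queue)
    ->whereNull('reserved_at')
    ->where('available_at', '<=', time())
    ->orderBy('id')
    ->first();

if ($job) {
    $claimed = $this->database->table($this->table)
        ->where('id', $job->id)
        ->where('version', $job->version)   // fails if someone beat us to it
        ->update([
            'reserved_at' => time(),
            'attempts'    => $job->attempts + 1,
            'version'     => $job->version + 1,
        ]);

    if ($claimed === 0) {
        // Another worker claimed the job first; loop and pick the next one.
    }
}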

I don't think this is production-ready code (don't use it without proper testing), and some benchmarking of pessimistic vs optimistic locking would be nice (I can get to that in a few days).

I can submit a PR later (once we have benchmarks) if you deem it interesting.

@micmania1
Contributor

This is only a theory, but this is what I think is causing the deadlock on delete.

     Queue Worker 1                 Queue Worker 2

1    Select (lock)
2    Update
3    Process Job                    Select (lock)
4    Delete Job (locked - fail)     Update
5    Delete Job (success)           Process Job
6                                   Delete Job (success - 0 rows affected)

This would mean that a job could get processed twice. I am running this with the assumption of a transaction retry (i.e. the workaround I provided).

Whilst the delete retry solved my problem locally, when we began testing in what will be our prod environment we ran into further issues. We're using a Galera cluster with a master-master DB setup. Locking doesn't work the same way, and the default queue simply can't be run concurrently with that setup.

I've solved this in our project by separating the queue into multiple streams, where each worker is responsible for a single stream. When a job is saved it picks one of the available streams at random. You could potentially do this by using different queues too (one per environment), but I never tried that. This introduces other potential issues (mostly operational), but it essentially avoids locking/conflicts completely, which is our biggest concern, and it provided good speed and some scalability.
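A sketch of the stream idea using named queues (the queue names and job class are illustrative; each supervisor program runs one worker pinned to one stream):

// Dispatch side: spread jobs across N "stream" queues at random so that
// concurrent workers never compete for the same subset of rows.
$streams = ['stream-1', 'stream-2', 'stream-3', 'stream-4'];

dispatch(
    (new \App\Jobs\CreateOrder($order))            // any queueable job
        ->onQueue($streams[array_rand($streams)])
);

// Worker side, one process per stream, e.g. in the supervisor config:
//   php artisan queue:work database --queue=stream-1 --sleep=3 --tries=3
//   php artisan queue:work database --queue=stream-2 --sleep=3 --tries=3
//   ... and so on for each stream.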

If you do have the luxury of moving to a different queue system, you should do that.

@ph4r05
Contributor

ph4r05 commented Dec 13, 2017

There was a bug in the database queue's transaction handling (no rollback) which caused the workers to freeze. Take a look at pull request #22394 (and maybe check whether you have the jobs_queue_reserved_at_index index).

The second worker should not pick up the job while it is reserved. If a job is being processed twice, consider tuning the retryAfter parameter.
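For reference, that is the retry_after value on the database connection in config/queue.php (the number below is illustrative; it just has to exceed your longest-running job):

'connections' => [

    'database' => [
        'driver' => 'database',
        'table' => 'jobs',
        'queue' => 'default',
        // Seconds before a reserved job is considered expired and released
        // back onto the queue; too small a value re-runs slow jobs.
        'retry_after' => 300,
    ],

],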

I am currently using the queue with the PR merged and it works perfectly for me.

@sisve
Contributor

sisve commented Dec 14, 2017

@micmania1 Use SHOW ENGINE INNODB STATUS to debug your deadlock / lock timeout issues. Your theory does not include the scope of the transactions, and that is important information. There is a SELECT+UPDATE and a SELECT+DELETE that have their own transactions. Activities outside transactions are probably not causing locking issues. (There are technically implicit transactions with auto-commit, so locks held by others are still respected, but the transaction's own locks are released after every statement.)

@ph4r05
Contributor

ph4r05 commented Dec 15, 2017

@sisve @micmania1 I have a new example of deadlocks on the DatabaseQueue.

With #22394 and #22433 merged the DatabaseQueue workers will recover from the deadlock (no freeze) but some jobs will get processed more than once. Here is why and how:

I created a benchmark project for Laravel queueing to test various job-processing modifications. The project is a fresh Laravel 5.5 install (created 1 day ago with Composer). I generate a synthetic load and measure correctness and performance. The scenario:

  • 10 workers --queue=high,default,low --sleep=1 --tries=3
  • pre-generate 10000 empty jobs, so we have high concurrency. Queue default, available_at set to the future.
  • after the whole batch is ready in the database, UPDATE jobs SET available_at=0; which causes job processing by workers.
  • innodb_deadlock_detect=1, MySQL 5.7.20, InnoDB
CREATE TABLE `jobs` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `queue` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `payload` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
  `attempts` tinyint(3) unsigned NOT NULL,
  `reserved_at` int(10) unsigned DEFAULT NULL,
  `available_at` int(10) unsigned NOT NULL,
  `created_at` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `jobs_queue_index` (`queue`)
) ENGINE=InnoDB AUTO_INCREMENT=353855 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Deadlock:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2017-12-15 02:57:39 0x7efd583f3700
*** (1) TRANSACTION:
TRANSACTION 3183282, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 8 row lock(s)
MySQL thread id 1426, OS thread handle 139626575173376, query id 15379859 localhost 127.0.0.1 laravel Sending data
select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 27 page no 3630 n bits 104 index PRIMARY of table `laravel`.`jobs` trx id 3183282 lock_mode X locks rec but not gap waiting
Record lock, heap no 5 PHYSICAL RECORD: n_fields 9; compact format; info bits 32
 0: len 8; hex 00000000000575bb; asc       u ;;

*** (2) TRANSACTION:
TRANSACTION 3183286, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 1437, OS thread handle 139626572363520, query id 15379876 localhost 127.0.0.1 laravel updating
delete from `jobs` where `id` = ?
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 27 page no 3630 n bits 104 index PRIMARY of table `laravel`.`jobs` trx id 3183286 lock_mode X locks rec but not gap
Record lock, heap no 5 PHYSICAL RECORD: n_fields 9; compact format; info bits 32
 0: len 8; hex 00000000000575bb; asc       u ;;
 
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 27 page no 4 n bits 824 index jobs_queue_index of table `laravel`.`jobs` trx id 3183286 lock_mode X locks rec but not gap waiting
Record lock, heap no 719 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 7; hex 64656661756c74; asc default;;
 1: len 8; hex 00000000000575bb; asc       u ;;

*** WE ROLL BACK TRANSACTION (2)

Both transactions are locking the same record. The problem is that the delete transaction is rolled back. The exception propagates to the worker, which continues to the next task. The culprit is an index, this time jobs_queue_index I guess.

The job was already processed but not removed from the database so it will be picked up after it expires. This can happen multiple times.

It would be preferable if transaction 1 (the select) were rolled back instead of the delete, but I am not aware of a way to configure that.

The funny thing is the following hack solves the problem:

public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        $this->database->table($this->table)->where('id', $id)->delete();
    }, 5);
}

As the worker really insists on deleting that job, with the magic constant 5 I had no more deadlocks.
Some kind of back-off mechanism could be implemented to try to remove the job several times and eventually fail; this would significantly decrease the error rate, but it is not a clean approach. (200 jobs per second)

Another way out is adding a new flag field delete_mark - soft delete indicator / 2-stage delete:

public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        $this->database->table($this->table)->where('id', $id)->update(['delete_mark' => 1]);
    });
   
    $this->database->table($this->table)->where('id', $id)->delete();
}   

If a worker picks a job with delete_mark > 0 it deletes/ignores it, so the job will be processed only once. The update will succeed as it does not modify the index (the delete modifies the index). I will test this approach more and if it works correctly I will submit a PR (170 jobs per second).

Another solution is optimistic locking, which I am testing right now - no deadlocks. I will publish a blog post with benchmarks soon, but the performance is not very shiny with the first implementation (slower than pessimistic, 100 jobs per second).

Another solution is removing the index altogether (slow, 90 jobs per second).

If innodb_deadlock_detect=0 then the only options are a) optimistic locking b) removing the index, it seems.

Just for comparison, beanstalkd queue with the same parameters gives 1080 jobs per second.

@sisve
Contributor

sisve commented Dec 15, 2017

I'm looking at the InnoDB status you've provided, but I do not see the actual culprit.

  1. Transaction 1 is attempting to fetch a new job to process, and does this by acquiring an exclusive record lock on primary key. This is normal.
  2. Transaction 2 is trying to delete a job that has been processed and has already acquired the exclusive lock on the primary key that the first transaction is waiting for. It is now blocked waiting for an exclusive lock for the record in jobs_queue_index.

Your status does not tell us who has already acquired a conflicting lock on jobs_queue_index.

These locks are not gap locks, so it's not the same problem as with jobs_queue_reserved_at_index. Do you perhaps have an open connection from your desktop where you've accidentally started a long-running transaction and caused some locks?

@ph4r05
Contributor

ph4r05 commented Dec 15, 2017

@sisve thanks for your time! If you want to take a look, here is the complete report: https://github.com/ph4r05/laravel-queueing-benchmark/blob/master/.github/deadlock07.txt

I think the index is held by the select queries.
I've disabled deadlock detection and increased the timeouts so it's easier to debug:

set global innodb_deadlock_detect=0;
set global innodb_lock_wait_timeout=1000;
set innodb_lock_wait_timeout=1000;

@sisve
Contributor

sisve commented Dec 15, 2017

We have two different transactions going on.

  1. Fetch a new job (DatabaseQueue::pop)

    SELECT * FROM jobs WHERE queue = ? AND (...) ORDER BY id ASC LIMIT 1 FOR UPDATE
    UPDATE jobs SET reserved_at = ? WHERE id = ?

  2. Delete a job after it is processed (DatabaseQueue::deleteReserved)

    SELECT * FROM jobs WHERE id = ? FOR UPDATE
    DELETE FROM jobs WHERE id = ?

I'll be guessing more than usual from here on.

It seems like the select that fetches the jobs issues locks on both jobs_queue_index and the primary key (guessing that it does it in that order), while the select in deleteReserved only locks the primary key. This allows the fetch-select to grab half (the index) of the locks it requires and block waiting for the primary key lock, while the deletion transaction has grabbed the primary key lock and is now blocking on the index lock.

So, while I am still guessing wildly, can we trick the database into issuing index locks? Can we modify the locking in deleteReserved so it's based on SELECT * FROM jobs WHERE queue = ? AND id = ? FOR UPDATE, which would hopefully issue the same locks as the fetch-select?
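As a sketch, that change to deleteReserved might look like this (untested; it just illustrates the lock shape):

public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        // Lock the row through the same (queue, id) access path the
        // job-fetching select uses, so both transactions acquire their
        // locks in a compatible order.
        $locked = $this->database->table($this->table)
            ->where('queue', $this->getQueue($queue))
            ->where('id', $id)
            ->lockForUpdate()
            ->first();

        if ($locked) {
            $this->database->table($this->table)->where('id', $id)->delete();
        }
    });
}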

All this is extremely dependent on the table structure. Any other indexes would probably also add to the chaos and madness. Mmm madness...

@ph4r05
Contributor

ph4r05 commented Dec 23, 2017

I made some benchmarks of the Laravel queueing methods (redis, beanstalkd, database) on an Amazon c4.large:

I also implemented an optimistic locking strategy for the database driver (no deadlocks); blog post + repo:

@themsaid
Member

Hey guys,

I've managed to test running 500,000 jobs across 15 workers without any deadlocks with the PR (#31287). Thoughts please :)

@murphatron

murphatron commented Mar 4, 2021

I'm experiencing similar deadlock behavior with Laravel 8 + MySQL 8 with job batching. From what I can tell, it seems the DatabaseBatchRepository could benefit from the same fix applied in @themsaid's PR (#31287), due to job_batches being so transaction- and lock-heavy in its interaction. Should I open a new issue for this? I added issue 36478 for my situation.

@anon-pr

anon-pr commented Nov 10, 2021

Laravel 8 with MySQL 5.7. 100 jobs with 8 workers and I still ran into the same error. However, I noticed that the job/worker ratio is crucial here. When I had 10 jobs on average with 8 workers, I had more deadlock errors. Thought it might help.

@Daizygod

Hi everyone, in my project I have two servers with Laravel 10.48.17, MariaDB, and supervisor:
the first server has around 32 cores and 32 workers
the second server has around 8 cores and 8 workers

The problem is that deadlocks on job deletion happen only on the second server.

The only difference between their queue configurations is the number of workers.

I already experimented with the number of workers, but that didn't work; maybe I should test with only 1 worker, or upgrade the database CPU and memory.
