[5.5] database queue - transactions management fix #22394

ph4r05 · 2017-12-12T10:15:43Z

Database transactions management fix, rollback on exception.
fixes Laravel 5. PDOException. QUEUE_DRIVER=database 1213 Deadlock #7046. With this patch job workers are able to recover from transaction deadlocks (instead of freezing)

EDIT, TLDR:
The bug: rollback() is not called for an opened transaction on $this->database object in DatabaseQueue if an Exception is thrown from any reason (e.g., deadlock, timeout). This leaves database object in inconsistent state (no commit() nor rollback() for started transaction). This is problem for DB workers / daemons.

5.5 solution: fixing transaction management by wrapping tsx in try/catch and adding rollback() to catch block. No breaking changes.

5.6 solution (potential): if 5.5 gets merged, refactor DatabaseQueue a bit so transactions are wrapped in closures - cleaner and safer. Removing nonintuitive commit() in marshallJob() (possibly breaking change).

sisve · 2017-12-12T10:58:45Z

Slightly relevant: #22369

I would like to repeat my comment from the linked pr;

I dislike this change just because it has a mentality of "try it again" instead of figuring out the actual problem. Could you describe your table structure and what indexes you have? What in your show engine innodb status indicates that this solution is appropriate? What types of locks are involved in your deadlocks?

ph4r05 · 2017-12-12T11:06:57Z

Thanks for the comment!

By default there is attempt=1 so it is not "try it again" approach I guess. Wrapping transactions in the closures is more clean and safe coding style on its own IMHO. And moreover even with attempt=1 it actually solves the problem. Workers are not frozen for hours but recover from the deadlock.

In my setup I have 5 queues: high, default, low, emails, sshpool
4x workers for --queue=high,default,low --sleep=1 --tries=3
4x workers for --queue=emails --sleep=1 --tries=3
4x workers for --queue=sshpool --sleep=1 --tries=3

Without my patch the workers freeze, transactions are timeouting in 50 seconds (default InnoDB), workers are stuck. WIth the patch deadlock happens but the worker fails normally and is able to recover (deadlock is detected under second rather than 50 seconds timeout).

My jobs table (ordinary):

 CREATE TABLE `jobs` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `queue` varchar(191) COLLATE utf8mb4_unicode_ci NOT NULL,
  `payload` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
  `attempts` tinyint(3) unsigned NOT NULL,
  `reserved_at` int(10) unsigned DEFAULT NULL,
  `available_at` int(10) unsigned NOT NULL,
  `created_at` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `jobs_queue_reserved_at_index` (`queue`,`reserved_at`)
) ENGINE=InnoDB AUTO_INCREMENT=639674 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Here are deadlocks I've been experiencing (innotop):

_____________________________________________________________________________________________ Deadlock Transactions _____________________________________________________________________________________________
ID  Timestring           User      Host       Victim  Time   Undo  LStrcts  Query Text                                                                                                                           
 4  2017-12-09 09:54:02  keychest  localhost  Yes     00:00     0        3  select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` as
 5  2017-12-09 09:54:02  keychest  localhost  No      00:00     1        3  delete from `jobs` where `id` = ?                                                                                                    

______________________________________ Deadlock Locks _______________________________________
ID  Waiting  Mode  DB        Table  Index                         Special          Ins Intent
 4        1  X     keychest  jobs   PRIMARY                       rec but not gap           0
 5        0  X     keychest  jobs   PRIMARY                       rec but not gap           0
 5        1  X     keychest  jobs   jobs_queue_reserved_at_index  rec but not gap           0

And another one:

______________________________________________________________________________________________________ Deadlock Transactions ______________________________________________________________________________________________________
ID    Timestring           User      Host       Victim  Time   Undo  LStrcts  Query Text                                                                                                                                           
8767  2017-12-11 18:08:03  keychest  localhost  No      00:00     1        5  update `jobs` set `reserved_at` = ?, `attempts` = ? where `id` = ?                                                                                   
8768  2017-12-11 18:08:03  keychest  localhost  Yes     00:00     0        4  select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for up

_______________________________________ Deadlock Locks ________________________________________
ID    Waiting  Mode  DB        Table  Index                         Special          Ins Intent
8767        0  X     keychest  jobs   PRIMARY                                                 0
8767        1  X     keychest  jobs   jobs_queue_reserved_at_index  gap before rec            1
8768        1  X     keychest  jobs   PRIMARY                       rec but not gap           0

Locks:

_____________________________________________________ InnoDB Locks _____________________________________________________
ID     Type    Waiting  Wait   Active    Mode  DB        Table  Index                         Ins Intent  Special       
 8767  RECORD        1  00:48  02:32:10  X     keychest  jobs   jobs_queue_reserved_at_index           0                
 8766  RECORD        1  00:30  02:27:51  X     keychest  jobs   jobs_queue_reserved_at_index           0                
10578  RECORD        1  00:22     00:22  X     keychest  jobs   jobs_queue_reserved_at_index           1  gap before rec
10579  RECORD        1  00:13     00:13  X     keychest  jobs   jobs_queue_reserved_at_index           1  gap before rec

Processes:

+-------+-----------+-----------------+----------+---------+------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Id    | User      | Host            | db       | Command | Time | State          | Info                                                                                                                                                      |
+-------+-----------+-----------------+----------+---------+------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|  8766 | keychest  | localhost:40468 | keychest | Execute |    4 | Sorting result | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
|  8767 | keychest  | localhost:40472 | keychest | Execute |   22 | Sorting result | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 10581 | keychest  | localhost:36070 | keychest | Execute |   25 | update         | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                |
| 10582 | keychest  | localhost:36072 | keychest | Execute |   16 | update         | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                |
+-------+-----------+-----------------+----------+---------+------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+

Transactions:

+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| waiting_trx_id | waiting_thread | waiting_query                                                                                                                                             | blocking_trx_id | blocking_thread | blocking_query                                                                                                                                            |
+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| 62D36A9        |          10593 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C97B1         |            8766 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62D36A9        |          10593 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9348         |            8767 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62D36A9        |          10593 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62D36A9        |          10593 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62D36A0        |          10592 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C97B1         |            8766 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62D36A0        |          10592 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9348         |            8767 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62D36A0        |          10592 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62D36A0        |          10592 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62C97B1        |           8766 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62C9348        |           8767 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update | 62C97B1         |            8766 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62C9348        |           8767 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update | 62C9345         |            8768 | NULL                                                                                                                                                      |
+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
11 rows in set (0.00 sec)

Lock info:

+---------------------+-------------+-----------+-----------+-------------------+--------------------------------+------------+-----------+----------+------------------------------+
| lock_id             | lock_trx_id | lock_mode | lock_type | lock_table        | lock_index                     | lock_space | lock_page | lock_rec | lock_data                    |
+---------------------+-------------+-----------+-----------+-------------------+--------------------------------+------------+-----------+----------+------------------------------+
| 62C97B1:0:52555:193 | 62C97B1     | X         | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9348:0:52555:193 | 62C9348     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C97B1:0:52555:193 | 62C97B1     | X         | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9348:0:52555:193 | 62C9348     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C97B1:0:52555:193 | 62C97B1     | X         | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
+---------------------+-------------+-----------+-----------+-------------------+--------------------------------+------------+-----------+----------+------------------------------+

sisve · 2017-12-12T11:18:56Z

It seems you have gap locks on the jobs_queue_reserved_at_index index. This has been removed in the default jobs table in 5.5. Try removing it.

ph4r05 · 2017-12-12T11:30:34Z

@sisve Thanks for the suggestion, removing the index would definitelly fix the problem.

Lets investigate transaction usage in the DatabaseQueue for a bit - I just want to be sure there is no potential bug.

Function marshalJob() contains the following code:

$job = $this->markJobAsReserved($job);

$this->database->commit();

If deadlock happens (on the index which is removed now) or any other exception is thrown the commit() is not triggered, nor rollback in the laravel DB layer. Connection state (e.g., Connection::transactions) remains in 1 which causes further problems with commits and events, if I get the transaction processing correctly.

Another fact is that there are 2 commits() in DatabaseQueue for the same transaction, each in a different method. One in pop(), another one in marshalJob.

I believe wrapping transactions in the closure is more clean and can avoid potential problems in the future even though the index was removed and obvious deadlock is avoided.

Does it make sense?

ph4r05 · 2017-12-12T11:34:46Z

@sisve can you pls check out the transaction wrapping in closures reasoning? I still believe the current version of DatabaseQueue is prone to errors in the future (even if it is working fine right now), does dangerous transaction state management and the transaction processing is not well readable (2 scattered commits).

I can remove transactionAttempt variable from the PR so it is 1 by default -> the same functionality like it had before the PR but with cleaner transaction processing.

Thanks for giving it a thought.

sisve · 2017-12-12T11:41:34Z

Wrapping the transactional stuff in a call to $this->database->transaction(...) absolutely makes sense. However, in this case it means that the behavior of the DatabaseQueue changes; marshalJob(...) no longer commits the current transaction. I cannot recall that I've seen any third party code that depends on this behavior, but I havn't been actively looking either. This indicates that the code should target the 5.6 branch to reduce the number of broken packages if this were to released in the middle of the 5.5 lifecycle.

However, if we were to rewrite the call to ->transaction() into manual handling with try/catch/finally and keep the call to commit() in marshalJob we could still target the 5.5 branch.

sisve · 2017-12-12T11:45:21Z

To be clear, it's the behavior change of the marshalJob() that indicates that we should target 5.6, not that I havn't been looking for third party code that calls it. ;)

ph4r05 · 2017-12-12T12:04:48Z

Aah I didnt realize this could be a breaking change as marshalJob() does not do commit() on its own.

So it probably makes more sense to make it 5.6 ? Or what do you think would be the best?

sisve · 2017-12-12T12:13:14Z

I would like to see this being merged into the 5.5 branch as I believe this is a important bugfix of how we handle deadlocks (and other problems) when calling DatabaseQueue::pop(). I imagine that an implementation would look like this, without changing marshalJob:

    try {
        $this->database->beginTransaction();

        if ($job = $this->getNextAvailableJob($queue)) {
            return $this->marshalJob($queue, $job);
        }

        $this->database->commit();
    } catch (\Throwable $ex) {
        $this->database->rollBack();
        throw $ex;
    }

… fixed - fixes laravel#7046

ph4r05 · 2017-12-12T12:35:18Z

Thanks for reconsidering! (PR updated)

deleugpn · 2017-12-12T13:23:55Z

Can someone clarify how this fixes anything? The current code already starts a transaction and commit after. If anything wrong happens, the commit will never be called, which means the database will roll it back anyway?

ph4r05 · 2017-12-12T13:30:27Z

@deleugpn I will try to clarify:

@sisve pls correct me If I am wrong but if $this->markJobAsReserved($job); fails for some reason (e.g., database killing it) an Exception is thrown. There is no $this->database->rollBack(); call in the pop() call tree for the started transaction which leaves $this->database in the inconsisent state - e.g., number of active transactions.

deleugpn · 2017-12-12T13:55:10Z

DatabaseQueue uses Illuminate\Database\Connection, which uses PDO.

The Database Queue flow without your changes already starts a transaction and tries to commit. if any error happens before $this->database->commit(), an exception will be thrown that will bubble up. Once the framework shuts down and PHP runs the garbage collector, the database connection will be closed. PDO will automatically rollback any transaction uncommitted.

When the script ends or when a connection is about to be closed, if you have an outstanding transaction, PDO will automatically roll it back. This is a safety measure to help avoid inconsistency in the cases where the script terminates unexpectedly--if you didn't explicitly commit the transaction, then it is assumed that something went awry, so the rollback is performed for the safety of your data.

My point is: you're not fixing anything. This already happens. I still think this should be merged because explicit code demonstrating the behavior is easier to understand, but the awareness that I want to raise is that whatever bug you're experiencing now will still happen because a rollback is already happening, unless I'm missing something.

ph4r05 · 2017-12-12T13:58:01Z

@deleugpn thanks for the comment. I dont think my patch does not fix anything:

I have my DB worker running indefinitely so there is no framework shutdown happening and database object is left in inconsisent state which is a problem for the worker and for the subsequent commits() on the database object in the worker - it still thinks there is a transaction in progress (transaction counter was not reset).

I think it is totally normal to do transactions in try/catch/finally and call rollback() appropriatelly.

sisve · 2017-12-12T13:59:38Z

@ph4r05 is correct, if an exception is thrown (due to timeouts, deadlocks and whatnot) there will never be any call to commit, nor rollback. The call to DatabaseQueue::pop() is primarily done from Worker::getNextJob(), and there are some exception handling there that handles lost database connections. That wont catch these specific deadlock issues.

@deleugpn Your argument is correct assuming that the framework would shut down; but it doesn't. Laravel catches these in Worker::getNextJob() and never rethrows them.

framework/src/Illuminate/Queue/Worker.php

Lines 240 to 257 in 242d438

    
           protected function getNextJob($connection, $queue) 
        
           { 
        
               try { 
        
                   foreach (explode(',', $queue) as $queue) { 
        
                       if (! is_null($job = $connection->pop($queue))) { 
        
                           return $job; 
        
                       } 
        
                   } 
        
               } catch (Exception $e) { 
        
                   $this->exceptions->report($e); 
        
                   $this->stopWorkerIfLostConnection($e); 
        
               } catch (Throwable $e) { 
        
                   $this->exceptions->report($e = new FatalThrowableError($e)); 
        
                   $this->stopWorkerIfLostConnection($e); 
        
               } 
        
           }

Since DatabaseQueue::pop() uses a transaction to both fetch the job, and mark it as reserved, I believe that it is also the responsibility of pop() to reset (rollback) the connection if something failed.

It is true that if a transaction is rolled back if it isn't explicitly committed, but that logic needs to know when to decide that it wasn't committed. This is usually done when the connection is closed or disconnected. However, the worker has the same connection for a long time in daemon mode, so the database only believes that we started a transaction, then did something weird, and that it is business as usual. It does not know that we semantically only wanted a small portion of our code in a transaction and then roll it back.

Note that we've (including me) keeps talking about deadlocks, but it is probably only a "lock wait timeout". And these are of course special for mysql, it only rolls back the single statement, not the entire active transaction.

The length of time in seconds an InnoDB transaction waits for a row lock before giving up. The default value is 50 seconds. A transaction that tries to access a row that is locked by another InnoDB transaction waits at most this many seconds for write access to the row before issuing the following error:
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
When a lock wait timeout occurs, the current statement is rolled back (not the entire transaction). To have the entire transaction roll back, start the server with the --innodb_rollback_on_timeout option. See also Section 14.21.4, “InnoDB Error Handling”.

[...]

The lock wait timeout value does not apply to deadlocks when innodb_deadlock_detect is enabled (the default) because InnoDB detects deadlocks immediately and rolls back one of the deadlocked transactions. When innodb_deadlock_detect is disabled, InnoDB relies on innodb_lock_wait_timeout for transaction rollback when a deadlock occurs. See Section 14.5.5.2, “Deadlock Detection and Rollback”.

Source: https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout

Note that innodb_deadlock_detect was introduced in 5.7.15.

deleugpn · 2017-12-12T14:03:17Z

Thanks @sisve and @ph4r05. The Daemon mode was the missing piece of the puzzle that never closes the connection. Awesome discussion and PR.

deleugpn · 2017-12-12T14:12:02Z

src/Illuminate/Queue/DatabaseQueue.php

-            return $this->marshalJob($queue, $job);
-        }
+            if ($job = $this->getNextAvailableJob($queue)) {
+                return $this->marshalJob($queue, $job);


The return statement will prevent the $this->database->commit(); from ever executing. However, there's also a commit call inside marshalJob(). That could lead to confusion. Can we remove the commit call from marshalJob and avoid this early return with a temporarily variable?

we discussed removing commit with @sisve - he correctly pointed out that it might be a breaking change as the marshalJob() does not commit the transaction anymore (if any thirdparty code happen to rely on that).

This breaking change could be added to 5.6 branch and I would rather then wrap the whole transaction in the closure to make it cleaner.

deleugpn · 2017-12-12T14:15:10Z

src/Illuminate/Queue/DatabaseQueue.php


-        $this->database->commit();
+            $this->database->commit();


I just noticed @sisve comment. If we want to keep the call to commit inside marshalJob, we need to not commit here.

marshalJob only executes if we've found a job, we still need to call commit() here to release any locks acquired by getNextAvailableJob

Hmm. If getNextAvailableJob() returns null the commit in marshalJob is not triggered :/

The commit here terminates the transaction correctly if there are no jobs in DB.

True that. A little confusing. Would be nice to get this merged into 5.5 and another PR that clean this up targeting 5.6 (after this gets merged).

@deleugpn thanks that would be great I think.

ph4r05 · 2017-12-12T14:43:18Z

src/Illuminate/Queue/DatabaseQueue.php

     */
    public function deleteReserved($queue, $id)
    {
-        $this->database->beginTransaction();
-
-        if ($this->database->table($this->table)->lockForUpdate()->find($id)) {


I've decided to preserve the original code and just fix the transaction management. But I am wondering what is the reason to lock a job for update by id when the only action then is delete by id.

I cannot see the reason why not just do delete by id directly in the transaction. Deleting directly would be faster and I think it is functionally equivalent. Just wondering (not needed to change it in this PR). Any thoughts? Maybe in 5.6?

Retry setup: 10 sec

Worker A picks job 1 and takes 9.9 seconds to finish it.

When worker A is about to delete job 1, worker B wants to pop a job and now job 1 is eligible to be picked.

In this scenario, without the lock I think you run an incredibly small chance of Worker B picking the job right before Worker A deletes it.

@deleugpn Thanks for the example! Hmm. If the retry is 10 second the job is eligible for re-run (is considered expired) after 10 seconds worker A claimed it even if the job is still running on worker A. This could be a problem of the timing and probably should not happen. Usually its not desired to process the same job twice and concurrently. I guess one should set retry time so that it is greater than any job that can potentially run.

If I get it right, this preemption can happen even with locking - before A execution flow reaches deleteReserved() it is not locked. The lock protects job from being taken by B only on those 3 lines. I am not sure this was intended protection for this. If we remove the find() completely worker A is more likely to delete it in time (no locking, straight delete).

On the other hand, imagine the the worker B took the job that is still being processed by A. A will reach deleteReserved() and deletes the job even though it is still being processed by B - also not very nice but at least the job gets removed after all.

OK lets wrap this up so we can merge and maybe discuss later for 5.6 patch. Thanks!

taylorotwell · 2017-12-12T14:49:12Z

This PR needs a TLDR after all of that discussion.

ph4r05 · 2017-12-12T14:57:15Z

@taylorotwell TLDR for me:

The bug: rollback() is not called for an opened transaction on $this->database object in DatabaseQueue if an Exception is thrown from any reason (e.g., deadlock, timeout). This leaves database object in inconsistent state (no commit() nor rollback() for started transaction). This is problem for DB workers / daemons.

5.5 solution: fixing transaction management by wrapping tsx in try/catch and adding rollback() to catch block. No breaking changes.

5.6 solution: if 5.5 gets merged, refactor DatabaseQueue a bit so transactions are wrapped in closures - cleaner and safer. Removing nonintuitive commit() in marshallJob() (possibly breaking change).

ph4r05 · 2017-12-12T18:55:16Z

@sisve is it required to confirm review of source codes? Thanks! :)

sisve · 2017-12-12T19:02:10Z

@ph4r05 I'll admit that I'm not really sure how the code reviews works on GitHub. The current changes looks good and all tests are passing. Your TL;DR is correct.

laurencei · 2017-12-14T01:10:34Z

@sisve

Note that innodb_deadlock_detect was introduced in 5.7.15.

So what happens if someone is running < 5.7.15 with this code? Will it break anything? Or will it continue as it was before?

edit: looking into it more - it looks like it has no effect. But just double checking to be sure.

sisve · 2017-12-14T04:28:18Z

I think most people run a version without innodb_deadlock_detect; this means that they have the current initially reported issue - a long lock wait timeout and then an error message. This is the weird scenario where mysql doesn't revert the entire scenario, just the statement.

Those running mysql with innodb_deadlock_detect would probably, if I read the docs right, have an instant failure and a complete transaction rollback. It would work for them as we want.

ph4r05 force-pushed the 5.5_database-queue-deadlock-fix branch 2 times, most recently from 88802fa to 46046e7 Compare December 12, 2017 10:19

sisve mentioned this pull request Dec 12, 2017

[5.5] Add retries to delete reserved on jobs table to fix dead lock issues on heavy queue loads #22369

Closed

ph4r05 force-pushed the 5.5_database-queue-deadlock-fix branch from 46046e7 to b3ca36b Compare December 12, 2017 11:37

5.5 database queue - transactions wrapped in closures, tsx management…

02fe24a

… fixed - fixes laravel#7046

ph4r05 force-pushed the 5.5_database-queue-deadlock-fix branch from b3ca36b to 02fe24a Compare December 12, 2017 12:33

ph4r05 changed the title ~~[5.5] database queue - transactions wrapped in closures, retry attempts~~ [5.5] database queue - transactions management fix Dec 12, 2017

deleugpn suggested changes Dec 12, 2017

View reviewed changes

deleugpn reviewed Dec 12, 2017

View reviewed changes

deleugpn approved these changes Dec 12, 2017

View reviewed changes

ph4r05 commented Dec 12, 2017

View reviewed changes

ph4r05 mentioned this pull request Dec 13, 2017

Laravel 5. PDOException. QUEUE_DRIVER=database 1213 Deadlock #7046

Closed

taylorotwell merged commit 02fe24a into laravel:5.5 Dec 14, 2017

ph4r05 mentioned this pull request Dec 14, 2017

[5.6] database queue transaction wrap #22433

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[5.5] database queue - transactions management fix #22394

[5.5] database queue - transactions management fix #22394

ph4r05 commented Dec 12, 2017 •

edited

Loading

sisve commented Dec 12, 2017

ph4r05 commented Dec 12, 2017 •

edited

Loading

sisve commented Dec 12, 2017

ph4r05 commented Dec 12, 2017 •

edited

Loading

ph4r05 commented Dec 12, 2017 •

edited

Loading

sisve commented Dec 12, 2017

sisve commented Dec 12, 2017

ph4r05 commented Dec 12, 2017 •

edited

Loading

sisve commented Dec 12, 2017

ph4r05 commented Dec 12, 2017 •

edited

Loading

deleugpn commented Dec 12, 2017

ph4r05 commented Dec 12, 2017

deleugpn commented Dec 12, 2017 •

edited

Loading

ph4r05 commented Dec 12, 2017 •

edited

Loading

sisve commented Dec 12, 2017

deleugpn commented Dec 12, 2017

deleugpn Dec 12, 2017

ph4r05 Dec 12, 2017 •

edited

Loading

deleugpn Dec 12, 2017

sisve Dec 12, 2017

ph4r05 Dec 12, 2017

deleugpn Dec 12, 2017

ph4r05 Dec 12, 2017

ph4r05 Dec 12, 2017 •

edited

Loading

deleugpn Dec 12, 2017 •

edited

Loading

ph4r05 Dec 12, 2017 •

edited

Loading

ph4r05 Dec 12, 2017

taylorotwell commented Dec 12, 2017

ph4r05 commented Dec 12, 2017

ph4r05 commented Dec 12, 2017

sisve commented Dec 12, 2017

laurencei commented Dec 14, 2017 •

edited

Loading

sisve commented Dec 14, 2017

[5.5] database queue - transactions management fix #22394

[5.5] database queue - transactions management fix #22394

Conversation

ph4r05 commented Dec 12, 2017 • edited Loading

sisve commented Dec 12, 2017

ph4r05 commented Dec 12, 2017 • edited Loading

sisve commented Dec 12, 2017

ph4r05 commented Dec 12, 2017 • edited Loading

ph4r05 commented Dec 12, 2017 • edited Loading

sisve commented Dec 12, 2017

sisve commented Dec 12, 2017

ph4r05 commented Dec 12, 2017 • edited Loading

sisve commented Dec 12, 2017

ph4r05 commented Dec 12, 2017 • edited Loading

deleugpn commented Dec 12, 2017

ph4r05 commented Dec 12, 2017

deleugpn commented Dec 12, 2017 • edited Loading

ph4r05 commented Dec 12, 2017 • edited Loading

sisve commented Dec 12, 2017

deleugpn commented Dec 12, 2017

deleugpn Dec 12, 2017

Choose a reason for hiding this comment

ph4r05 Dec 12, 2017 • edited Loading

Choose a reason for hiding this comment

deleugpn Dec 12, 2017

Choose a reason for hiding this comment

sisve Dec 12, 2017

Choose a reason for hiding this comment

ph4r05 Dec 12, 2017

Choose a reason for hiding this comment

deleugpn Dec 12, 2017

Choose a reason for hiding this comment

ph4r05 Dec 12, 2017

Choose a reason for hiding this comment

ph4r05 Dec 12, 2017 • edited Loading

Choose a reason for hiding this comment

deleugpn Dec 12, 2017 • edited Loading

Choose a reason for hiding this comment

ph4r05 Dec 12, 2017 • edited Loading

Choose a reason for hiding this comment

ph4r05 Dec 12, 2017

Choose a reason for hiding this comment

taylorotwell commented Dec 12, 2017

ph4r05 commented Dec 12, 2017

ph4r05 commented Dec 12, 2017

sisve commented Dec 12, 2017

laurencei commented Dec 14, 2017 • edited Loading

sisve commented Dec 14, 2017

ph4r05 commented Dec 12, 2017 •

edited

Loading

ph4r05 commented Dec 12, 2017 •

edited

Loading

ph4r05 commented Dec 12, 2017 •

edited

Loading

ph4r05 commented Dec 12, 2017 •

edited

Loading

ph4r05 commented Dec 12, 2017 •

edited

Loading

ph4r05 commented Dec 12, 2017 •

edited

Loading

deleugpn commented Dec 12, 2017 •

edited

Loading

ph4r05 commented Dec 12, 2017 •

edited

Loading

ph4r05 Dec 12, 2017 •

edited

Loading

ph4r05 Dec 12, 2017 •

edited

Loading

deleugpn Dec 12, 2017 •

edited

Loading

ph4r05 Dec 12, 2017 •

edited

Loading

laurencei commented Dec 14, 2017 •

edited

Loading