Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5.5] database queue - transactions management fix #22394

Merged
merged 1 commit into from
Dec 14, 2017

Conversation

ph4r05
Copy link
Contributor

@ph4r05 ph4r05 commented Dec 12, 2017

EDIT, TLDR:
The bug: rollback() is not called for an opened transaction on $this->database object in DatabaseQueue if an Exception is thrown from any reason (e.g., deadlock, timeout). This leaves database object in inconsistent state (no commit() nor rollback() for started transaction). This is problem for DB workers / daemons.

5.5 solution: fixing transaction management by wrapping tsx in try/catch and adding rollback() to catch block. No breaking changes.

5.6 solution (potential): if 5.5 gets merged, refactor DatabaseQueue a bit so transactions are wrapped in closures - cleaner and safer. Removing nonintuitive commit() in marshallJob() (possibly breaking change).

@ph4r05 ph4r05 force-pushed the 5.5_database-queue-deadlock-fix branch 2 times, most recently from 88802fa to 46046e7 Compare December 12, 2017 10:19
@sisve
Copy link
Contributor

sisve commented Dec 12, 2017

Slightly relevant: #22369

I would like to repeat my comment from the linked pr;

I dislike this change just because it has a mentality of "try it again" instead of figuring out the actual problem. Could you describe your table structure and what indexes you have? What in your show engine innodb status indicates that this solution is appropriate? What types of locks are involved in your deadlocks?

@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

Thanks for the comment!

By default there is attempt=1 so it is not "try it again" approach I guess. Wrapping transactions in the closures is more clean and safe coding style on its own IMHO. And moreover even with attempt=1 it actually solves the problem. Workers are not frozen for hours but recover from the deadlock.

  • In my setup I have 5 queues: high, default, low, emails, sshpool
  • 4x workers for --queue=high,default,low --sleep=1 --tries=3
  • 4x workers for --queue=emails --sleep=1 --tries=3
  • 4x workers for --queue=sshpool --sleep=1 --tries=3

Without my patch the workers freeze, transactions are timeouting in 50 seconds (default InnoDB), workers are stuck. WIth the patch deadlock happens but the worker fails normally and is able to recover (deadlock is detected under second rather than 50 seconds timeout).

My jobs table (ordinary):

 CREATE TABLE `jobs` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `queue` varchar(191) COLLATE utf8mb4_unicode_ci NOT NULL,
  `payload` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
  `attempts` tinyint(3) unsigned NOT NULL,
  `reserved_at` int(10) unsigned DEFAULT NULL,
  `available_at` int(10) unsigned NOT NULL,
  `created_at` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `jobs_queue_reserved_at_index` (`queue`,`reserved_at`)
) ENGINE=InnoDB AUTO_INCREMENT=639674 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Here are deadlocks I've been experiencing (innotop):

_____________________________________________________________________________________________ Deadlock Transactions _____________________________________________________________________________________________
ID  Timestring           User      Host       Victim  Time   Undo  LStrcts  Query Text                                                                                                                           
 4  2017-12-09 09:54:02  keychest  localhost  Yes     00:00     0        3  select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` as
 5  2017-12-09 09:54:02  keychest  localhost  No      00:00     1        3  delete from `jobs` where `id` = ?                                                                                                    

______________________________________ Deadlock Locks _______________________________________
ID  Waiting  Mode  DB        Table  Index                         Special          Ins Intent
 4        1  X     keychest  jobs   PRIMARY                       rec but not gap           0
 5        0  X     keychest  jobs   PRIMARY                       rec but not gap           0
 5        1  X     keychest  jobs   jobs_queue_reserved_at_index  rec but not gap           0

And another one:

______________________________________________________________________________________________________ Deadlock Transactions ______________________________________________________________________________________________________
ID    Timestring           User      Host       Victim  Time   Undo  LStrcts  Query Text                                                                                                                                           
8767  2017-12-11 18:08:03  keychest  localhost  No      00:00     1        5  update `jobs` set `reserved_at` = ?, `attempts` = ? where `id` = ?                                                                                   
8768  2017-12-11 18:08:03  keychest  localhost  Yes     00:00     0        4  select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for up

_______________________________________ Deadlock Locks ________________________________________
ID    Waiting  Mode  DB        Table  Index                         Special          Ins Intent
8767        0  X     keychest  jobs   PRIMARY                                                 0
8767        1  X     keychest  jobs   jobs_queue_reserved_at_index  gap before rec            1
8768        1  X     keychest  jobs   PRIMARY                       rec but not gap           0

Locks:

_____________________________________________________ InnoDB Locks _____________________________________________________
ID     Type    Waiting  Wait   Active    Mode  DB        Table  Index                         Ins Intent  Special       
 8767  RECORD        1  00:48  02:32:10  X     keychest  jobs   jobs_queue_reserved_at_index           0                
 8766  RECORD        1  00:30  02:27:51  X     keychest  jobs   jobs_queue_reserved_at_index           0                
10578  RECORD        1  00:22     00:22  X     keychest  jobs   jobs_queue_reserved_at_index           1  gap before rec
10579  RECORD        1  00:13     00:13  X     keychest  jobs   jobs_queue_reserved_at_index           1  gap before rec

Processes:

+-------+-----------+-----------------+----------+---------+------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Id    | User      | Host            | db       | Command | Time | State          | Info                                                                                                                                                      |
+-------+-----------+-----------------+----------+---------+------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|  8766 | keychest  | localhost:40468 | keychest | Execute |    4 | Sorting result | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
|  8767 | keychest  | localhost:40472 | keychest | Execute |   22 | Sorting result | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 10581 | keychest  | localhost:36070 | keychest | Execute |   25 | update         | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                |
| 10582 | keychest  | localhost:36072 | keychest | Execute |   16 | update         | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                |
+-------+-----------+-----------------+----------+---------+------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+

Transactions:

+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| waiting_trx_id | waiting_thread | waiting_query                                                                                                                                             | blocking_trx_id | blocking_thread | blocking_query                                                                                                                                            |
+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| 62D36A9        |          10593 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C97B1         |            8766 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62D36A9        |          10593 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9348         |            8767 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62D36A9        |          10593 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62D36A9        |          10593 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62D36A0        |          10592 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C97B1         |            8766 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62D36A0        |          10592 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9348         |            8767 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62D36A0        |          10592 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62D36A0        |          10592 | insert into `jobs` (`queue`, `payload`, `attempts`, `reserved_at`, `available_at`, `created_at`) values (?, ?, ?, ?, ?, ?)                                | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62C97B1        |           8766 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update | 62C9345         |            8768 | NULL                                                                                                                                                      |
| 62C9348        |           8767 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update | 62C97B1         |            8766 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update |
| 62C9348        |           8767 | select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update | 62C9345         |            8768 | NULL                                                                                                                                                      |
+----------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
11 rows in set (0.00 sec)

Lock info:

+---------------------+-------------+-----------+-----------+-------------------+--------------------------------+------------+-----------+----------+------------------------------+
| lock_id             | lock_trx_id | lock_mode | lock_type | lock_table        | lock_index                     | lock_space | lock_page | lock_rec | lock_data                    |
+---------------------+-------------+-----------+-----------+-------------------+--------------------------------+------------+-----------+----------+------------------------------+
| 62C97B1:0:52555:193 | 62C97B1     | X         | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9348:0:52555:193 | 62C9348     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C97B1:0:52555:193 | 62C97B1     | X         | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9348:0:52555:193 | 62C9348     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C97B1:0:52555:193 | 62C97B1     | X         | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
| 62C9345:0:52555:193 | 62C9345     | X,GAP     | RECORD    | `keychest`.`jobs` | `jobs_queue_reserved_at_index` |          0 |     52555 |      193 | 'sshpool', 1513015683, 18744 |
+---------------------+-------------+-----------+-----------+-------------------+--------------------------------+------------+-----------+----------+------------------------------+

@sisve
Copy link
Contributor

sisve commented Dec 12, 2017

It seems you have gap locks on the jobs_queue_reserved_at_index index. This has been removed in the default jobs table in 5.5. Try removing it.

@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

@sisve Thanks for the suggestion, removing the index would definitelly fix the problem.

Lets investigate transaction usage in the DatabaseQueue for a bit - I just want to be sure there is no potential bug.

Function marshalJob() contains the following code:

$job = $this->markJobAsReserved($job);

$this->database->commit();

If deadlock happens (on the index which is removed now) or any other exception is thrown the commit() is not triggered, nor rollback in the laravel DB layer. Connection state (e.g., Connection::transactions) remains in 1 which causes further problems with commits and events, if I get the transaction processing correctly.

Another fact is that there are 2 commits() in DatabaseQueue for the same transaction, each in a different method. One in pop(), another one in marshalJob.

I believe wrapping transactions in the closure is more clean and can avoid potential problems in the future even though the index was removed and obvious deadlock is avoided.

Does it make sense?

@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

@sisve can you pls check out the transaction wrapping in closures reasoning? I still believe the current version of DatabaseQueue is prone to errors in the future (even if it is working fine right now), does dangerous transaction state management and the transaction processing is not well readable (2 scattered commits).

I can remove transactionAttempt variable from the PR so it is 1 by default -> the same functionality like it had before the PR but with cleaner transaction processing.

Thanks for giving it a thought.

@ph4r05 ph4r05 force-pushed the 5.5_database-queue-deadlock-fix branch from 46046e7 to b3ca36b Compare December 12, 2017 11:37
@sisve
Copy link
Contributor

sisve commented Dec 12, 2017

Wrapping the transactional stuff in a call to $this->database->transaction(...) absolutely makes sense. However, in this case it means that the behavior of the DatabaseQueue changes; marshalJob(...) no longer commits the current transaction. I cannot recall that I've seen any third party code that depends on this behavior, but I havn't been actively looking either. This indicates that the code should target the 5.6 branch to reduce the number of broken packages if this were to released in the middle of the 5.5 lifecycle.

However, if we were to rewrite the call to ->transaction() into manual handling with try/catch/finally and keep the call to commit() in marshalJob we could still target the 5.5 branch.

@sisve
Copy link
Contributor

sisve commented Dec 12, 2017

To be clear, it's the behavior change of the marshalJob() that indicates that we should target 5.6, not that I havn't been looking for third party code that calls it. ;)

@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

Aah I didnt realize this could be a breaking change as marshalJob() does not do commit() on its own.

So it probably makes more sense to make it 5.6 ? Or what do you think would be the best?

@sisve
Copy link
Contributor

sisve commented Dec 12, 2017

I would like to see this being merged into the 5.5 branch as I believe this is a important bugfix of how we handle deadlocks (and other problems) when calling DatabaseQueue::pop(). I imagine that an implementation would look like this, without changing marshalJob:

    try {
        $this->database->beginTransaction();

        if ($job = $this->getNextAvailableJob($queue)) {
            return $this->marshalJob($queue, $job);
        }

        $this->database->commit();
    } catch (\Throwable $ex) {
        $this->database->rollBack();
        throw $ex;
    }

@ph4r05 ph4r05 force-pushed the 5.5_database-queue-deadlock-fix branch from b3ca36b to 02fe24a Compare December 12, 2017 12:33
@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

Thanks for reconsidering! (PR updated)

@ph4r05 ph4r05 changed the title [5.5] database queue - transactions wrapped in closures, retry attempts [5.5] database queue - transactions management fix Dec 12, 2017
@deleugpn
Copy link
Contributor

Can someone clarify how this fixes anything? The current code already starts a transaction and commit after. If anything wrong happens, the commit will never be called, which means the database will roll it back anyway?

@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

@deleugpn I will try to clarify:

@sisve pls correct me If I am wrong but if $this->markJobAsReserved($job); fails for some reason (e.g., database killing it) an Exception is thrown. There is no $this->database->rollBack(); call in the pop() call tree for the started transaction which leaves $this->database in the inconsisent state - e.g., number of active transactions.

@deleugpn
Copy link
Contributor

deleugpn commented Dec 12, 2017

DatabaseQueue uses Illuminate\Database\Connection, which uses PDO.

The Database Queue flow without your changes already starts a transaction and tries to commit. if any error happens before $this->database->commit(), an exception will be thrown that will bubble up. Once the framework shuts down and PHP runs the garbage collector, the database connection will be closed. PDO will automatically rollback any transaction uncommitted.

When the script ends or when a connection is about to be closed, if you have an outstanding transaction, PDO will automatically roll it back. This is a safety measure to help avoid inconsistency in the cases where the script terminates unexpectedly--if you didn't explicitly commit the transaction, then it is assumed that something went awry, so the rollback is performed for the safety of your data.

My point is: you're not fixing anything. This already happens. I still think this should be merged because explicit code demonstrating the behavior is easier to understand, but the awareness that I want to raise is that whatever bug you're experiencing now will still happen because a rollback is already happening, unless I'm missing something.

@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

@deleugpn thanks for the comment. I dont think my patch does not fix anything:

I have my DB worker running indefinitely so there is no framework shutdown happening and database object is left in inconsisent state which is a problem for the worker and for the subsequent commits() on the database object in the worker - it still thinks there is a transaction in progress (transaction counter was not reset).

I think it is totally normal to do transactions in try/catch/finally and call rollback() appropriatelly.

@sisve
Copy link
Contributor

sisve commented Dec 12, 2017

@ph4r05 is correct, if an exception is thrown (due to timeouts, deadlocks and whatnot) there will never be any call to commit, nor rollback. The call to DatabaseQueue::pop() is primarily done from Worker::getNextJob(), and there are some exception handling there that handles lost database connections. That wont catch these specific deadlock issues.

@deleugpn Your argument is correct assuming that the framework would shut down; but it doesn't. Laravel catches these in Worker::getNextJob() and never rethrows them.

protected function getNextJob($connection, $queue)
{
try {
foreach (explode(',', $queue) as $queue) {
if (! is_null($job = $connection->pop($queue))) {
return $job;
}
}
} catch (Exception $e) {
$this->exceptions->report($e);
$this->stopWorkerIfLostConnection($e);
} catch (Throwable $e) {
$this->exceptions->report($e = new FatalThrowableError($e));
$this->stopWorkerIfLostConnection($e);
}
}

Since DatabaseQueue::pop() uses a transaction to both fetch the job, and mark it as reserved, I believe that it is also the responsibility of pop() to reset (rollback) the connection if something failed.

It is true that if a transaction is rolled back if it isn't explicitly committed, but that logic needs to know when to decide that it wasn't committed. This is usually done when the connection is closed or disconnected. However, the worker has the same connection for a long time in daemon mode, so the database only believes that we started a transaction, then did something weird, and that it is business as usual. It does not know that we semantically only wanted a small portion of our code in a transaction and then roll it back.

Note that we've (including me) keeps talking about deadlocks, but it is probably only a "lock wait timeout". And these are of course special for mysql, it only rolls back the single statement, not the entire active transaction.

The length of time in seconds an InnoDB transaction waits for a row lock before giving up. The default value is 50 seconds. A transaction that tries to access a row that is locked by another InnoDB transaction waits at most this many seconds for write access to the row before issuing the following error:

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

When a lock wait timeout occurs, the current statement is rolled back (not the entire transaction). To have the entire transaction roll back, start the server with the --innodb_rollback_on_timeout option. See also Section 14.21.4, “InnoDB Error Handling”.

[...]

The lock wait timeout value does not apply to deadlocks when innodb_deadlock_detect is enabled (the default) because InnoDB detects deadlocks immediately and rolls back one of the deadlocked transactions. When innodb_deadlock_detect is disabled, InnoDB relies on innodb_lock_wait_timeout for transaction rollback when a deadlock occurs. See Section 14.5.5.2, “Deadlock Detection and Rollback”.

Source: https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout

Note that innodb_deadlock_detect was introduced in 5.7.15.

@deleugpn
Copy link
Contributor

Thanks @sisve and @ph4r05. The Daemon mode was the missing piece of the puzzle that never closes the connection. Awesome discussion and PR.

return $this->marshalJob($queue, $job);
}
if ($job = $this->getNextAvailableJob($queue)) {
return $this->marshalJob($queue, $job);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return statement will prevent the $this->database->commit(); from ever executing. However, there's also a commit call inside marshalJob(). That could lead to confusion. Can we remove the commit call from marshalJob and avoid this early return with a temporarily variable?

Copy link
Contributor Author

@ph4r05 ph4r05 Dec 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we discussed removing commit with @sisve - he correctly pointed out that it might be a breaking change as the marshalJob() does not commit the transaction anymore (if any thirdparty code happen to rely on that).

This breaking change could be added to 5.6 branch and I would rather then wrap the whole transaction in the closure to make it cleaner.


$this->database->commit();
$this->database->commit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed @sisve comment. If we want to keep the call to commit inside marshalJob, we need to not commit here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

marshalJob only executes if we've found a job, we still need to call commit() here to release any locks acquired by getNextAvailableJob

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. If getNextAvailableJob() returns null the commit in marshalJob is not triggered :/

The commit here terminates the transaction correctly if there are no jobs in DB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True that. A little confusing. Would be nice to get this merged into 5.5 and another PR that clean this up targeting 5.6 (after this gets merged).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deleugpn thanks that would be great I think.

*/
public function deleteReserved($queue, $id)
{
$this->database->beginTransaction();

if ($this->database->table($this->table)->lockForUpdate()->find($id)) {
Copy link
Contributor Author

@ph4r05 ph4r05 Dec 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to preserve the original code and just fix the transaction management. But I am wondering what is the reason to lock a job for update by id when the only action then is delete by id.

I cannot see the reason why not just do delete by id directly in the transaction. Deleting directly would be faster and I think it is functionally equivalent. Just wondering (not needed to change it in this PR). Any thoughts? Maybe in 5.6?

Copy link
Contributor

@deleugpn deleugpn Dec 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Retry setup: 10 sec
  • Worker A picks job 1 and takes 9.9 seconds to finish it.
  • When worker A is about to delete job 1, worker B wants to pop a job and now job 1 is eligible to be picked.

In this scenario, without the lock I think you run an incredibly small chance of Worker B picking the job right before Worker A deletes it.

Copy link
Contributor Author

@ph4r05 ph4r05 Dec 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deleugpn Thanks for the example! Hmm. If the retry is 10 second the job is eligible for re-run (is considered expired) after 10 seconds worker A claimed it even if the job is still running on worker A. This could be a problem of the timing and probably should not happen. Usually its not desired to process the same job twice and concurrently. I guess one should set retry time so that it is greater than any job that can potentially run.

If I get it right, this preemption can happen even with locking - before A execution flow reaches deleteReserved() it is not locked. The lock protects job from being taken by B only on those 3 lines. I am not sure this was intended protection for this. If we remove the find() completely worker A is more likely to delete it in time (no locking, straight delete).

On the other hand, imagine the the worker B took the job that is still being processed by A. A will reach deleteReserved() and deletes the job even though it is still being processed by B - also not very nice but at least the job gets removed after all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK lets wrap this up so we can merge and maybe discuss later for 5.6 patch. Thanks!

@taylorotwell
Copy link
Member

This PR needs a TLDR after all of that discussion.

@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

@taylorotwell TLDR for me:

The bug: rollback() is not called for an opened transaction on $this->database object in DatabaseQueue if an Exception is thrown from any reason (e.g., deadlock, timeout). This leaves database object in inconsistent state (no commit() nor rollback() for started transaction). This is problem for DB workers / daemons.

5.5 solution: fixing transaction management by wrapping tsx in try/catch and adding rollback() to catch block. No breaking changes.

5.6 solution: if 5.5 gets merged, refactor DatabaseQueue a bit so transactions are wrapped in closures - cleaner and safer. Removing nonintuitive commit() in marshallJob() (possibly breaking change).

@ph4r05
Copy link
Contributor Author

ph4r05 commented Dec 12, 2017

@sisve is it required to confirm review of source codes? Thanks! :)

@sisve
Copy link
Contributor

sisve commented Dec 12, 2017

@ph4r05 I'll admit that I'm not really sure how the code reviews works on GitHub. The current changes looks good and all tests are passing. Your TL;DR is correct.

@laurencei
Copy link
Contributor

laurencei commented Dec 14, 2017

@sisve

Note that innodb_deadlock_detect was introduced in 5.7.15.

So what happens if someone is running < 5.7.15 with this code? Will it break anything? Or will it continue as it was before?

edit: looking into it more - it looks like it has no effect. But just double checking to be sure.

@taylorotwell taylorotwell merged commit 02fe24a into laravel:5.5 Dec 14, 2017
@sisve
Copy link
Contributor

sisve commented Dec 14, 2017

I think most people run a version without innodb_deadlock_detect; this means that they have the current initially reported issue - a long lock wait timeout and then an error message. This is the weird scenario where mysql doesn't revert the entire scenario, just the statement.

Those running mysql with innodb_deadlock_detect would probably, if I read the docs right, have an instant failure and a complete transaction rollback. It would work for them as we want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants