-
Notifications
You must be signed in to change notification settings - Fork 11.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[5.5] database queue - transactions management fix #22394
[5.5] database queue - transactions management fix #22394
Conversation
88802fa
to
46046e7
Compare
Slightly relevant: #22369 I would like to repeat my comment from the linked pr;
|
Thanks for the comment! By default there is attempt=1 so it is not "try it again" approach I guess. Wrapping transactions in the closures is more clean and safe coding style on its own IMHO. And moreover even with attempt=1 it actually solves the problem. Workers are not frozen for hours but recover from the deadlock.
Without my patch the workers freeze, transactions are timeouting in 50 seconds (default InnoDB), workers are stuck. WIth the patch deadlock happens but the worker fails normally and is able to recover (deadlock is detected under second rather than 50 seconds timeout). My jobs table (ordinary): CREATE TABLE `jobs` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`queue` varchar(191) COLLATE utf8mb4_unicode_ci NOT NULL,
`payload` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
`attempts` tinyint(3) unsigned NOT NULL,
`reserved_at` int(10) unsigned DEFAULT NULL,
`available_at` int(10) unsigned NOT NULL,
`created_at` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `jobs_queue_reserved_at_index` (`queue`,`reserved_at`)
) ENGINE=InnoDB AUTO_INCREMENT=639674 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci Here are deadlocks I've been experiencing (innotop):
And another one:
Locks:
Processes:
Transactions:
Lock info:
|
It seems you have gap locks on the |
@sisve Thanks for the suggestion, removing the index would definitelly fix the problem. Lets investigate transaction usage in the Function $job = $this->markJobAsReserved($job);
$this->database->commit(); If deadlock happens (on the index which is removed now) or any other exception is thrown the Another fact is that there are 2 I believe wrapping transactions in the closure is more clean and can avoid potential problems in the future even though the index was removed and obvious deadlock is avoided. Does it make sense? |
@sisve can you pls check out the transaction wrapping in closures reasoning? I still believe the current version of I can remove Thanks for giving it a thought. |
46046e7
to
b3ca36b
Compare
Wrapping the transactional stuff in a call to However, if we were to rewrite the call to |
To be clear, it's the behavior change of the marshalJob() that indicates that we should target 5.6, not that I havn't been looking for third party code that calls it. ;) |
Aah I didnt realize this could be a breaking change as So it probably makes more sense to make it 5.6 ? Or what do you think would be the best? |
I would like to see this being merged into the 5.5 branch as I believe this is a important bugfix of how we handle deadlocks (and other problems) when calling DatabaseQueue::pop(). I imagine that an implementation would look like this, without changing marshalJob:
|
b3ca36b
to
02fe24a
Compare
Thanks for reconsidering! (PR updated) |
Can someone clarify how this fixes anything? The current code already starts a transaction and commit after. If anything wrong happens, the commit will never be called, which means the database will roll it back anyway? |
@deleugpn I will try to clarify: @sisve pls correct me If I am wrong but if |
DatabaseQueue uses Illuminate\Database\Connection, which uses PDO. The Database Queue flow without your changes already starts a transaction and tries to commit. if any error happens before
My point is: you're not fixing anything. This already happens. I still think this should be merged because explicit code demonstrating the behavior is easier to understand, but the awareness that I want to raise is that whatever bug you're experiencing now will still happen because a rollback is already happening, unless I'm missing something. |
@deleugpn thanks for the comment. I dont think my patch does not fix anything: I have my DB worker running indefinitely so there is no framework shutdown happening and database object is left in inconsisent state which is a problem for the worker and for the subsequent commits() on the database object in the worker - it still thinks there is a transaction in progress (transaction counter was not reset). I think it is totally normal to do transactions in try/catch/finally and call rollback() appropriatelly. |
@ph4r05 is correct, if an exception is thrown (due to timeouts, deadlocks and whatnot) there will never be any call to commit, nor rollback. The call to DatabaseQueue::pop() is primarily done from Worker::getNextJob(), and there are some exception handling there that handles lost database connections. That wont catch these specific deadlock issues. @deleugpn Your argument is correct assuming that the framework would shut down; but it doesn't. Laravel catches these in Worker::getNextJob() and never rethrows them. framework/src/Illuminate/Queue/Worker.php Lines 240 to 257 in 242d438
Since DatabaseQueue::pop() uses a transaction to both fetch the job, and mark it as reserved, I believe that it is also the responsibility of pop() to reset (rollback) the connection if something failed. It is true that if a transaction is rolled back if it isn't explicitly committed, but that logic needs to know when to decide that it wasn't committed. This is usually done when the connection is closed or disconnected. However, the worker has the same connection for a long time in daemon mode, so the database only believes that we started a transaction, then did something weird, and that it is business as usual. It does not know that we semantically only wanted a small portion of our code in a transaction and then roll it back. Note that we've (including me) keeps talking about deadlocks, but it is probably only a "lock wait timeout". And these are of course special for mysql, it only rolls back the single statement, not the entire active transaction.
Source: https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout Note that innodb_deadlock_detect was introduced in 5.7.15. |
return $this->marshalJob($queue, $job); | ||
} | ||
if ($job = $this->getNextAvailableJob($queue)) { | ||
return $this->marshalJob($queue, $job); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return
statement will prevent the $this->database->commit();
from ever executing. However, there's also a commit
call inside marshalJob()
. That could lead to confusion. Can we remove the commit call from marshalJob
and avoid this early return with a temporarily variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we discussed removing commit
with @sisve - he correctly pointed out that it might be a breaking change as the marshalJob()
does not commit the transaction anymore (if any thirdparty code happen to rely on that).
This breaking change could be added to 5.6 branch and I would rather then wrap the whole transaction in the closure to make it cleaner.
|
||
$this->database->commit(); | ||
$this->database->commit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed @sisve comment. If we want to keep the call to commit
inside marshalJob
, we need to not commit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
marshalJob only executes if we've found a job, we still need to call commit() here to release any locks acquired by getNextAvailableJob
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. If getNextAvailableJob()
returns null the commit in marshalJob
is not triggered :/
The commit here terminates the transaction correctly if there are no jobs in DB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that. A little confusing. Would be nice to get this merged into 5.5 and another PR that clean this up targeting 5.6 (after this gets merged).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deleugpn thanks that would be great I think.
*/ | ||
public function deleteReserved($queue, $id) | ||
{ | ||
$this->database->beginTransaction(); | ||
|
||
if ($this->database->table($this->table)->lockForUpdate()->find($id)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've decided to preserve the original code and just fix the transaction management. But I am wondering what is the reason to lock a job for update by id
when the only action then is delete by id
.
I cannot see the reason why not just do delete by id
directly in the transaction. Deleting directly would be faster and I think it is functionally equivalent. Just wondering (not needed to change it in this PR). Any thoughts? Maybe in 5.6?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Retry setup: 10 sec
- Worker A picks job 1 and takes 9.9 seconds to finish it.
- When worker A is about to delete job 1, worker B wants to pop a job and now job 1 is eligible to be picked.
In this scenario, without the lock I think you run an incredibly small chance of Worker B picking the job right before Worker A deletes it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@deleugpn Thanks for the example! Hmm. If the retry is 10 second the job is eligible for re-run (is considered expired) after 10 seconds worker A claimed it even if the job is still running on worker A. This could be a problem of the timing and probably should not happen. Usually its not desired to process the same job twice and concurrently. I guess one should set retry time so that it is greater than any job that can potentially run.
If I get it right, this preemption can happen even with locking - before A execution flow reaches deleteReserved()
it is not locked. The lock protects job from being taken by B only on those 3 lines. I am not sure this was intended protection for this. If we remove the find()
completely worker A is more likely to delete it in time (no locking, straight delete).
On the other hand, imagine the the worker B took the job that is still being processed by A. A will reach deleteReserved()
and deletes the job even though it is still being processed by B - also not very nice but at least the job gets removed after all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK lets wrap this up so we can merge and maybe discuss later for 5.6 patch. Thanks!
This PR needs a TLDR after all of that discussion. |
@taylorotwell TLDR for me: The bug: 5.5 solution: fixing transaction management by wrapping tsx in try/catch and adding 5.6 solution: if 5.5 gets merged, refactor |
@sisve is it required to confirm review of source codes? Thanks! :) |
@ph4r05 I'll admit that I'm not really sure how the code reviews works on GitHub. The current changes looks good and all tests are passing. Your TL;DR is correct. |
So what happens if someone is running < 5.7.15 with this code? Will it break anything? Or will it continue as it was before? edit: looking into it more - it looks like it has no effect. But just double checking to be sure. |
I think most people run a version without innodb_deadlock_detect; this means that they have the current initially reported issue - a long lock wait timeout and then an error message. This is the weird scenario where mysql doesn't revert the entire scenario, just the statement. Those running mysql with innodb_deadlock_detect would probably, if I read the docs right, have an instant failure and a complete transaction rollback. It would work for them as we want. |
EDIT, TLDR:
The bug:
rollback()
is not called for an opened transaction on$this->database
object inDatabaseQueue
if an Exception is thrown from any reason (e.g., deadlock, timeout). This leaves database object in inconsistent state (nocommit()
norrollback()
for started transaction). This is problem for DB workers / daemons.5.5 solution: fixing transaction management by wrapping tsx in try/catch and adding
rollback()
to catch block. No breaking changes.5.6 solution (potential): if 5.5 gets merged, refactor
DatabaseQueue
a bit so transactions are wrapped in closures - cleaner and safer. Removing nonintuitivecommit()
inmarshallJob()
(possibly breaking change).