Improve aborted transactions cleanup #5669

spolitov · 2020-09-14T12:36:46Z

No description provided.

Summary: When a transaction is aborted, its intents should be cleaned up from the participating tablets, and the transaction itself should be unloaded from memory. This diff fixes several scenarios in which transactions were not cleaned up:
1) Added a cleanup cache to the transaction participant, so that it can clean up a transaction when a cleanup request is received before the transaction has been replicated to this node.
2) Fixed the state check for cleanup when the transaction heartbeat failed.
3) Clean up a transaction that failed to commit.
4) Attempt to clean up tablets that were not marked as having metadata, because when a child transaction fails we could have written intents to them while reporting the overall operation as failed.
5) Clean up tablets involved in a child transaction when it fails.

Test Plan: ybd --gtest_filter CqlIndexTest.TxnCleanup
Reviewers: timur, mikhail
Reviewed By: mikhail
Subscribers: ybase, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D9358
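Item 1 above (the cleanup cache) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the diff: `CleanupCache`, `Participant`, the string-based `TransactionId`, and the FIFO eviction policy are all hypothetical names and choices introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for a transaction id; the real type is not shown in this issue.
using TransactionId = std::string;

// Bounded FIFO set of transaction ids for which cleanup was recently requested.
// (Assumption: the issue does not say how the real cache is bounded or evicted.)
class CleanupCache {
 public:
  explicit CleanupCache(std::size_t capacity) : capacity_(capacity) {}

  // Remember that cleanup was requested for `id`, evicting the oldest id when full.
  void Insert(const TransactionId& id) {
    if (capacity_ == 0 || !ids_.insert(id).second) return;
    order_.push_back(id);
    if (order_.size() > capacity_) {
      ids_.erase(order_.front());
      order_.pop_front();
    }
    assert(order_.size() <= capacity_);
  }

  bool Contains(const TransactionId& id) const { return ids_.count(id) != 0; }

 private:
  std::size_t capacity_;
  std::deque<TransactionId> order_;
  std::unordered_set<TransactionId> ids_;
};

// Hypothetical, heavily simplified participant that only tracks loaded transaction ids.
class Participant {
 public:
  explicit Participant(std::size_t cache_capacity) : cleanup_cache_(cache_capacity) {}

  // Cleanup request: unload the transaction if present, and always record the id,
  // so that a later replication of the same transaction is recognized.
  void Cleanup(const TransactionId& id) {
    cleanup_cache_.Insert(id);
    running_.erase(id);
  }

  // Called when a transaction is replicated to this node.
  // Returns false if cleanup was already requested, in which case it is not loaded.
  bool Add(const TransactionId& id) {
    if (cleanup_cache_.Contains(id)) return false;
    running_.insert(id);
    return true;
  }

  bool IsRunning(const TransactionId& id) const { return running_.count(id) != 0; }

 private:
  CleanupCache cleanup_cache_;
  std::unordered_set<TransactionId> running_;
};
```

In this sketch, a cleanup request that arrives first leaves a marker in the cache, and the later `Add` call for the same id is rejected instead of loading a transaction that nobody will clean up again.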

…up aborted ones

Summary: This diff adds a periodic status check of each running transaction to the transaction participant. This is needed to detect transactions that have been aborted and abandoned more proactively. Such cases can happen when the transaction client has crashed, so that there is no one to send a cleanup RPC to the transaction participant. Previously, we would have to wait for a compaction for those transactions' intents to be cleaned up.

The cleanup mechanism works as follows. Every running transaction now has an associated scheduled abort check hybrid time, abort_check_ht, which we set to start time + FLAGS_transaction_abort_check_interval_ms when the transaction starts. We keep resetting it to current time + the same interval FLAGS_transaction_abort_check_interval_ms whenever we receive a response saying the transaction is still pending. As a result, in the normal situation with no network disconnections or slowness, we check the status of each pending transaction once per FLAGS_transaction_abort_check_interval_ms milliseconds on average. In case of slow status request processing, we wait for the previous status request to time out (as per the FLAGS_transaction_abort_check_timeout_ms flag) before scheduling a new status check for the same transaction.

To implement the above polling mechanism efficiently, we use rpc::Poller and rpc::Scheduler to invoke a Poll function every FLAGS_transactions_status_poll_interval_ms milliseconds. This polling interval is much smaller than the per-transaction status check interval FLAGS_transaction_abort_check_timeout_ms. The Poll function uses the new sequential index on abort_check_ht, which is being added to the transactions_ multi-index container in TransactionParticipant, to obtain the set of transactions that are due for a status check at this iteration.

Also, in this diff we extract into a new class, TransactionLoader, the code in TransactionParticipant that loads transaction metadata from the intents RocksDB and large-transaction "apply metadata" from the regular RocksDB into memory.

Test Plan: ybd --gtest_filter CqlIndexTest.TxnPollCleanup
Reviewers: mikhail
Reviewed By: mikhail
Subscribers: bogdan, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D9427
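The abort check scheduling described in this commit message could look roughly like the sketch below. It is a simplified illustration, not the actual TransactionParticipant code. The assumptions are: plain millisecond integers stand in for hybrid time, a `std::set` ordered by check time stands in for the multi-index on abort_check_ht, and `AbortCheckSchedule`, `Poll`, and `OnStatus` are hypothetical names. It also assumes the "wait for the previous request to time out" rule is implemented by pushing the next check to now + timeout when a request is sent.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using TransactionId = std::string;  // hypothetical stand-in for the real id type
using TimeMs = std::int64_t;        // plain milliseconds instead of hybrid time

class AbortCheckSchedule {
 public:
  AbortCheckSchedule(TimeMs check_interval_ms, TimeMs check_timeout_ms)
      : interval_(check_interval_ms), timeout_(check_timeout_ms) {}

  // New transaction: the first check is due at start time + check interval.
  void Add(const TransactionId& id, TimeMs start) { Reschedule(id, start + interval_); }

  // Invoked by the poller. Returns the ids due for a status check at `now` and
  // pushes their next check past the request timeout, so that a slow request
  // is not duplicated before it times out.
  std::vector<TransactionId> Poll(TimeMs now) {
    std::vector<TransactionId> due;
    while (!by_time_.empty() && by_time_.begin()->first <= now) {
      due.push_back(by_time_.begin()->second);
      by_time_.erase(by_time_.begin());
    }
    for (const auto& id : due) Reschedule(id, now + timeout_);
    assert(by_time_.size() == check_time_.size());
    return due;
  }

  // Status response: an aborted transaction is dropped from the schedule,
  // a pending one is checked again one full interval from now.
  void OnStatus(const TransactionId& id, bool aborted, TimeMs now) {
    auto it = check_time_.find(id);
    if (it == check_time_.end()) return;
    if (aborted) {
      by_time_.erase(std::make_pair(it->second, id));
      check_time_.erase(it);
    } else {
      Reschedule(id, now + interval_);
    }
  }

 private:
  // Move `id` to the new check time in both indexes (or insert it if unknown).
  void Reschedule(const TransactionId& id, TimeMs when) {
    auto it = check_time_.find(id);
    if (it != check_time_.end()) {
      by_time_.erase(std::make_pair(it->second, id));
      it->second = when;
    } else {
      check_time_.emplace(id, when);
    }
    by_time_.emplace(when, id);
  }

  TimeMs interval_;
  TimeMs timeout_;
  std::set<std::pair<TimeMs, TransactionId>> by_time_;  // ordered by check time
  std::map<TransactionId, TimeMs> check_time_;          // id -> scheduled check time
};
```

Because the entries are ordered by check time, each `Poll` call only touches the transactions that are actually due, which is what makes a short poll interval affordable in this sketch.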

spolitov added the kind/bug (This issue is a bug) and area/docdb (YugabyteDB core features) labels Sep 14, 2020

spolitov self-assigned this Sep 14, 2020

kmuthukk added the priority/high (High Priority) label Sep 18, 2020

spolitov closed this as completed Oct 7, 2020
