[DocDB] Rollback only the youngest deadlocked transaction needed to make progress #14165

robertsami · 2022-09-23T16:41:48Z

Jira Link: DB-3646
We currently make a best effort to avoid rolling back transactions unnecessarily in the face of deadlock but we do not guarantee that we won't rollback multiple such transactions. If two deadlock detector instances simultaneously discover the same deadlock with probes starting at different waiting transactions, they may concurrently rollback the waiting transaction they manage. We could instead have an approach where we use some deterministic function to choose which transaction to rollback in a deadlock so that distributed actors will make the same decision and we won't rollback more than one deadlocked transaction

Summary: In the exisiting implementation, we return a generic error message when transactions are aborted due to a deadlock. This diff addresses the issue by returning a detailed error info when transactions are aborted due to a deadlock. Introduced a new flag `clear_deadlocked_txns_info_older_than_seconds`. The deadlock detector now stores transactions aborted due to a deadlock, and we prune the list every `transaction_deadlock_detection_interval_usec`. The info gets removed if it is older than `clear_deadlocked_txns_info_older_than_seconds`. The transaction coordinator now checks with the deadlock detector before reporting `ABORTED` for a txn that it is unable to find. If the txn is found at the deadlock detector, `GetTransactionStatusResponsePB` resp is populated with an newly added field `AppStatusPB reason`, which contains detailed info on the deadlock error. Deadlocked transactions waiting in the wait-queue are aborted in 2 ways: 1. When the txn participant isn't able to find the txn, we directly poll the status tablet for the waiter's status. 2. The txn participant responds with the status it is last aware of (in this case, `ABORTED`). Since both these codepaths use `GetTransactionStatusResponsePB`, we explicitly check if `reason` is populated and return back the detailed error, if any. Note: In some cases, it is still possible that we return a generic error. When a waiter txn is notified that all of its blockers have been aborted, it re-runs conflict resolution. When `PrepareMetadata` is executed, the participant returns a generic error message if the txn is not found. This can happen as the waiter txn too could been aborted as part of the deadlock, but that signal got delayed. If we deterministically abort just one txn when we find a deadlock (#14165), it should probably solve this issue. Jira: DB-3597 Test Plan: Reverting back the newly added test because of another issue #18100 Steps to manually test the feature: Create a cluster using ``` ./bin/yb-ctl create --rf=3 --data_dir ~/yugabyte-data --tserver_flags 'ysql_num_shards_per_tserver=1,enable_wait_queues=true,enable_deadlock_detection=true,enable_intentsdb_page=true' ``` Create a table using ``` create table test (k int primary key, v int); insert into test select generate_series(1,11), 0; ``` Spawn transactions and create deadlocks. Should see error similar to the below ``` ERROR: Transaction 6640330e-d7b6-4c6e-8fec-8d8e2b3311c3 aborted due to deadlock. 6640330e-d7b6-4c6e-8fec-8d8e2b3311c3 -> 12673b67-3ccb-4a90-8a3c-7ed72d92b8c8 -> 6640330e-d7b6-4c6e-8fec-8d8e2b3311c3 : 40P01 ``` Reviewers: rsami, pjain, sergei Reviewed By: rsami Subscribers: esheng, ybase, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D26099

robertsami · 2024-01-17T17:05:03Z

workloads which update tables with multiple indexes will run into deadlock frequently without this. we should add a test with a table with six indexes, and parallel updates to that table

cc @pkj415 who is adding a similar test

rthallamko3 · 2024-01-31T20:12:55Z

@Karvy-yb , Do your tests cover this scenario as well? If not, can we add it as a functional test for the next milestone?

Karvy-yb · 2024-02-01T03:32:42Z

@rthallamko3 We have test case running into this issue today but we do not validate that only 1 transaction should be aborted (that too younger). Added this to the milestone along with multi index test case that @robertsami mentioned above.

…king to transaction coordinator Summary: In preparation for the changes being introduced in D31827, this revision moves recently aborted transaction detection to the transaction coordinator with the following changes: 1. Deadlock detector now always sends an RPC to abort a deadlocked transaction, instead of making a local call. This makes the behavior to be introduced in a follow-up diff consistent between remote and local deadlock detectors aborting transactions. 2. If the coordinator receives an AbortTransaction request with a deadlock_reason populated, it updates the local TransactionState with that data and retains the TransactionState in managed_transactions for FLAGS_clear_deadlocked_txns_info_older_than_heartbeats txn heartbeat intervals. 3. UpdateTransaction and GetStatus requests now check the TransactionState for deadlock_reason instead of making a call to the deadlock detector **Upgrade/Rollback safety:** This revision makes an additive change to AbortTransactionRequestPB and does not pose any Upgrade/Downgrade safety issues. Jira: DB-3646, DB-8053 Test Plan: Jenkins Reviewers: bkolagani, sergei Reviewed By: bkolagani Subscribers: bogdan, ybase Differential Revision: https://phorge.dev.yugabyte.com/D31937

Summary: This revision modifies the deadlock detector to abort only the youngest transaction in a detected deadlock. Previously, each deadlock detector would add simply a txn_id to the deadlock probe response. We now include information about the status tablet and txn age. Upon receiving a probe response, the probe origin previously would abort the transaction for which the probe was started, which was guaranteed to be local. We now abort the youngest transaction, which may require a remote AbortTransaction RPC. **Upgrade/Rollback safety:** This revision adds a new "deadlock" field to ProbeTransactionDeadlockResponsePB which contains a new custom DeadlockedTxnInfo message type. This field is meant to replace the existing deadlocked_txn_ids field. The code in this revision updates both fields and handles both fields to ensure upgrade/downgrade safety without the use of any flags. Jira: DB-3646 Test Plan: ybd --cxx-test pgwrapper_pg_wait_on_conflict-test --gtest_filter PgWaitQueuesTest.DeadlockResolvesYoungestTxn Reviewers: bkolagani, pjain Reviewed By: bkolagani Subscribers: yql, ybase, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D31827

…saction in a deadlock" Summary: This reverts commit ad37035. The above commit caused a regression for contentious serializable workloads. Earlier, when deadlock cycles were detected, we used to abort the transaction that originated the probe. But the above commit changes that logic to abort the youngest transaction instead. The problem with this approach is that when the origin transaction is part of two or more deadlock cycles, our deadlock algorithm ignores the callback for all cycles except the first detected cycle. So with the new logic, we end up in a deadlocked state, but don't detect it, leading to requests being timed out in the wait-queue (and new requests coming in, which trigger the probing algorithm again, which is when another deadlock might be detected). Despite the algorithm ignoring callbacks for subsequent cycle detected for a particular probe, we don't have this problem with the earlier algorithm since it aborts the originating transaction, breaking all possible deadlock cycles. The test `testSerializableReadDelayWrite` rightly catches this issue. **Upgrade/Downgrade safety** - The above commit hasn't been part of any official release yet, so reverting the new proto changes for now. This shouldn't have any implications on upgrades/downgrades among stable releases. Jira: DB-10025, DB-3646 Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgTransactions#testSerializableReadDelayWrite' Reviewers: rthallam, rsami, hsunder Reviewed By: rthallam, rsami Subscribers: ybase, yql, bogdan Differential Revision: https://phorge.dev.yugabyte.com/D32461

… a deadlock Summary: Note: this revision was originally reverted in 9e07d79, as it caused a legitimate bug in serializable workloads which was originally masked with a test change. In particular, we could get into a situation where a single transaction ends up creating multiple deadlocks at the same time, but since we were only aborting the youngest txn with the original change, and only processing one deadlock per probe, if the offending transaction was not the youngest transaction in the first deadlock detected, it would not be aborted, and the second deadlock would be ignored until the coordinator triggered another probe up to 60s later. We are re-introducing the change here with the test change removed to un-mask the original issue, along with the following fixes: 1. When processing probes, do not drop ProbeResponses from old versions which may not have the deadlock field populated. This was a bug in the initial implementation that was unrelated to the revert 2. If a deadlock is detected, and it is determined that a txn other than the originating txn should be aborted, trigger another probe for the originating transaction in case it was involved in another deadlock which would be later discovered (and ignored) by the original probe 3. In IsAnySubtxnActive, if the transaction is aborted, return false. This was a bug in the original implementation as well Original Summary: This revision modifies the deadlock detector to abort only the youngest transaction in a detected deadlock. Previously, each deadlock detector would add simply a txn_id to the deadlock probe response. We now include information about the status tablet and txn age. Upon receiving a probe response, the probe origin previously would abort the transaction for which the probe was started, which was guaranteed to be local. We now abort the youngest transaction, which may require a remote AbortTransaction RPC. **Upgrade/Rollback safety:** This revision adds a new "deadlock" field to ProbeTransactionDeadlockResponsePB which contains a new custom DeadlockedTxnInfo message type. This field is meant to replace the existing deadlocked_txn_ids field. The code in this revision updates both fields and handles both fields to ensure upgrade/downgrade safety without the use of any flags. Jira: DB-3646 Test Plan: ybd --cxx-test pgwrapper_pg_wait_on_conflict-test --gtest_filter PgWaitQueuesTest.DeadlockResolvesYoungestTxn Reviewers: bkolagani, pjain Reviewed By: bkolagani Subscribers: bogdan, ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D32542

…ransaction tracking to transaction coordinator Summary: Original commit: 916da9e / D31937 In preparation for the changes being introduced in D31827, this revision moves recently aborted transaction detection to the transaction coordinator with the following changes: 1. Deadlock detector now always sends an RPC to abort a deadlocked transaction, instead of making a local call. This makes the behavior to be introduced in a follow-up diff consistent between remote and local deadlock detectors aborting transactions. 2. If the coordinator receives an AbortTransaction request with a deadlock_reason populated, it updates the local TransactionState with that data and retains the TransactionState in managed_transactions for FLAGS_clear_deadlocked_txns_info_older_than_heartbeats txn heartbeat intervals. 3. UpdateTransaction and GetStatus requests now check the TransactionState for deadlock_reason instead of making a call to the deadlock detector **Upgrade/Rollback safety:** This revision makes an additive change to AbortTransactionRequestPB and does not pose any Upgrade/Downgrade safety issues. Jira: DB-3646, DB-8053 Test Plan: Jenkins Reviewers: bkolagani, sergei Reviewed By: bkolagani Subscribers: ybase, bogdan Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D33390

…st transaction in a deadlock Summary: Original commit: e8e7394 / D32542 Note: this revision was originally reverted in 9e07d79, as it caused a legitimate bug in serializable workloads which was originally masked with a test change. In particular, we could get into a situation where a single transaction ends up creating multiple deadlocks at the same time, but since we were only aborting the youngest txn with the original change, and only processing one deadlock per probe, if the offending transaction was not the youngest transaction in the first deadlock detected, it would not be aborted, and the second deadlock would be ignored until the coordinator triggered another probe up to 60s later. We are re-introducing the change here with the test change removed to un-mask the original issue, along with the following fixes: 1. When processing probes, do not drop ProbeResponses from old versions which may not have the deadlock field populated. This was a bug in the initial implementation that was unrelated to the revert 2. If a deadlock is detected, and it is determined that a txn other than the originating txn should be aborted, trigger another probe for the originating transaction in case it was involved in another deadlock which would be later discovered (and ignored) by the original probe 3. In IsAnySubtxnActive, if the transaction is aborted, return false. This was a bug in the original implementation as well Original Summary: Original commit: e8e7394 / D32542 This revision modifies the deadlock detector to abort only the youngest transaction in a detected deadlock. Previously, each deadlock detector would add simply a txn_id to the deadlock probe response. We now include information about the status tablet and txn age. Upon receiving a probe response, the probe origin previously would abort the transaction for which the probe was started, which was guaranteed to be local. We now abort the youngest transaction, which may require a remote AbortTransaction RPC. **Upgrade/Rollback safety:** This revision adds a new "deadlock" field to ProbeTransactionDeadlockResponsePB which contains a new custom DeadlockedTxnInfo message type. This field is meant to replace the existing deadlocked_txn_ids field. The code in this revision updates both fields and handles both fields to ensure upgrade/downgrade safety without the use of any flags. Jira: DB-3646 Test Plan: ybd --cxx-test pgwrapper_pg_wait_on_conflict-test --gtest_filter PgWaitQueuesTest.DeadlockResolvesYoungestTxn Reviewers: bkolagani, pjain Reviewed By: bkolagani Subscribers: yql, ybase, bogdan Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D33388

rthallamko3 · 2024-04-03T17:54:03Z

Pending 2.20 backport.

…ter-merge Merge YB master commit 9e07d79 titled [#21058] DocDB: Revert "[#14165] DocDB: Rollback only the newest transaction in a deadlock" and committed 2024-02-15T17:28:56-06:00 into YB pg15. YB pg15 initial merge is 55782d5. - jenkins_jobs.yml: YB master 9f91eeb changes clang16 to clang17. pg15 branch has multiple modifications to this file. Apply both. - yb_uniqkeys.c: yb_is_const_clause_for_distinct_pushdown: YB master 9161aec and YB pg15 40e68e2 touch the same code. Take pg15's version, as suggested by Patnaik. - slot.c: ReplicationSlotCreate: YB master aa3528e adds extra parameter CRSSnapshotAction yb_snapshot_action; upstream PG 19890a064ebf53dedcefed0d8339ed3d449b06e6 adds extra parameter two_phase. Apply both. Maybe two_phase should be passed down to YBCCreateReplicationSlot. - slotfuncs.c: upstream PG 9f06d79ef831ffa333f908f6d3debdb654292414 moves two ReplicationSlotCreate calls into create_physical_replication_slot and create_logical_replication_slot. YB master aa3528e passes CRS_NOEXPORT_SNAPSHOT to ReplicationSlotCreate. Move that and any comments. Note that create_logical_replication_slot and create_physical_replication_slot are also additionally called by copy_replication_slot, and hopefully CRS_NOEXPORT_SNAPSHOT makes sense for that case. - walsender.c: CreateReplicationSlot: (same as slot.c) - elog.c: - function declarations: YB pg15 85f8a82 moves declarations up, so apply the changes of YB master 2d0bd35 there. - yb_write_status_to_server_log: YB pg15 85f8a82 adds call of yb_message_from_status_data that gets renamed to yb_format_and_append by YB master 2d0bd35. - elog_start, elog_finish: upstream PG 17a28b03645e27d73bf69a95d7569b61e58f06eb deletes these functions, so drop changes to elog_finish from YB master 2d0bd35. - EVALUATE_MESSAGE: YB master 2d0bd35 deletes yb_debug_report_error_stacktrace if condition whereas upstream PG d6c55de1f99a9028540516316b95321a7b12a540 removes pfree(fmtbuf). Do both deletions. - Check that YB_EVALUATE_MESSAGE_FROM_STATUS is up-to-date with EVALUATE_MESSAGE: it is now closer due to PG d6c55de1f99a9028540516316b95321a7b12a540 removing formatting code that was never present in YB_EVALUATE_MESSAGE_FROM_STATUS. Update the comment for YB_EVALUATE_MESSAGE_FROM_STATUS accordingly. - Check that yb_additional_errmsg is called wherever needed as explained in the comment. The only new candidate function between PG 11.2 and PG 15.2 is errhint_plural, and that should not get yb_additional_errmsg since it's a hint. - pgbench.c: - includes: upstream PG dddf4cdc3300073ec04b2c3e482a4c1fa4b8191b moves pgbench.h include higher. Adjacent conflict with YB master 6a009b1 adding ysql_bench_metrics_handler.h include. - variable declarations: upstream PG 0e39a608ed5545cc6b9d538ac937c3c1ee8cdc36 adds pg_time_usec_t epoch_shift in the same location YB master 6a009b1 adds YsqlBenchMetricEntry *ysql_bench_metric_entry = NULL. Take both. - usage: YB pg15 initial merge moves --batch-size, introduced by YB master 62fd877, down into "Common options". This is an adjacent conflict with YB master 6a009b1 adding "Prometheus metrics options". - main: - old YB master 62fd877 adds batch-size, old YB master af25ba5 adds max-tries, incoming YB master 6a009b1 adds yb-metrics-bind-address and yb-metrics-bind-port. YB pg15 initial merge preserves max-tries but deletes batch-size (this might be because max-tries was an import that originated from upstream PG while batch-size was not). - It also handles the max-tries case properly in getopt_long while loop but omits the batch-size case. Ignore batch-size for now and try to faithfully preserve the yb-metrics-* options. This requires renumbering the cases. - Upstream PG 547f04e7348b6ed992bd4a197d39661fe7c25097 changes a line to "conn_total_duration = 0" where YB master 6a009b1 adjacently adds "Start metrics webserver". Apply both. - threadRun: YB master 6a009b1 adds "if (ysql_bench_metric_entry)", but upstream PG 9f75e3772350fb66f20a3d7f33bc94f30300d7eb moves all that code into a new function printProgressReport. Move the changes there. It is nontrivial to find where it belongs: after "fprintf(stderr," because that is the original line that was changed by old YB master af25ba5. It appears the changes of that old YB master af25ba5 have not yet been translated to the new location printProgressReport, but ignore this issue for now. - libpq-be.h: YB master 1b784fe adds yb_is_ssl_enabled_in_logical_conn while upstream PG adds authn_id in the same place. Apply both. - slot.h: (same as slot.c) - pg15_tests/test_D31368.sh, pg15_tests/passing_tests.tsv: thanks to YB master e62fdc8, TestPgRegressProc now passes, so move it to passing_tests.tsv.

…bbc01c4' into pg15 Summary: Merge YB master commit 9e07d79 titled [#21058] DocDB: Revert "[#14165] DocDB: Rollback only the newest transaction in a deadlock" and committed 2024-02-15T17:28:56-06:00 into YB pg15. YB pg15 initial merge is 55782d5. - jenkins_jobs.yml: YB master 9f91eeb changes clang16 to clang17. pg15 branch has multiple modifications to this file. Apply both. - yb_uniqkeys.c: yb_is_const_clause_for_distinct_pushdown: YB master 9161aec and YB pg15 40e68e2 touch the same code. Take pg15's version, as suggested by Patnaik. - slot.c: ReplicationSlotCreate: YB master aa3528e adds extra parameter CRSSnapshotAction yb_snapshot_action; upstream PG 19890a064ebf53dedcefed0d8339ed3d449b06e6 adds extra parameter two_phase. Apply both. Maybe two_phase should be passed down to YBCCreateReplicationSlot. - slotfuncs.c: upstream PG 9f06d79ef831ffa333f908f6d3debdb654292414 moves two ReplicationSlotCreate calls into create_physical_replication_slot and create_logical_replication_slot. YB master aa3528e passes CRS_NOEXPORT_SNAPSHOT to ReplicationSlotCreate. Move that and any comments. Note that create_logical_replication_slot and create_physical_replication_slot are also additionally called by copy_replication_slot, and hopefully CRS_NOEXPORT_SNAPSHOT makes sense for that case. - walsender.c: CreateReplicationSlot: (same as slot.c) - elog.c: - function declarations: YB pg15 85f8a82 moves declarations up, so apply the changes of YB master 2d0bd35 there. - yb_write_status_to_server_log: YB pg15 85f8a82 adds call of yb_message_from_status_data that gets renamed to yb_format_and_append by YB master 2d0bd35. - elog_start, elog_finish: upstream PG 17a28b03645e27d73bf69a95d7569b61e58f06eb deletes these functions, so drop changes to elog_finish from YB master 2d0bd35. - EVALUATE_MESSAGE: YB master 2d0bd35 deletes yb_debug_report_error_stacktrace if condition whereas upstream PG d6c55de1f99a9028540516316b95321a7b12a540 removes pfree(fmtbuf). Do both deletions. - Check that YB_EVALUATE_MESSAGE_FROM_STATUS is up-to-date with EVALUATE_MESSAGE: it is now closer due to PG d6c55de1f99a9028540516316b95321a7b12a540 removing formatting code that was never present in YB_EVALUATE_MESSAGE_FROM_STATUS. Update the comment for YB_EVALUATE_MESSAGE_FROM_STATUS accordingly. - Check that yb_additional_errmsg is called wherever needed as explained in the comment. The only new candidate function between PG 11.2 and PG 15.2 is errhint_plural, and that should not get yb_additional_errmsg since it's a hint. - pgbench.c: - includes: upstream PG dddf4cdc3300073ec04b2c3e482a4c1fa4b8191b moves pgbench.h include higher. Adjacent conflict with YB master 6a009b1 adding ysql_bench_metrics_handler.h include. - variable declarations: upstream PG 0e39a608ed5545cc6b9d538ac937c3c1ee8cdc36 adds pg_time_usec_t epoch_shift in the same location YB master 6a009b1 adds YsqlBenchMetricEntry *ysql_bench_metric_entry = NULL. Take both. - usage: YB pg15 initial merge moves --batch-size, introduced by YB master 62fd877, down into "Common options". This is an adjacent conflict with YB master 6a009b1 adding "Prometheus metrics options". - main: - old YB master 62fd877 adds batch-size, old YB master af25ba5 adds max-tries, incoming YB master 6a009b1 adds yb-metrics-bind-address and yb-metrics-bind-port. YB pg15 initial merge preserves max-tries but deletes batch-size (this might be because max-tries was an import that originated from upstream PG while batch-size was not). - It also handles the max-tries case properly in getopt_long while loop but omits the batch-size case. Ignore batch-size for now and try to faithfully preserve the yb-metrics-* options. This requires renumbering the cases. - Upstream PG 547f04e7348b6ed992bd4a197d39661fe7c25097 changes a line to "conn_total_duration = 0" where YB master 6a009b1 adjacently adds "Start metrics webserver". Apply both. - threadRun: YB master 6a009b1 adds "if (ysql_bench_metric_entry)", but upstream PG 9f75e3772350fb66f20a3d7f33bc94f30300d7eb moves all that code into a new function printProgressReport. Move the changes there. It is nontrivial to find where it belongs: after "fprintf(stderr," because that is the original line that was changed by old YB master af25ba5. It appears the changes of that old YB master af25ba5 have not yet been translated to the new location printProgressReport, but ignore this issue for now. - libpq-be.h: YB master 1b784fe adds yb_is_ssl_enabled_in_logical_conn while upstream PG adds authn_id in the same place. Apply both. - slot.h: (same as slot.c) - pg15_tests/test_D31368.sh, pg15_tests/passing_tests.tsv: thanks to YB master e62fdc8, TestPgRegressProc now passes, so move it to passing_tests.tsv. Test Plan: On Almalinux 8: #!/usr/bin/env bash set -eu ./yb_build.sh fastdebug --gcc11 for _ in {1..50}; do grep RegressProc pg15_tests/passing_tests.tsv; done | pg15_tests/run_tests.sh # org.yb.pgsql.TestAlterTableWithConcurrentTxn#testDmlTransactionAfterAlterOnCurrentResourceWithCachedMetadata # might be flaky? pg15_tests/run_tests.sh Tests from 1fc7b74: ./yb_build.sh fastdebug --gcc11 --enable_ysql_conn_mgr_test --java-test org.yb.pgsql.TestLDAPAuth --sj # TestYbRoleProfile without ysql_conn_mgr already fails. #./yb_build.sh fastdebug --gcc11 --enable_ysql_conn_mgr_test --java-test org.yb.pgsql.TestYbRoleProfile --sj Manual test from 2d0bd35: #!/usr/bin/env bash set -eux ./yb_build.sh fastdebug --gcc11 bin/yb-ctl destroy bin/yb-ctl create --ysql_pg_conf_csv yb_debug_report_error_stacktrace=true sleep 5 # errmsg, errdetail, errhint count=$(bin/ysqlsh -c 'create table t (i int)' -c 'alter table t add primary key (i)' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # yb_errmsg_from_status count=$(bin/ysqlsh -c 'insert into t values (1), (1)' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # errmsg_internal count=$(bin/ysqlsh -c 'begin' -c 'set transaction_deferrable = false' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # elog_finish count=$(bin/ysqlsh -c 'create temp table tmp (i int)' -c "select yb_get_range_split_clause('tmp'::regclass)" |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # Workaround for #20820 (comment) sleep 2 # errmsg_plural count=$(bin/ysqlsh -c 'create table r (i int primary key)' -c 'create table f1 (i int references r)' -c 'create table f2 (i int references r)' -c 'drop table r cascade' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] Running other test plan tests, got 1 2024-04-09T15:50:35-07:00 JAVA org.yb.pgsql.TestPgRegressDistinctPushdown 0 2024-04-09T15:50:57-07:00 JAVA org.yb.ysqlconnmgr.TestPreparedStatements#testPreparedQuery 0 2024-04-09T15:51:05-07:00 integration-tests_xcluster_ddl_queue_handler-test XClusterDDLQueueHandlerMockedTest.VerifySafeTimes 0 2024-04-09T15:51:12-07:00 integration-tests_xcluster_ddl_queue_handler-test XClusterDDLQueueHandlerMockedTest.VerifyBasicJsonParsing 0 2024-04-09T15:51:19-07:00 integration-tests_xcluster_ddl_queue_handler-test XClusterDDLQueueHandlerMockedTest.VerifyAlreadyProcessed 0 2024-04-09T15:51:27-07:00 integration-tests_xcluster_safe_time_service-test XClusterSafeTimeServiceTest.ComputeSafeTime 0 2024-04-09T15:51:35-07:00 integration-tests_xcluster_safe_time_service-test XClusterSafeTimeServiceTest.ComputeSafeTimeWithFilters 0 2024-04-09T15:51:43-07:00 integration-tests_xcluster_safe_time_service-test XClusterSafeTimeServiceTest.ComputeSafeTimeWithFiltersSingleTablet 0 2024-04-09T15:52:14-07:00 pgwrapper_pg_get_lock_status-test PgGetLockStatusTest.TestWaiterLockContainingColumnId 0 2024-04-09T15:52:27-07:00 tserver_clone_operation-test CloneOperationTest.Hardlink 8 2024-04-09T15:52:40-07:00 tserver_clone_operation-test CloneOperationTest.CloneAndRestore 8 2024-04-09T15:52:52-07:00 tserver_clone_operation-test CloneOperationTest.CloneOnSameDrive 0 2024-04-09T15:53:16-07:00 integration-tests_clone-tablet-itest CloneTabletExternalItest/CloneTabletExternalCrashItest.CrashDuringApply/0 0 2024-04-09T15:54:29-07:00 integration-tests_xcluster_ysql_index-test XClusterDbScopedYsqlIndexTest.CreateIndex 0 2024-04-09T15:55:46-07:00 integration-tests_xcluster_ysql_index-test XClusterDbScopedYsqlIndexTest.CreateIndexWithWorkload 0 2024-04-09T15:57:55-07:00 integration-tests_xcluster_db_scoped-test XClusterDBScopedTest.CreateTable 0 2024-04-09T15:58:04-07:00 integration-tests_xcluster_ddl_queue_handler-test XClusterDDLQueueHandlerMockedTest.SkipScanWhenNoNewRecords 0 2024-04-09T15:59:51-07:00 JAVA org.yb.pgsql.TestPgWriteRestart 1 2024-04-09T16:31:44-07:00 JAVA org.yb.pgsql.TestPgEncryption 0 2024-04-09T16:32:52-07:00 JAVA org.yb.pgsql.TestPgDdlConcurrency#testModifiedTableWrite 0 2024-04-09T16:38:02-07:00 JAVA org.yb.pgsql.ConcurrentTablespaceTest - TestPgRegressDistinctPushdown fails because EXPLAIN differences - The two Clone tests fail with "Bad status: Not found (yb/tserver/ts_tablet_manager.cc:2256): Tablet target_tablet_id not found" - TestPgEncryption fails with "Connection to 127.0.0.182:27726 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections." Jenkins: rebase: pg15 Reviewers: aagrawal, tfoucher, xCluster, hsunder Reviewed By: tfoucher Subscribers: patnaik.balivada, ybase, ycdcxcluster, yql Differential Revision: https://phorge.dev.yugabyte.com/D33933

… transaction in a deadlock Summary: Original commit: e8e7394 / D32542 Note: this revision was originally reverted in 9e07d79, as it caused a legitimate bug in serializable workloads which was originally masked with a test change. In particular, we could get into a situation where a single transaction ends up creating multiple deadlocks at the same time, but since we were only aborting the youngest txn with the original change, and only processing one deadlock per probe, if the offending transaction was not the youngest transaction in the first deadlock detected, it would not be aborted, and the second deadlock would be ignored until the coordinator triggered another probe up to 60s later. We are re-introducing the change here with the test change removed to un-mask the original issue, along with the following fixes: 1. When processing probes, do not drop ProbeResponses from old versions which may not have the deadlock field populated. This was a bug in the initial implementation that was unrelated to the revert 2. If a deadlock is detected, and it is determined that a txn other than the originating txn should be aborted, trigger another probe for the originating transaction in case it was involved in another deadlock which would be later discovered (and ignored) by the original probe 3. In IsAnySubtxnActive, if the transaction is aborted, return false. This was a bug in the original implementation as well Original Summary: Original commit: e8e7394 / D32542 This revision modifies the deadlock detector to abort only the youngest transaction in a detected deadlock. Previously, each deadlock detector would add simply a txn_id to the deadlock probe response. We now include information about the status tablet and txn age. Upon receiving a probe response, the probe origin previously would abort the transaction for which the probe was started, which was guaranteed to be local. We now abort the youngest transaction, which may require a remote AbortTransaction RPC. **Upgrade/Rollback safety:** This revision adds a new "deadlock" field to ProbeTransactionDeadlockResponsePB which contains a new custom DeadlockedTxnInfo message type. This field is meant to replace the existing deadlocked_txn_ids field. The code in this revision updates both fields and handles both fields to ensure upgrade/downgrade safety without the use of any flags. Jira: DB-3646 Test Plan: ybd --cxx-test pgwrapper_pg_wait_on_conflict-test --gtest_filter PgWaitQueuesTest.DeadlockResolvesYoungestTxn Reviewers: bkolagani, pjain Reviewed By: bkolagani Subscribers: yql, ybase, bogdan Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D33495

robertsami self-assigned this Sep 23, 2022

This was referenced Sep 23, 2022

[YSQL] Support wait-on-conflict concurrency control #5680

Closed

[YSQL][PessimisticLocking] Transaction is being aborted even after deadlock resolution #14134

Closed

rthallamko3 added the area/docdb YugabyteDB core features label Sep 23, 2022

yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue status/awaiting-triage Issue awaiting triage and removed status/awaiting-triage Issue awaiting triage labels Sep 23, 2022

yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Oct 5, 2022

jharveysmith added this to Wait-Queue Based Locking Aug 16, 2023

jharveysmith moved this to Backlog in Wait-Queue Based Locking Aug 16, 2023

robertsami moved this from Backlog to Pending in Wait-Queue Based Locking Aug 21, 2023

robertsami added the 2.20 Backport Required label Jan 17, 2024

robertsami moved this from Pending to In Progress in Wait-Queue Based Locking Jan 19, 2024

robertsami changed the title ~~[dst] Rollback only the minimal set of deadlocked transactions needed to make progress~~ [dst] Rollback only the most recent set of deadlocked transactions needed to make progress Jan 22, 2024

robertsami changed the title ~~[dst] Rollback only the most recent set of deadlocked transactions needed to make progress~~ [dst] Rollback only the youngest deadlocked transaction needed to make progress Jan 22, 2024

rthallamko3 moved this from In Progress to In Review in Wait-Queue Based Locking Jan 23, 2024

rthallamko3 changed the title ~~[dst] Rollback only the youngest deadlocked transaction needed to make progress~~ [DocDB] Rollback only the youngest deadlocked transaction needed to make progress Jan 29, 2024

robertsami mentioned this issue Feb 7, 2024

[DocDB] Use pg_txn_start_us for determining txn start time in deadlock detector #20976

Open

1 task

basavaraj29 moved this from In Review to Backporting in Wait-Queue Based Locking Feb 14, 2024

basavaraj29 moved this from Backporting to In Progress in Wait-Queue Based Locking Feb 15, 2024

yugabyte-ci added kind/bug This issue is a bug and removed kind/enhancement This is an enhancement of an existing feature labels Mar 14, 2024

robertsami moved this from In Progress to In Review in Wait-Queue Based Locking Mar 18, 2024

robertsami moved this from In Review to Backporting in Wait-Queue Based Locking Mar 20, 2024

robertsami added the 2024.1_blocker label Mar 20, 2024

robertsami closed this as completed Apr 11, 2024

github-project-automation bot moved this from Backporting to Done in Wait-Queue Based Locking Apr 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DocDB] Rollback only the youngest deadlocked transaction needed to make progress #14165

[DocDB] Rollback only the youngest deadlocked transaction needed to make progress #14165

robertsami commented Sep 23, 2022 •

edited by jira bot

Loading

robertsami commented Jan 17, 2024 •

edited

Loading

rthallamko3 commented Jan 31, 2024

Karvy-yb commented Feb 1, 2024 •

edited

Loading

rthallamko3 commented Apr 3, 2024

[DocDB] Rollback only the youngest deadlocked transaction needed to make progress #14165

[DocDB] Rollback only the youngest deadlocked transaction needed to make progress #14165

Comments

robertsami commented Sep 23, 2022 • edited by jira bot Loading

robertsami commented Jan 17, 2024 • edited Loading

rthallamko3 commented Jan 31, 2024

Karvy-yb commented Feb 1, 2024 • edited Loading

rthallamko3 commented Apr 3, 2024

robertsami commented Sep 23, 2022 •

edited by jira bot

Loading

robertsami commented Jan 17, 2024 •

edited

Loading

Karvy-yb commented Feb 1, 2024 •

edited

Loading