-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] Check failed: !options.ddl_mode() Restarting a DDL transaction not supported #20820
Comments
On release builds, this returns an error for the create-index statement. Which can be handled by the test accordingly.
|
@amitanandaiyer if you extract your new test out and apply it to a clean latest checkout, can you reproduce the bug? |
Here is a repro on likely-unrelated changes on top of recent master 0214819. Almalinux 8, fastdebug, gcc11 build. bin/yb-ctl destroy
bin/yb-ctl create
while bin/ysqlsh -c 'create table r (i int primary key)' -c 'create table f1 (i int references r)' -c 'create table f2 (i int references r)' -c 'drop table r cascade' -c 'drop table f1' -c 'drop table f2'; do bin/ysqlsh -c 'create temp table tmp (i int)' -c "select yb_get_range_split_clause('tmp'::regclass)"; done The error is #18192
but the tserver logs
Of note is that adding a sleep avoids the issue: while bin/ysqlsh -c 'create table r (i int primary key)' -c 'create table f1 (i int references r)' -c 'create table f2 (i int references r)' -c 'drop table r cascade' -c 'drop table f1' -c 'drop table f2'; do bin/ysqlsh -c 'create temp table tmp (i int)' -c "select yb_get_range_split_clause('tmp'::regclass)"; sleep 2; done |
Summary: Setting GUC yb_debug_report_error_stacktrace is intended to print stack traces on error messages in YSQL to aid in debugging. The implementation has several issues, the main ones being - In an ereport with errmsg and errdetail and errhint, the same stack trace is printed three times. - Some cases are missed, such as errmsg_plural. The old implementation puts logic to add the stack trace inside the EVALUATE_MESSAGE and YB_EVALUATE_STATUS_DATA_MESSAGE (now renamed to YB_EVALUATE_MESSAGE_FROM_STATUS) macros. YB_EVALUATE_STATUS_DATA_MESSAGE did not have the first issue because it took an argument dump_stacks to control whether to avoid adding stack trace. EVALUATE_MESSAGE, however, did not have such logic, and it is called from errmsg, errdetail, and errhint, to name a few. To avoid modifying the PG-owned macro EVALUATE_MESSAGE to take a parameter dump_stacks (because this will be a pain when merging with upstream PG), add the stack trace separately using the existing "appendval" functionality. This is done through new function yb_additional_errmsg. Call this function after EVALUATE_MESSAGE, EVALUATE_MESSAGE_PLURAL, and YB_EVALUATE_MESSAGE_FROM_STATUS invocations that modify edata->message so that detail and hint are clearly unaffected. In doing so, cover a new case errmsg_plural that wasn't previously covered. Secondly, restructure YB_EVALUATE_STATUS_DATA_MESSAGE to be more aligned with PG-owned EVALUATE_MESSAGE: - Rename to YB_EVALUATE_MESSAGE_FROM_STATUS ("_DATA" is not necessary). - Move memory context switching outside the macro. - Implicitly assume fmt variable is in frame. - Take domain, targetfield, appendval, translateit params. - pstrdup buf.data to potentially save memory (see EVALUATE_MESSAGE comment). Do similar alignment for the functions related to this macro: - Rename yb_errmsg_from_status_data to yb_errmsg_from_status (this should align with errmsg). - Rename yb_detail_from_status_data to yb_errdetail_from_status (this should align with errdetail). - Align spacing of comment after return. Also: - Rename yb_message_from_status_data to yb_format_and_append. - Fix typo in a YB comment. Jira: DB-9991 Test Plan: Manual tests that don't need code change: #!/usr/bin/env bash set -eux ./yb_build.sh fastdebug --gcc11 bin/yb-ctl destroy bin/yb-ctl create --ysql_pg_conf_csv yb_debug_report_error_stacktrace=true sleep 5 # errmsg, errdetail, errhint count=$(bin/ysqlsh -c 'create table t (i int)' -c 'alter table t add primary key (i)' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # yb_errmsg_from_status count=$(bin/ysqlsh -c 'insert into t values (1), (1)' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # errmsg_internal count=$(bin/ysqlsh -c 'begin' -c 'set transaction_deferrable = false' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # elog_finish count=$(bin/ysqlsh -c 'create temp table tmp (i int)' -c "select yb_get_range_split_clause('tmp'::regclass)" |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # Workaround for #20820 (comment) sleep 2 # errmsg_plural count=$(bin/ysqlsh -c 'create table r (i int primary key)' -c 'create table f1 (i int references r)' -c 'create table f2 (i int references r)' -c 'drop table r cascade' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] Test yb_errdetail_from_status doesn't give additional stack trace by applying this patch diff --git i/src/yb/yql/pgwrapper/pg_wait_on_conflict-test.cc w/src/yb/yql/pgwrapper/pg_wait_on_conflict-test.cc index 620fdea3632..000b7cb9d70 100644 --- i/src/yb/yql/pgwrapper/pg_wait_on_conflict-test.cc +++ w/src/yb/yql/pgwrapper/pg_wait_on_conflict-test.cc @@ -80,7 +80,8 @@ class PgWaitQueuesTest : public PgMiniTestBase { ANNOTATE_UNPROTECTED_WRITE(FLAGS_enable_wait_queues) = true; ANNOTATE_UNPROTECTED_WRITE(FLAGS_TEST_select_all_status_tablets) = true; ANNOTATE_UNPROTECTED_WRITE(FLAGS_force_single_shard_waiter_retry_ms) = 10000; - ANNOTATE_UNPROTECTED_WRITE(FLAGS_ysql_pg_conf_csv) = MaxQueryLayerRetriesConf(0); + ANNOTATE_UNPROTECTED_WRITE(FLAGS_ysql_pg_conf_csv) = Format("$0,$1", + MaxQueryLayerRetriesConf(0), "yb_debug_report_error_stacktrace=true"); ANNOTATE_UNPROTECTED_WRITE(FLAGS_yb_enable_read_committed_isolation) = true; PgMiniTestBase::SetUp(); } then running ./yb_build.sh fastdebug --gcc11 \ --cxx-test pgwrapper_pg_wait_on_conflict-test \ --gtest_filter PgWaitQueuesTest.MultiTabletFairness and examining logs to see stack trace shows up only for the message, not detail. Note the logs are a bit hard to read because the Status to string seems to print the error twice, so it could appear as if the stack is being duplicated when actually, from the postgres logs, it is not being duplicated. Close: #21017 Reviewers: amartsinchyk, tvesely Reviewed By: tvesely Subscribers: amartsinchyk, yql Differential Revision: https://phorge.dev.yugabyte.com/D32270
Summary: This bug can be reproduced by enabling a specific case in a unit test `VerifyWaitStateEntered/kBackfillIndex_WaitForAFreeSlot`. The relevant code is ``` RSTATUS_DCHECK(!options.ddl_mode(), NotSupported, "Restarting a DDL transaction not supported"); ``` This happens when the PG backend is trying to restart a DDL statement and at tserver side this isn't expected. In debug build it results in a CHECK failure. In this test, PG tries to restart a `CREATE INDEX idx_0 on t (value)` statement. Due to a concurrent write statement and some additional wait-state related logic in this unit test, the CREATE INDEX hits a `Restart read required` error. Then due to the following code, the DDL state is reset: ``` PG_CATCH(); { if (is_ddl) { /* * It is possible that nesting_level has wrong value due to error. * Ddl transaction state should be reset. */ YBResetDdlState(); } PG_RE_THROW(); } PG_END_TRY(); ``` In particular, `YBResetDdlState()` invokes ``` HandleYBStatus(YBCPgClearSeparateDdlTxnMode()); ``` As a result, PG no longer knows that the current statement is a DDL statement, and in function `yb_is_restart_possible` ``` if (IsYBReadCommitted()) { if (YBGetDdlNestingLevel() != 0) { elog(LOG, "READ COMMITTED retry semantics don't support DDLs"); *rc_ignoring_ddl_statement = true; return false; } } ``` Now that `YBGetDdlNestingLevel() == 0` after `YBResetDdlState()`, we fail to hit `READ COMMITTED retry semantics don't support DDLs`, hence there will be the retry of the DDL statement via ``` if (YBCIsRestartReadError(edata->yb_txn_errcode)) { YBCRestartTransaction(); } ``` Then when we retry, we re-detect that `CREATE INDEX` is a DDL statement. However at this time `YBCRestartTransaction()` is already invoked, which has set `need_restart_` to true. I changed RSTATUS_DCHECK to return STATUS when `options.ddl_mode()` is true. In this way we avoid tserver CHECK failure in debug build. Jira: DB-9809 Test Plan: Make the following test change: ``` diff --git a/src/yb/integration-tests/wait_states-itest.cc b/src/yb/integration-tests/wait_states-itest.cc index 08a4f7d6ea..ada826780e 100644 --- a/src/yb/integration-tests/wait_states-itest.cc +++ b/src/yb/integration-tests/wait_states-itest.cc @@ -670,8 +670,7 @@ INSTANTIATE_TEST_SUITE_P( ash::WaitStateCode::kRetryableRequests_SaveToDisk, ash::WaitStateCode::kMVCC_WaitForSafeTime, ash::WaitStateCode::kLockedBatchEntry_Lock, - // TODO(#20820) : Enable once #20820 is fixed - // ash::WaitStateCode::kBackfillIndex_WaitForAFreeSlot, + ash::WaitStateCode::kBackfillIndex_WaitForAFreeSlot, ash::WaitStateCode::kCreatingNewTablet, ash::WaitStateCode::kSaveRaftGroupMetadataToDisk, ash::WaitStateCode::kDumpRunningRpc_WaitOnReactor, ``` then run the test ./yb_build.sh debug --cxx-test integration-tests_wait_states-itest --gtest_filter WaitStateITest/AshTestVerifyOccurrence.VerifyWaitStateEntered/kBackfillIndex_WaitForAFreeSlot Note that the test still fails but the `!options.ddl_mode()` Check failure no longer appears. Reviewers: pjain, amitanand Reviewed By: pjain Subscribers: bogdan, ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D32444
…bbc01c4' into pg15 Summary: Merge YB master commit 9e07d79 titled [#21058] DocDB: Revert "[#14165] DocDB: Rollback only the newest transaction in a deadlock" and committed 2024-02-15T17:28:56-06:00 into YB pg15. YB pg15 initial merge is 55782d5. - jenkins_jobs.yml: YB master 9f91eeb changes clang16 to clang17. pg15 branch has multiple modifications to this file. Apply both. - yb_uniqkeys.c: yb_is_const_clause_for_distinct_pushdown: YB master 9161aec and YB pg15 40e68e2 touch the same code. Take pg15's version, as suggested by Patnaik. - slot.c: ReplicationSlotCreate: YB master aa3528e adds extra parameter CRSSnapshotAction yb_snapshot_action; upstream PG 19890a064ebf53dedcefed0d8339ed3d449b06e6 adds extra parameter two_phase. Apply both. Maybe two_phase should be passed down to YBCCreateReplicationSlot. - slotfuncs.c: upstream PG 9f06d79ef831ffa333f908f6d3debdb654292414 moves two ReplicationSlotCreate calls into create_physical_replication_slot and create_logical_replication_slot. YB master aa3528e passes CRS_NOEXPORT_SNAPSHOT to ReplicationSlotCreate. Move that and any comments. Note that create_logical_replication_slot and create_physical_replication_slot are also additionally called by copy_replication_slot, and hopefully CRS_NOEXPORT_SNAPSHOT makes sense for that case. - walsender.c: CreateReplicationSlot: (same as slot.c) - elog.c: - function declarations: YB pg15 85f8a82 moves declarations up, so apply the changes of YB master 2d0bd35 there. - yb_write_status_to_server_log: YB pg15 85f8a82 adds call of yb_message_from_status_data that gets renamed to yb_format_and_append by YB master 2d0bd35. - elog_start, elog_finish: upstream PG 17a28b03645e27d73bf69a95d7569b61e58f06eb deletes these functions, so drop changes to elog_finish from YB master 2d0bd35. - EVALUATE_MESSAGE: YB master 2d0bd35 deletes yb_debug_report_error_stacktrace if condition whereas upstream PG d6c55de1f99a9028540516316b95321a7b12a540 removes pfree(fmtbuf). Do both deletions. - Check that YB_EVALUATE_MESSAGE_FROM_STATUS is up-to-date with EVALUATE_MESSAGE: it is now closer due to PG d6c55de1f99a9028540516316b95321a7b12a540 removing formatting code that was never present in YB_EVALUATE_MESSAGE_FROM_STATUS. Update the comment for YB_EVALUATE_MESSAGE_FROM_STATUS accordingly. - Check that yb_additional_errmsg is called wherever needed as explained in the comment. The only new candidate function between PG 11.2 and PG 15.2 is errhint_plural, and that should not get yb_additional_errmsg since it's a hint. - pgbench.c: - includes: upstream PG dddf4cdc3300073ec04b2c3e482a4c1fa4b8191b moves pgbench.h include higher. Adjacent conflict with YB master 6a009b1 adding ysql_bench_metrics_handler.h include. - variable declarations: upstream PG 0e39a608ed5545cc6b9d538ac937c3c1ee8cdc36 adds pg_time_usec_t epoch_shift in the same location YB master 6a009b1 adds YsqlBenchMetricEntry *ysql_bench_metric_entry = NULL. Take both. - usage: YB pg15 initial merge moves --batch-size, introduced by YB master 62fd877, down into "Common options". This is an adjacent conflict with YB master 6a009b1 adding "Prometheus metrics options". - main: - old YB master 62fd877 adds batch-size, old YB master af25ba5 adds max-tries, incoming YB master 6a009b1 adds yb-metrics-bind-address and yb-metrics-bind-port. YB pg15 initial merge preserves max-tries but deletes batch-size (this might be because max-tries was an import that originated from upstream PG while batch-size was not). - It also handles the max-tries case properly in getopt_long while loop but omits the batch-size case. Ignore batch-size for now and try to faithfully preserve the yb-metrics-* options. This requires renumbering the cases. - Upstream PG 547f04e7348b6ed992bd4a197d39661fe7c25097 changes a line to "conn_total_duration = 0" where YB master 6a009b1 adjacently adds "Start metrics webserver". Apply both. - threadRun: YB master 6a009b1 adds "if (ysql_bench_metric_entry)", but upstream PG 9f75e3772350fb66f20a3d7f33bc94f30300d7eb moves all that code into a new function printProgressReport. Move the changes there. It is nontrivial to find where it belongs: after "fprintf(stderr," because that is the original line that was changed by old YB master af25ba5. It appears the changes of that old YB master af25ba5 have not yet been translated to the new location printProgressReport, but ignore this issue for now. - libpq-be.h: YB master 1b784fe adds yb_is_ssl_enabled_in_logical_conn while upstream PG adds authn_id in the same place. Apply both. - slot.h: (same as slot.c) - pg15_tests/test_D31368.sh, pg15_tests/passing_tests.tsv: thanks to YB master e62fdc8, TestPgRegressProc now passes, so move it to passing_tests.tsv. Test Plan: On Almalinux 8: #!/usr/bin/env bash set -eu ./yb_build.sh fastdebug --gcc11 for _ in {1..50}; do grep RegressProc pg15_tests/passing_tests.tsv; done | pg15_tests/run_tests.sh # org.yb.pgsql.TestAlterTableWithConcurrentTxn#testDmlTransactionAfterAlterOnCurrentResourceWithCachedMetadata # might be flaky? pg15_tests/run_tests.sh Tests from 1fc7b74: ./yb_build.sh fastdebug --gcc11 --enable_ysql_conn_mgr_test --java-test org.yb.pgsql.TestLDAPAuth --sj # TestYbRoleProfile without ysql_conn_mgr already fails. #./yb_build.sh fastdebug --gcc11 --enable_ysql_conn_mgr_test --java-test org.yb.pgsql.TestYbRoleProfile --sj Manual test from 2d0bd35: #!/usr/bin/env bash set -eux ./yb_build.sh fastdebug --gcc11 bin/yb-ctl destroy bin/yb-ctl create --ysql_pg_conf_csv yb_debug_report_error_stacktrace=true sleep 5 # errmsg, errdetail, errhint count=$(bin/ysqlsh -c 'create table t (i int)' -c 'alter table t add primary key (i)' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # yb_errmsg_from_status count=$(bin/ysqlsh -c 'insert into t values (1), (1)' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # errmsg_internal count=$(bin/ysqlsh -c 'begin' -c 'set transaction_deferrable = false' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # elog_finish count=$(bin/ysqlsh -c 'create temp table tmp (i int)' -c "select yb_get_range_split_clause('tmp'::regclass)" |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] # Workaround for #20820 (comment) sleep 2 # errmsg_plural count=$(bin/ysqlsh -c 'create table r (i int primary key)' -c 'create table f1 (i int references r)' -c 'create table f2 (i int references r)' -c 'drop table r cascade' |& grep -c YBCGetStackTrace) [ "$count" -eq 1 ] Running other test plan tests, got 1 2024-04-09T15:50:35-07:00 JAVA org.yb.pgsql.TestPgRegressDistinctPushdown 0 2024-04-09T15:50:57-07:00 JAVA org.yb.ysqlconnmgr.TestPreparedStatements#testPreparedQuery 0 2024-04-09T15:51:05-07:00 integration-tests_xcluster_ddl_queue_handler-test XClusterDDLQueueHandlerMockedTest.VerifySafeTimes 0 2024-04-09T15:51:12-07:00 integration-tests_xcluster_ddl_queue_handler-test XClusterDDLQueueHandlerMockedTest.VerifyBasicJsonParsing 0 2024-04-09T15:51:19-07:00 integration-tests_xcluster_ddl_queue_handler-test XClusterDDLQueueHandlerMockedTest.VerifyAlreadyProcessed 0 2024-04-09T15:51:27-07:00 integration-tests_xcluster_safe_time_service-test XClusterSafeTimeServiceTest.ComputeSafeTime 0 2024-04-09T15:51:35-07:00 integration-tests_xcluster_safe_time_service-test XClusterSafeTimeServiceTest.ComputeSafeTimeWithFilters 0 2024-04-09T15:51:43-07:00 integration-tests_xcluster_safe_time_service-test XClusterSafeTimeServiceTest.ComputeSafeTimeWithFiltersSingleTablet 0 2024-04-09T15:52:14-07:00 pgwrapper_pg_get_lock_status-test PgGetLockStatusTest.TestWaiterLockContainingColumnId 0 2024-04-09T15:52:27-07:00 tserver_clone_operation-test CloneOperationTest.Hardlink 8 2024-04-09T15:52:40-07:00 tserver_clone_operation-test CloneOperationTest.CloneAndRestore 8 2024-04-09T15:52:52-07:00 tserver_clone_operation-test CloneOperationTest.CloneOnSameDrive 0 2024-04-09T15:53:16-07:00 integration-tests_clone-tablet-itest CloneTabletExternalItest/CloneTabletExternalCrashItest.CrashDuringApply/0 0 2024-04-09T15:54:29-07:00 integration-tests_xcluster_ysql_index-test XClusterDbScopedYsqlIndexTest.CreateIndex 0 2024-04-09T15:55:46-07:00 integration-tests_xcluster_ysql_index-test XClusterDbScopedYsqlIndexTest.CreateIndexWithWorkload 0 2024-04-09T15:57:55-07:00 integration-tests_xcluster_db_scoped-test XClusterDBScopedTest.CreateTable 0 2024-04-09T15:58:04-07:00 integration-tests_xcluster_ddl_queue_handler-test XClusterDDLQueueHandlerMockedTest.SkipScanWhenNoNewRecords 0 2024-04-09T15:59:51-07:00 JAVA org.yb.pgsql.TestPgWriteRestart 1 2024-04-09T16:31:44-07:00 JAVA org.yb.pgsql.TestPgEncryption 0 2024-04-09T16:32:52-07:00 JAVA org.yb.pgsql.TestPgDdlConcurrency#testModifiedTableWrite 0 2024-04-09T16:38:02-07:00 JAVA org.yb.pgsql.ConcurrentTablespaceTest - TestPgRegressDistinctPushdown fails because EXPLAIN differences - The two Clone tests fail with "Bad status: Not found (yb/tserver/ts_tablet_manager.cc:2256): Tablet target_tablet_id not found" - TestPgEncryption fails with "Connection to 127.0.0.182:27726 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections." Jenkins: rebase: pg15 Reviewers: aagrawal, tfoucher, xCluster, hsunder Reviewed By: tfoucher Subscribers: patnaik.balivada, ybase, ycdcxcluster, yql Differential Revision: https://phorge.dev.yugabyte.com/D33933
Jira Link: DB-9809
Description
RSTATUS_DCHECK(!options.ddl_mode(), NotSupported, "Restarting a DDL transaction not supported");
[ts-1] F0128 16:15:26.960765 44908 pg_client_session.cc:1177] Check failed: !options.ddl_mode() Restarting a DDL transaction not supported
Fatal failure details written to ${BUILD_ROOT}/yb-test-logs/tests-integration-tests__wait_states-itest/
I've added a test that essentially creates index on a pg table in a loop, while there is 1 thread each doing reads/writes.
In debug builds this seems to cause a check-failure
RSTATUS_DCHECK(!options.ddl_mode(), NotSupported, "Restarting a DDL transaction not supported");
on jenkins:
/var/lib/jenkins/jobs/yugabyte-db-phabricator/builds/235434/archive/build/asan-clang16-dynamic-ninja/yb-test-logs/tests-integration-tests__wait_states-itest/WaitStateITest__AshTestVerifyOccurrence_VerifyWaitStateEntered__kBackfillIndex_WaitForAFreeSlot.log
/var/lib/jenkins/jobs/yugabyte-db-phabricator/builds/235432/archive/build/debug-gcc11-dynamic-ninja/yb-test-logs/tests-integration-tests__wait_states-itest/WaitStateITest__AshTestVerifyOccurrence_VerifyWaitStateEntered__kBackfillIndex_WaitForAFreeSlot.log
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: