yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.9k stars 1.06k forks source link

[YSQL] A DDL that fails due to a kConflict error appears to restart with read committed semantics #18761

Open tvesely opened 1 year ago

tvesely commented 1 year ago

Jira Link: DB-7648

Description

After a kConflict error the DDL statement is retried, and as a result the Postgres backend crashes with the following:

[ts-1] Fatal failure details written to ${BUILD_ROOT}/yb-test-logs/tests-pgwrapper__pg_ddl_concurrency-test/PgDDLConcurrencyTest_IndexCreation.fatal_failure_details.ts-1.2023-08-17T20_24_04.pid72149.txt
[ts-1] F20230817 20:24:04 ../../src/yb/yql/pggate/pg_txn_manager.cc:343] Check failed: !IsDdlMode() READ COMMITTED semantics don't apply to DDL transactions
[ts-1] ${BUILD_ROOT}/../../src/yb/util/logging.cc:465:     @     0x7f3c1d8e801d  yb::LogFatalHandlerSink::send(int, char const*, char const*, int, tm const*, char const*, unsigned long)
[ts-1]     @     0x7f3c1cac6af3 
[ts-1]     @     0x7f3c1caaf150 
[ts-1]     @     0x7f3c1cab2415 
[ts-1]     @     0x7f3c1cabb28a 
[ts-1] ${BUILD_ROOT}/../../src/yb/yql/pggate/pg_txn_manager.cc:342:     @     0x7f3c277ac4fe  yb::pggate::PgTxnManager::ResetTransactionReadPoint()
[ts-1] ${BUILD_ROOT}/../../src/yb/yql/pggate/pggate.cc:1870:     @     0x7f3c2761e460  yb::pggate::PgApiImpl::ResetTransactionReadPoint()
[ts-1] ${BUILD_ROOT}/../../src/yb/yql/pggate/ybc_pggate.cc:1331:     @     0x7f3c275eab11  YBCPgResetTransactionReadPoint
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/utils/time/../../../../../../../src/postgres/src/backend/utils/time/snapmgr.c:384:     @     0x561a2b3bb10f  GetTransactionSnapshot
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/pquery.c:1192:     @     0x561a2ae909ee  PortalRunUtility
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/pquery.c:0:     @     0x561a2ae8f0f3  PortalRunMulti
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/pquery.c:815:     @     0x561a2ae8d76a  PortalRun
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:1208:     @     0x561a2ae845e4  exec_simple_query
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:4756:     @     0x561a2ae845e4  yb_exec_simple_query_impl
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:4724:     @     0x561a2ae85321  yb_exec_query_wrapper_one_attempt
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:4748:     @     0x561a2ae80744  yb_exec_query_wrapper
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:4771:     @     0x561a2ae80744  yb_exec_simple_query
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:5396:     @     0x561a2ae7abd7  PostgresMain
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4658:     @     0x561a2acaa792  BackendRun
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4296:     @     0x561a2aca85a5  BackendStartup
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1775:     @     0x561a2aca85a5  ServerLoop
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1431:     @     0x561a2aca09a2  PostmasterMain
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/main/../../../../../../src/postgres/src/backend/main/main.c:234:     @     0x561a2aa522b5  PostgresServerProcessMain
[ts-1]     @     0x561a2aa52a01 
[ts-1]     @     0x7f3c2689cd84 
[ts-1]     @     0x561a2a11081d 
[ts-1] 
[ts-1] *** Check 

Warning: Please confirm that this issue does not contain any sensitive information

tvesely commented 1 year ago

This can be reproduced by enabling read committed and running the following test:

./yb_build.sh --cxx-test pgwrapper_pg_ddl_concurrency-test --gtest_filter PgDDLConcurrencyTest.IndexCreation
tvesely commented 1 year ago

Another similar failure with a slightly different root cause:

[ts-1] F0829 20:14:57.107990 151433 pg_txn_manager.cc:345] Check failed: !IsDdlMode() READ COMMITTED semantics don't apply to DDL transactions
[ts-1] Fatal failure details written to ${BUILD_ROOT}/yb-test-logs/tests-pgwrapper__pg_catalog_version-test/PgCatalogVersionTest_FixCatalogVersionTable.fatal_failure_details.ts-1.2023-08-29T20_14_57.pid151433.txt
[ts-1] F20230829 20:14:57 ../../src/yb/yql/pggate/pg_txn_manager.cc:345] Check failed: !IsDdlMode() READ COMMITTED semantics don't apply to DDL transactions
[ts-1] /net/ip-10-9-7-61.us-west-2.compute.internal/share/jenkins/jenkins-yugabyte-db-phabricator-186855/yugabyte-db/build/debug-clang16-dynamic-ninja/../../src/yb/util/logging.cc:465:     @     0xffff7dd7182b  yb::LogFatalHandlerSink::send(int, char const*, char const*, int, tm const*, char const*, unsigned long)
[ts-1]     @     0xffff7d42327b 
[ts-1]     @     0xffff7d423a23 
[ts-1]     @     0xffff7d4262db 
[ts-1] ${BUILD_ROOT}/../../src/yb/yql/pggate/pg_txn_manager.cc:344:     @     0xffff858f58cf  yb::pggate::PgTxnManager::ResetTransactionReadPoint()
[ts-1] ${BUILD_ROOT}/../../src/yb/yql/pggate/pggate.cc:1879:     @     0xffff857c15db  yb::pggate::PgApiImpl::ResetTransactionReadPoint()
[ts-1] ${BUILD_ROOT}/../../src/yb/yql/pggate/ybc_pggate.cc:1335:     @     0xffff857a1d8f  YBCPgResetTransactionReadPoint
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/utils/time/../../../../../../../src/postgres/src/backend/utils/time/snapmgr.c:387:     @     0xaaaade4dd663  GetTransactionSnapshot
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/executor/../../../../../../src/postgres/src/backend/executor/functions.c:1175:     @     0xaaaaddf7f8c7  fmgr_sql
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/catalog/yb_catalog/../../../../../../../src/postgres/src/backend/catalog/yb_catalog/yb_catalog_version.c:117:     @     0xaaaaddd54307  YbCallSQLIncrementCatalogVersions
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/catalog/yb_catalog/../../../../../../../src/postgres/src/backend/catalog/yb_catalog/yb_catalog_version.c:142:     @     0xaaaaddd515ff  YbIncrementMasterDBCatalogVersionTableEntryImpl
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/catalog/yb_catalog/../../../../../../../src/postgres/src/backend/catalog/yb_catalog/yb_catalog_version.c:249:     @     0xaaaaddd51523  YbIncrementMasterCatalogVersionTableEntry
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/utils/misc/../../../../../../../src/postgres/src/backend/utils/misc/pg_yb_utils.c:1436:     @     0xaaaade49b55f  YBDecrementDdlNestingLevel
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/utils/misc/../../../../../../../src/postgres/src/backend/utils/misc/pg_yb_utils.c:1944:     @     0xaaaade4a3447  YBTxnDdlProcessUtility
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/utility.c:377:     @     0xaaaade2305ab  ProcessUtility
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/pquery.c:1206:     @     0xaaaade22fb1f  PortalRunUtility
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/pquery.c:1366:     @     0xaaaade22eb2b  PortalRunMulti
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/pquery.c:815:     @     0xaaaade22df43  PortalRun
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:1217:     @     0xaaaade22b0b7  exec_simple_query
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:4753:     @     0xaaaade228b17  yb_exec_simple_query_impl
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:4721:     @     0xaaaade228c57  yb_exec_query_wrapper_one_attempt
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:4745:     @     0xaaaade228ad3  yb_exec_query_wrapper
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:4768:     @     0xaaaade223e4b  yb_exec_simple_query
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/tcop/../../../../../../src/postgres/src/backend/tcop/postgres.c:5393:     @     0xaaaade222c3f  PostgresMain
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4658:     @     0xaaaade135d83  BackendRun
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4296:     @     0xaaaade134c23  BackendStartup
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1775:     @     0xaaaade13345f  ServerLoop
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/postmaster/../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1431:     @     0xaaaade12fd3f  PostmasterMain
[ts-1] ${YB_SRC_ROOT}/src/postgres/src/backend/main/../../../../../../src/postgres/src/backend/main/main.c:234:     @     0xaaaade012597  PostgresServerProcessMain
[ts-1]     @     0xaaaade012a5f 
[ts-1]     @     0xffff850a4383 
[ts-1] 

This one is reproduced by running the following test with READ COMMITTED enabled:

./yb_build.sh --cxx-test pgwrapper_pg_catalog_version-test --gtest_filter PgCatalogVersionTest.FixCatalogVersionTable 

In YSQL, READ COMMITTED semantics only apply to non-ddl statements. However, in this test we end up attempting to update pg_yb_catalog_version when performing a ALTER ROLE yugabyte SUPERUSER statement. YbIncrementMasterCatalogVersionTableEntry ends up calling GetTransactionSnapshot and subsequently YBCPgResetTransactionReadPoint because the ddl nesting level is 0. This leads to the crash with the message: "Check failed: !IsDdlMode() READ COMMITTED semantics don't apply to DDL transactions".

In both scenarios we are resetting a transaction read point during a DDL statement. Both cases are using the assumption that we can rely on YBGetDdlNestingLevel() == 0 to mean that the statement being executed is not a DDL. We need to understand why the nesting level is 0 here.

robertsami commented 11 months ago

While we were blocking retries for DDLs, we may solve this issue by allowing retries for DDLs in read committed. Needs more investigation to ensure we can do this safely

Would need to at least remove this dcheck from the update above:

[ts-1] F0829 20:14:57.107990 151433 pg_txn_manager.cc:345] Check failed: !IsDdlMode() READ COMMITTED semantics don't apply to DDL transactions
robertsami commented 11 months ago

5/7 of the concerning tests look ok to have retries enabled. need to confirm whether truncate is safe to handle this way as well

robertsami commented 10 months ago

in the short term, we should make a change to avoid crashing in such a case

long term, we are waiting for changes to improve ddl atomicity before fixing this entirely

cc @pkj415 to link details of the dependent github issue for longer term change