yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.95k stars 1.07k forks source link

[DocDB] Core dumps in CatalogManager. while executing PITR with cross DB DDLs #21139

Closed shishir2001-yb closed 4 months ago

shishir2001-yb commented 8 months ago

Jira Link: DB-10079

Description

Tried on version 2.21.1.0-b124 Logs: https://drive.google.com/file/d/1KA7F5BpywI8qT0IhH9vTQ4zUIZz2DliT/view?usp=sharing

Encountered the following Core dump while running a new Cross DB DDLs test. Find the whole backtrace in the logs

thread #1, name = 'yb-master', stop reason = signal SIGABRT
  * frame #0: 0x00007fe595888acf libc.so.6`raise + 271
    frame #1: 0x00007fe59585bea5 libc.so.6`abort + 295
    frame #2: 0x00005653e7069013 yb-master`abort_message + 195
    frame #3: 0x00005653e7068cb2 yb-master`__cxa_pure_virtual + 18
    frame #4: 0x00005653e58b7809 yb-master`yb::master::CatalogManager::AddTableAndTabletEntriesToPB(tables=<unavailable>, out=0x00007fe55daa47f8, tablet_snapshot_info=0x0000000000000000, all_tablets=0x0000000000000000) at catalog_manager_ext.cc:409:40
    frame #5: 0x00005653e58b9005 yb-master`yb::master::CatalogManager::CollectEntries(this=0x000027eabfc12000, table_identifiers=<unavailable>, flags=<unavailable>) at catalog_manager_ext.cc:443:3
    frame #6: 0x00005653e58b9a0e yb-master`yb::master::CatalogManager::CollectEntriesForSnapshot(this=<unavailable>, tables=<unavailable>) at catalog_manager_ext.cc:462:27
    frame #7: 0x00005653e58ba5d2 yb-master`non-virtual thunk to yb::master::CatalogManager::CollectEntriesForSnapshot(google::protobuf::RepeatedPtrField<yb::master::TableIdentifierPB> const&) at catalog_manager_ext.cc:0
    frame #8: 0x00005653e58f4352 yb-master`yb::master::CatalogManager::MakeSnapshotSchedulesToObjectIdsMap(yb::master::SysRowEntryType) [inlined] yb::master::MasterSnapshotCoordinator::Impl::CollectEntries(this=0x000027eabfd95b80, filter=0x000027eaa6170e48) at master_snapshot_coordinator.cc:1932:21
    frame #9: 0x00005653e58f431f yb-master`yb::master::CatalogManager::MakeSnapshotSchedulesToObjectIdsMap(yb::master::SysRowEntryType) at master_snapshot_coordinator.cc:1003:22
    frame #10: 0x00005653e58f41e1 yb-master`yb::master::CatalogManager::MakeSnapshotSchedulesToObjectIdsMap(yb::master::SysRowEntryType) [inlined] yb::master::MasterSnapshotCoordinator::MakeSnapshotSchedulesToObjectIdsMap(this=<unavailable>, type=<unavailable>) at master_snapshot_coordinator.cc:2265:17
    frame #11: 0x00005653e58f41e1 yb-master`yb::master::CatalogManager::MakeSnapshotSchedulesToObjectIdsMap(this=<unavailable>, type=TABLE) at catalog_manager_ext.cc:3451:32
    frame #12: 0x00005653e5829460 yb-master`yb::master::CatalogManager::DeleteTableInternal(this=0x000027eabfc12000, req=0x000027eaba974fe8, resp=0x00007fe55daa5010, rpc=0x0000000000000000, epoch=0x000027eaba975018) at catalog_manager.cc:6369:34
    frame #13: 0x00005653e5827149 yb-master`yb::master::CatalogManager::DeleteTable(this=0x000027eabfc12000, req=0x000027eaba974fe8, resp=0x00007fe55daa5010, rpc=0x0000000000000000, epoch=0x000027eaba975018) at catalog_manager.cc:6344:10
    frame #14: 0x00005653e58265b0 yb-master`yb::RunnableImpl<yb::master::CatalogManager::VerifyTablePgLayer(scoped_refptr<yb::master::TableInfo>, bool, yb::master::LeaderEpoch const&)::$_0>::Run() at catalog_manager.cc:4470:5

Test details

1. Run a workload that changes databases for 20-30 minutes.
2. Schedule point-in-time recovery (PITR) for 10 random databases.
3. Create a backup for one random database.
4. Start and stop the workload after 10 minutes.
5. Note the time for the first PITR.
6. Keep the workload running.
7. Perform another PITR at a random time while the workload continues.
tserver_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "ysql_max_connections": "500",
                'client_read_write_timeout_ms': str(30 * 60 * 1000),
                'yb_client_admin_operation_timeout_sec': str(30 * 60),
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode"
            },
master_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "tablet_split_high_phase_shard_count_per_node": 20000,
                "tablet_split_high_phase_size_threshold_bytes": 2097152,  # 2MB
                # low_phase_size 100KB
                "tablet_split_low_phase_size_threshold_bytes": 102400,  # 100 KB
                "tablet_split_low_phase_shard_count_per_node": 10000,
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode"
            }

List of DDLs executed in sample app

private static List<List<String>> ddlList = List.of(
            List.of("CREATE INDEX idx1 ON ? (k)", "DROP INDEX idx1"),
            List.of("CREATE TABLE tempTable1 AS SELECT * FROM ? limit 1000000", "ALTER TABLE tempTable1 RENAME TO tempTable1_new", "DROP TABLE tempTable1_new"),
            List.of("CREATE MATERIALIZED VIEW mv1 as SELECT k from ? limit 10000", "REFRESH MATERIALIZED VIEW mv1", "DROP MATERIALIZED VIEW mv1"),
            List.of("ALTER TABLE ? ADD newColumn1 TEXT DEFAULT 'dummyString'", "ALTER TABLE ? DROP newColumn1"),
            List.of("ALTER TABLE ? ADD newColumn2 TEXT NULL", "ALTER TABLE ? DROP newColumn2"),
            List.of("CREATE VIEW view1_? AS SELECT k from ?", "DROP VIEW view1_?"),
            List.of("ALTER TABLE ? ADD newColumn3 TEXT DEFAULT 'dummyString'", "ALTER TABLE ? ALTER newColumn3 TYPE VARCHAR(1000)", "ALTER TABLE ? DROP newColumn3"),
            List.of("CREATE TABLE tempTable2 AS SELECT * FROM ? limit 1000000", "CREATE INDEX idx2 ON tempTable2(k)", "ALTER TABLE ? ADD newColumn4 TEXT DEFAULT 'dummyString'", "ALTER TABLE tempTable2 ADD newColumn2 TEXT DEFAULT 'dummyString'", "TRUNCATE table ? cascade", "ALTER TABLE ? DROP newColumn4", "ALTER TABLE tempTable2 DROP newColumn2", "DROP INDEX idx2", "DROP TABLE tempTable2"),
            List.of("CREATE VIEW view2_? AS SELECT k from ?", "CREATE MATERIALIZED VIEW mv2 as SELECT k from ? limit 10000", "REFRESH MATERIALIZED VIEW mv2", "DROP MATERIALIZED VIEW mv2", "DROP VIEW view2_?")
 );

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

druzac commented 5 months ago

Hard to pin down the stack trace but this looks like another race between a DDL and PITR. In this case a table creation was aborted by the async transaction verification check.

# tb_0 is created.
I0222 07:57:13.108202 97082 ybccmds.c:593] Creating Table postgres_6.public.tb_0 with DocDB table name tb_0
I0222 07:57:13.760165 97082 ybccmds.c:593] Creating Table postgres_6.public.tb_0 with DocDB table name tb_0
I0222 07:57:14.489079 97082 ybccmds.c:593] Creating Table postgres_6.public.tb_0 with DocDB table name tb_0

I0222 07:57:14.491248 32967 catalog_manager.cc:3760] CreateTable from 172.151.28.132:39127:
  name: "tb_0"
  table_id: "00004006000030008000000000004301"
  namespace {
    id: "00004006000030008000000000000000"
    name: "postgres_6"
    database_type: YQL_DATABASE_PGSQL
  }
  ...

I0222 07:57:14.498389 32967 catalog_manager.cc:4336] Successfully created table tb_0 [id=00004006000030008000000000004301] in postgres_6 [id=00004006000030008000000000000000] per request from 172.151.28.132:39127

I0222 07:57:14.568974 97059 ybccmds.c:1045] Creating index postgres_1.public.idx1_tb_0
2024-02-22 07:57:15.065 UTC [97059] STATEMENT:  CREATE INDEX idx1_tb_0 ON tb_0 (k)

# restore begins
I0222 07:57:16.177105 31366 master_snapshot_coordinator.cc:1991] Creating a new restoration entry with id aa569a67-39c0-495e-85b2-8cfe4c326570
I0222 07:57:16.182569 31387 master_snapshot_coordinator.cc:573] Restore sys catalog from snapshot: { id: 640f7166-222c-4293-bc96-c1ed171c87e8 snapshot_hybrid_time: { physical: 1708588636106863 } schedule_id: 297f9344-b1ff-4589-b5d0-09e3016047a6 previous_snapshot_hybrid_time: { physical: 1708588284422289 } version: 3 initial_state: CREATING tablets: [{ id: b8dade46aaef4f1799dbf6d5fe608541 state: 2 last_error: OK running: 0 }] }, schedule: { id: 297f9344-b1ff-4589-b5d0-09e3016047a6 options: filter { tables { tables { table_name: "" namespace { id: "00004015000030008000000000000000" name: "postgres_21" database_type: YQL_DATABASE_PGSQL } } } } interval_sec: 12000 retention_duration_sec: 60000 } at { physical: 1708588627599610 }, op id: 1.177862

I0222 07:57:16.429977 31387 master_snapshot_coordinator.cc:1110] PITR: aa569a67-39c0-495e-85b2-8cfe4c326570, tablets to restore: [b8dade46aaef4f1799dbf6d5fe608541]

I0222 07:57:19.095114 97165 master_snapshot_coordinator.cc:955] PITR: Master metadata verified successfully for restoration aa569a67-39c0-495e-85b2-8cfe4c326570
I0222 07:57:19.095139 97165 master_snapshot_coordinator.cc:894] PITR: Issuing pending tserver RPCs for restoration aa569a67-39c0-495e-85b2-8cfe4c326570

# transaction verification fires and decides to delete the table.
I0222 07:57:18.806051 96629 catalog_manager.cc:4463] Table transaction failed, deleting: tb_0 [id=00004006000030008000000000004301]
I0222 07:57:18.806094 96729 catalog_manager.cc:6232] Servicing DeleteTable request from internal request: table { table_id: "00004006000030008000000000004301" table_name: "tb_0" } is_index_table: false

# sys catalog reload due to PITR begins. this invalidates in memory objects i.e. tables and can lead to segfaults as we don't have the greatest memory discipline.
I0222 07:57:18.808919 97165 catalog_manager.cc:1176] T 00000000000000000000000000000000 P 83d683e185334f60a30980c425bf18d9: Loading table and tablet metadata into memory for term 1
I0222 07:57:18.808946 97165 catalog_manager.cc:1361] T 00000000000000000000000000000000 P 83d683e185334f60a30980c425bf18d9: VisitSysCatalog: Wait on leader_lock_ for any existing operations to finish. Term: 1
I0222 07:57:18.808954 97165 catalog_manager.cc:1374] T 00000000000000000000000000000000 P 83d683e185334f60a30980c425bf18d9: VisitSysCatalog: Acquire catalog manager lock_ before loading sys catalog.

I0222 07:57:18.910123 97165 catalog_loaders.cc:159] Enqueuing table for Transaction Verification: tb_0 [id=00004006000030008000000000004301]
I0222 07:57:18.910131 97165 catalog_loaders.cc:170] Loaded metadata for table tb_0 [id=00004006000030008000000000004301], state: RUNNING

# finished catalog load
I0222 07:57:19.092922 97165 catalog_manager.cc:1206] T 00000000000000000000000000000000 P 83d683e185334f60a30980c425bf18d9: Completed load of sys catalog in term 1
druzac commented 5 months ago

I think this is another instance of #18257. The issue is the TableInfo class has raw pointers to its backing tablets, not owning pointers. TableInfo::GetTablets manufactures scoped_refptr<TabletInfo> objects out of TabletInfo* pointers.

druzac commented 4 months ago

Duplicate of #18257