Closed shishir2001-yb closed 4 months ago
It is hard to pin this down from the stack trace, but it looks like another race between a DDL and PITR. In this case, a table creation was aborted by the async transaction verification check.
# tb_0 is created.
I0222 07:57:13.108202 97082 ybccmds.c:593] Creating Table postgres_6.public.tb_0 with DocDB table name tb_0
I0222 07:57:13.760165 97082 ybccmds.c:593] Creating Table postgres_6.public.tb_0 with DocDB table name tb_0
I0222 07:57:14.489079 97082 ybccmds.c:593] Creating Table postgres_6.public.tb_0 with DocDB table name tb_0
I0222 07:57:14.491248 32967 catalog_manager.cc:3760] CreateTable from 172.151.28.132:39127:
name: "tb_0"
table_id: "00004006000030008000000000004301"
namespace {
id: "00004006000030008000000000000000"
name: "postgres_6"
database_type: YQL_DATABASE_PGSQL
}
...
I0222 07:57:14.498389 32967 catalog_manager.cc:4336] Successfully created table tb_0 [id=00004006000030008000000000004301] in postgres_6 [id=00004006000030008000000000000000] per request from 172.151.28.132:39127
I0222 07:57:14.568974 97059 ybccmds.c:1045] Creating index postgres_1.public.idx1_tb_0
2024-02-22 07:57:15.065 UTC [97059] STATEMENT: CREATE INDEX idx1_tb_0 ON tb_0 (k)
# restore begins
I0222 07:57:16.177105 31366 master_snapshot_coordinator.cc:1991] Creating a new restoration entry with id aa569a67-39c0-495e-85b2-8cfe4c326570
I0222 07:57:16.182569 31387 master_snapshot_coordinator.cc:573] Restore sys catalog from snapshot: { id: 640f7166-222c-4293-bc96-c1ed171c87e8 snapshot_hybrid_time: { physical: 1708588636106863 } schedule_id: 297f9344-b1ff-4589-b5d0-09e3016047a6 previous_snapshot_hybrid_time: { physical: 1708588284422289 } version: 3 initial_state: CREATING tablets: [{ id: b8dade46aaef4f1799dbf6d5fe608541 state: 2 last_error: OK running: 0 }] }, schedule: { id: 297f9344-b1ff-4589-b5d0-09e3016047a6 options: filter { tables { tables { table_name: "" namespace { id: "00004015000030008000000000000000" name: "postgres_21" database_type: YQL_DATABASE_PGSQL } } } } interval_sec: 12000 retention_duration_sec: 60000 } at { physical: 1708588627599610 }, op id: 1.177862
I0222 07:57:16.429977 31387 master_snapshot_coordinator.cc:1110] PITR: aa569a67-39c0-495e-85b2-8cfe4c326570, tablets to restore: [b8dade46aaef4f1799dbf6d5fe608541]
I0222 07:57:19.095114 97165 master_snapshot_coordinator.cc:955] PITR: Master metadata verified successfully for restoration aa569a67-39c0-495e-85b2-8cfe4c326570
I0222 07:57:19.095139 97165 master_snapshot_coordinator.cc:894] PITR: Issuing pending tserver RPCs for restoration aa569a67-39c0-495e-85b2-8cfe4c326570
# transaction verification fires and decides to delete the table.
I0222 07:57:18.806051 96629 catalog_manager.cc:4463] Table transaction failed, deleting: tb_0 [id=00004006000030008000000000004301]
I0222 07:57:18.806094 96729 catalog_manager.cc:6232] Servicing DeleteTable request from internal request: table { table_id: "00004006000030008000000000004301" table_name: "tb_0" } is_index_table: false
# sys catalog reload due to PITR begins. This invalidates in-memory objects (i.e. tables) and can lead to segfaults, as we don't have the greatest memory discipline.
I0222 07:57:18.808919 97165 catalog_manager.cc:1176] T 00000000000000000000000000000000 P 83d683e185334f60a30980c425bf18d9: Loading table and tablet metadata into memory for term 1
I0222 07:57:18.808946 97165 catalog_manager.cc:1361] T 00000000000000000000000000000000 P 83d683e185334f60a30980c425bf18d9: VisitSysCatalog: Wait on leader_lock_ for any existing operations to finish. Term: 1
I0222 07:57:18.808954 97165 catalog_manager.cc:1374] T 00000000000000000000000000000000 P 83d683e185334f60a30980c425bf18d9: VisitSysCatalog: Acquire catalog manager lock_ before loading sys catalog.
I0222 07:57:18.910123 97165 catalog_loaders.cc:159] Enqueuing table for Transaction Verification: tb_0 [id=00004006000030008000000000004301]
I0222 07:57:18.910131 97165 catalog_loaders.cc:170] Loaded metadata for table tb_0 [id=00004006000030008000000000004301], state: RUNNING
# finished catalog load
I0222 07:57:19.092922 97165 catalog_manager.cc:1206] T 00000000000000000000000000000000 P 83d683e185334f60a30980c425bf18d9: Completed load of sys catalog in term 1
I think this is another instance of #18257. The issue is that the TableInfo class has raw pointers to its backing tablets, not owning pointers. TableInfo::GetTablets manufactures scoped_refptr<TabletInfo> objects out of TabletInfo* pointers.
Duplicate of #18257
Jira Link: DB-10079
Description
Tried on version 2.21.1.0-b124.
Logs: https://drive.google.com/file/d/1KA7F5BpywI8qT0IhH9vTQ4zUIZz2DliT/view?usp=sharing
Encountered the following core dump while running a new cross-DB DDLs test. The whole backtrace is in the logs.
Test details
List of DDLs executed in sample app
Issue Type
kind/bug