yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB][PITR] Aborted RESTORE_ON_TABLET Task Results in Incomplete Sys Catalog Loading and Leader Role Instability #23356

Open shishir2001-yb opened 3 months ago

shishir2001-yb commented 3 months ago

Jira Link: DB-12281

Description

Version: 2.23.0.0-b361
Logs: Added in Jira

We have observed an issue where the master hangs while aborting a RESTORE_ON_TABLET task, leading to prolonged leader role instability and incomplete sys catalog loading. (Code link)

I0718 18:15:16.302239 643310 catalog_manager.cc:1336] T 00000000000000000000000000000000 P da456c642a764b3385ef40483fe59d1c: VisitSysCatalog: Acquire catalog manager lock_ before loading sys catalog.
...
I0718 18:15:16.305985 643310 catalog_entity_base.h:151] vlog1: 00004091000030008000000000004018.colocation.parent.tablename [id=00004091000030008000000000004018.colocation.parent.uuid]: Closing and aborting tasks
I0718 18:15:16.305991 643310 catalog_entity_base.cc:114] vlog1: AbortTasksAndCloseIfRequested: Close and abort task 0x36b8ba93a020 e13406d294fc4ddd986275bcf84ae29e (table 00004091000030008000000000004018.colocation.parent.tablename [id=00004091000030008000000000004018.colocation.parent.uuid]) Tablet Snapshot Operation RESTORE_ON_TABLET RPC b5b4b5ba-9e1c-4e74-9f4d-7043ea87ea75
I0718 18:15:16.306011 643310 catalog_entity_base.cc:77] vlog1: Removed task 0x36b8ba93a020 e13406d294fc4ddd986275bcf84ae29e (table 00004091000030008000000000004018.colocation.parent.tablename [id=00004091000030008000000000004018.colocation.parent.uuid]) Tablet Snapshot Operation RESTORE_ON_TABLET RPC b5b4b5ba-9e1c-4e74-9f4d-7043ea87ea75
E0718 18:15:16.306023 643310 state_with_tablets.cc:138] Restoration[b64bce02-bce8-4971-841c-e7490460d786/b5b4b5ba-9e1c-4e74-9f4d-7043ea87ea75/b6ee9a76-10b6-4833-bc08-db7669ae5c68]: Finished RESTORING at e13406d294fc4ddd986275bcf84ae29e that is not running and in state RESTORED: Operation expired (yb/master/catalog_entity_base.cc:107): Table closing
I0718 18:15:16.306048 643310 master_snapshot_coordinator.cc:1813] Setting restore complete time to { physical: 1721326516306048 }
This is followed by a large number of “Leader not ready to serve requests” messages, indicating that sys catalog loading still hasn’t completed:
I0718 18:15:19.353892 165897 master_cluster_service.cc:241] Patching role from leader to follower because of: Leader not ready to serve requests (yb/master/scoped_leader_shared_lock.cc:134): Leader not yet ready to serve requests: leader_ready_term_ = 2; cstate.current_term = 4 [suppressed 118 similar messages]
I0719 20:27:14.371500 642473 master_cluster_service.cc:241] Patching role from leader to follower because of: Leader not ready to serve requests (yb/master/scoped_leader_shared_lock.cc:134): Leader not yet ready to serve requests: leader_ready_term_ = 2; cstate.current_term = 4 [suppressed 61 similar messages]
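The role patching in these messages reflects the master's readiness check: a master only reports itself as leader once the sys catalog load for its current Raft term has finished, so while leader_ready_term_ lags behind cstate.current_term it keeps advertising itself as a follower. Below is a rough sketch of that relationship, with hypothetical names rather than the real ScopedLeaderSharedLock / master_cluster_service code:

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Hypothetical simplification of the readiness check behind the
    // "Leader not yet ready to serve requests" messages above; this is
    // not the actual implementation.
    struct MasterStateSketch {
      int64_t leader_ready_term = 2;  // term of the last completed sys catalog load
      int64_t current_term = 4;       // current Raft term from the consensus state
      bool is_raft_leader = true;
    };

    std::string ReportedRole(const MasterStateSketch& m) {
      // Until the sys catalog load for the current term completes,
      // leader_ready_term lags current_term and the node reports itself
      // as a follower even though Raft already elected it leader.
      if (m.is_raft_leader && m.leader_ready_term == m.current_term) {
        return "LEADER";
      }
      return "FOLLOWER";
    }

    int main() {
      MasterStateSketch m;
      std::cout << "reported role: " << ReportedRole(m) << "\n";  // FOLLOWER
      m.leader_ready_term = m.current_term;  // sys catalog load finished for term 4
      std::cout << "reported role: " << ReportedRole(m) << "\n";  // LEADER
      return 0;
    }

Since the sys catalog reload never finishes here, leader_ready_term_ never catches up to the current term and the cluster keeps seeing the master as not ready.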

Considering "master_snapshot_coordinator.cc:1813] Setting restore complete time" is the last log message we saw about aborting restore, I’m guessing the thread hang somewhere before reaching here.

Test details:

1. Create a cluster with the required g-flags
2. Start the cross-DB DDL workload, which executes DDLs and DMLs across databases concurrently (50 colocated databases and 100 non-colocated databases); run this for 20-30 mins
3. Create a PITR schedule on 20 random databases
4. Start a while loop and run it for 120 mins:
   a. Note down the time for PITR(0)
   b. Create a backup of 1 random database
   c. Start the cross-DB DDL workload and stop it after 10 mins
   d. Note down the time for PITR(1)
   e. Start the cross-DB DDL workload and run it for 10 mins
   f. Execute PITR on all 10 databases at random times (between 1-9 sec ago)
   g. Restore to PITR(1)
   h. Validate data
   i. Restore to PITR(0) with a probability of 0.6 and validate data
   j. Delete the PITR schedule for the backup DB
   k. Drop the database
   l. Restore the backup
   m. Create the snapshot schedule for this new DB

G-flags:

    tserver_gflags={
        "ysql_enable_packed_row": "true",
        "ysql_enable_packed_row_for_colocated_table": "true",
        "enable_automatic_tablet_splitting": "true",
        "ysql_max_connections": "500",
        "client_read_write_timeout_ms": str(30 * 60 * 1000),
        "yb_client_admin_operation_timeout_sec": str(30 * 60),
        "consistent_restore": "true",
        "ysql_enable_db_catalog_version_mode": "true",
        "tablet_replicas_per_gib_limit": 0,
        "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
        "log_ysql_catalog_versions": "true"
    },
    master_gflags={
        "ysql_enable_packed_row": "true",
        "ysql_enable_packed_row_for_colocated_table": "true",
        "enable_automatic_tablet_splitting": "true",
        "consistent_restore": "true",
        "ysql_enable_db_catalog_version_mode": "true",
        "tablet_replicas_per_gib_limit": 0,
        "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
        "log_ysql_catalog_versions": "true"
    }

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

druzac commented 3 months ago

There seem to be multiple issues here. One of them is that the master leader deadlocks while reloading the sys catalog. It is a single-thread deadlock: we hold exclusive access on CatalogManager::lock_ while aborting table tasks: https://github.com/yugabyte/yugabyte-db/blob/da9b281c2e8e4ee702dc7e5030947d31b92d21c1/src/yb/master/catalog_manager.cc#L1342-L1347

The stack from the deadlocked thread:

    @     0xffffadf54b3f  __GI___nanosleep
    @     0xaaaab39632a3  yb::master::CatalogManager::GetTabletInfos()
    @     0xaaaab3b4e113  yb::master::MasterSnapshotCoordinator::Impl::FinishRestoration()
    @     0xaaaab3b4ed1f  _ZZN2yb6master12_GLOBAL__N_116MakeDoneCallbackIN5boost11multi_index21multi_index_containerINSt3__110unique_ptrINS0_16RestorationStateENS6_14default_deleteIS8_EEEENS4_10indexed_byINS4_13hashed_uniqueINS4_13const_mem_funIS8_RKNS_17StronglyTypedUuidINS_28TxnSnapshotRestorationId_TagEEEXadL_ZNKS8_14restoration_idEvEEEEN4mpl_2naESM_SM_EENS4_17hashed_non_uniqueINS4_3tagINS0_25MasterSnapshotCoordinator4Impl11ScheduleTagESM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_EENSE_IS8_RKNSF_INS_22SnapshotScheduleId_TagEEEXadL_ZNKS8_11schedule_idEvEEEESM_SM_EESM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_EENS6_9allocatorISB_EEEENS6_6__bindIMSR_FvPS8_lEJPSR_RKNS6_12placeholders4__phILi1EEERlEEEEEDaPNS6_5mutexEPKT_RKNS1I_8key_typeERKNS6_12basic_stringIcNS6_11char_traitsIcEENS11_IcEEEERKT0_ENK11DoneFunctorclENS_6ResultIRKNS_7tserver26TabletSnapshotOpResponsePBEEE
    @     0xaaaab3b4f053  _ZNSt3__110__function6__funcIZN2yb6master12_GLOBAL__N_116MakeDoneCallbackIN5boost11multi_index21multi_index_containerINS_10unique_ptrINS3_16RestorationStateENS_14default_deleteISA_EEEENS7_10indexed_byINS7_13hashed_uniqueINS7_13const_mem_funISA_RKNS2_17StronglyTypedUuidINS2_28TxnSnapshotRestorationId_TagEEEXadL_ZNKSA_14restoration_idEvEEEEN4mpl_2naESO_SO_EENS7_17hashed_non_uniqueINS7_3tagINS3_25MasterSnapshotCoordinator4Impl11ScheduleTagESO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_EENSG_ISA_RKNSH_INS2_22SnapshotScheduleId_TagEEEXadL_ZNKSA_11schedule_idEvEEEESO_SO_EESO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_EENS_9allocatorISD_EEEENS_6__bindIMST_FvPSA_lEJPST_RKNS_12placeholders4__phILi1EEERlEEEEEDaPNS_5mutexEPKT_RKNS1K_8key_typeERKNS_12basic_stringIcNS_11char_traitsIcEENS13_IcEEEERKT0_E11DoneFunctorNS13_IS20_EEFvNS2_6ResultIRKNS2_7tserver26TabletSnapshotOpResponsePBEEEEEclEOS27_
    @     0xaaaab38d461b  yb::master::AsyncTabletSnapshotOp::Finished()
    @     0xaaaab38ad68f  yb::master::RetryingRpcTask::AbortAndReturnPrevState()
    @     0xaaaab38f3ee7  yb::master::CatalogEntityWithTasks::AbortTasksAndCloseIfRequested()
    @     0xaaaab3924337  yb::master::CatalogManager::VisitSysCatalog()
    @     0xaaaab39215eb  yb::master::CatalogManager::LoadSysCatalogDataTask()
    @     0xaaaab4a729f7  yb::ThreadPool::DispatchThread()
    @     0xaaaab4a6f177  yb::Thread::SuperviseThread()
    @     0xffffade778b7  start_thread
    @     0xffffaded3afb  thread_start

This logic has been present for 7 years. Commit: https://github.com/yugabyte/yugabyte-db/commit/8e8849d00ed84d27bde5b7286517951f469f3110

Code: https://github.com/yugabyte/yugabyte-db/blob/8e8849d00ed84d27bde5b7286517951f469f3110/src/yb/master/catalog_manager.cc#L601-L608
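To make the failure mode concrete, here is a minimal, self-contained sketch of the re-entrancy problem. The class and method names only mirror the call chain in the stack above and this is not YugabyteDB code: the sys catalog reload thread holds the catalog manager mutex exclusively, synchronously aborts table tasks under it, and the task-done callback ends up in a method that tries to take the same mutex in shared mode. With a non-reentrant reader-writer mutex that acquisition can never be granted, so the single thread blocks forever.

    #include <functional>
    #include <iostream>
    #include <shared_mutex>
    #include <vector>

    // Hypothetical sketch of the single-thread deadlock; names mirror the
    // stack trace above but none of this is the real implementation.
    class CatalogManagerSketch {
     public:
      void VisitSysCatalog() {
        // The reload path holds the catalog manager mutex exclusively ...
        std::unique_lock<std::shared_mutex> exclusive(lock_);
        AbortTasksAndCloseIfRequested();  // ... and aborts table tasks under it.
        // ... sys catalog loading would continue here ...
      }

     private:
      void AbortTasksAndCloseIfRequested() {
        // Aborting a task runs its done-callback synchronously on this thread,
        // e.g. AsyncTabletSnapshotOp::Finished -> FinishRestoration.
        for (auto& done_cb : pending_task_callbacks_) {
          done_cb();
        }
      }

      void FinishRestoration() {
        GetTabletInfos();
      }

      void GetTabletInfos() {
        // Needs the same mutex in shared mode. The calling thread already
        // holds it exclusively, and the mutex is not reentrant, so this
        // acquisition never succeeds (the reported stack shows the thread
        // sleeping/retrying at this point). Recursive locking of
        // std::shared_mutex is undefined behavior; it stands in here for
        // YugabyteDB's own non-reentrant reader-writer lock.
        std::shared_lock<std::shared_mutex> shared(lock_);
      }

      std::shared_mutex lock_;
      std::vector<std::function<void()>> pending_task_callbacks_{
          [this] { FinishRestoration(); }};
    };

    int main() {
      CatalogManagerSketch cm;
      std::cout << "entering VisitSysCatalog(); this self-deadlocks\n";
      cm.VisitSysCatalog();  // never returns
      return 0;
    }

Running the sketch hangs at the shared-lock acquisition, which matches the nanosleep frame under GetTabletInfos() in the stack above.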

druzac commented 3 months ago

I'm not sure what the other issues are, concretely. I noticed that all masters in the cluster were wedged while the master leader was deadlocked doing the sys catalog reload. I don't understand why no other master became leader; the followers didn't seem to be deadlocked, based on the stack traces I looked at.