Open shishir2001-yb opened 3 months ago
There seem to be multiple issues here. One of them is that the master leader deadlocks while reloading the sys catalog. It is a single-thread deadlock: we hold exclusive access on CatalogManager::lock_
while aborting table tasks:
https://github.com/yugabyte/yugabyte-db/blob/da9b281c2e8e4ee702dc7e5030947d31b92d21c1/src/yb/master/catalog_manager.cc#L1342-L1347
The stack from the deadlocked thread:
@ 0xffffadf54b3f __GI___nanosleep
@ 0xaaaab39632a3 yb::master::CatalogManager::GetTabletInfos()
@ 0xaaaab3b4e113 yb::master::MasterSnapshotCoordinator::Impl::FinishRestoration()
@ 0xaaaab3b4ed1f _ZZN2yb6master12_GLOBAL__N_116MakeDoneCallbackIN5boost11multi_index21multi_index_containerINSt3__110unique_ptrINS0_16RestorationStateENS6_14default_deleteIS8_EEEENS4_10indexed_byINS4_13hashed_uniqueINS4_13const_mem_funIS8_RKNS_17StronglyTypedUuidINS_28TxnSnapshotRestorationId_TagEEEXadL_ZNKS8_14restoration_idEvEEEEN4mpl_2naESM_SM_EENS4_17hashed_non_uniqueINS4_3tagINS0_25MasterSnapshotCoordinator4Impl11ScheduleTagESM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_EENSE_IS8_RKNSF_INS_22SnapshotScheduleId_TagEEEXadL_ZNKS8_11schedule_idEvEEEESM_SM_EESM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_SM_EENS6_9allocatorISB_EEEENS6_6__bindIMSR_FvPS8_lEJPSR_RKNS6_12placeholders4__phILi1EEERlEEEEEDaPNS6_5mutexEPKT_RKNS1I_8key_typeERKNS6_12basic_stringIcNS6_11char_traitsIcEENS11_IcEEEERKT0_ENK11DoneFunctorclENS_6ResultIRKNS_7tserver26TabletSnapshotOpResponsePBEEE
@ 0xaaaab3b4f053 _ZNSt3__110__function6__funcIZN2yb6master12_GLOBAL__N_116MakeDoneCallbackIN5boost11multi_index21multi_index_containerINS_10unique_ptrINS3_16RestorationStateENS_14default_deleteISA_EEEENS7_10indexed_byINS7_13hashed_uniqueINS7_13const_mem_funISA_RKNS2_17StronglyTypedUuidINS2_28TxnSnapshotRestorationId_TagEEEXadL_ZNKSA_14restoration_idEvEEEEN4mpl_2naESO_SO_EENS7_17hashed_non_uniqueINS7_3tagINS3_25MasterSnapshotCoordinator4Impl11ScheduleTagESO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_EENSG_ISA_RKNSH_INS2_22SnapshotScheduleId_TagEEEXadL_ZNKSA_11schedule_idEvEEEESO_SO_EESO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_SO_EENS_9allocatorISD_EEEENS_6__bindIMST_FvPSA_lEJPST_RKNS_12placeholders4__phILi1EEERlEEEEEDaPNS_5mutexEPKT_RKNS1K_8key_typeERKNS_12basic_stringIcNS_11char_traitsIcEENS13_IcEEEERKT0_E11DoneFunctorNS13_IS20_EEFvNS2_6ResultIRKNS2_7tserver26TabletSnapshotOpResponsePBEEEEEclEOS27_
@ 0xaaaab38d461b yb::master::AsyncTabletSnapshotOp::Finished()
@ 0xaaaab38ad68f yb::master::RetryingRpcTask::AbortAndReturnPrevState()
@ 0xaaaab38f3ee7 yb::master::CatalogEntityWithTasks::AbortTasksAndCloseIfRequested()
@ 0xaaaab3924337 yb::master::CatalogManager::VisitSysCatalog()
@ 0xaaaab39215eb yb::master::CatalogManager::LoadSysCatalogDataTask()
@ 0xaaaab4a729f7 yb::ThreadPool::DispatchThread()
@ 0xaaaab4a6f177 yb::Thread::SuperviseThread()
@ 0xffffade778b7 start_thread
@ 0xffffaded3afb thread_start
This logic has been present for 7 years. Commit: https://github.com/yugabyte/yugabyte-db/commit/8e8849d00ed84d27bde5b7286517951f469f3110
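The call chain in the stack above can be reduced to a small self-contained sketch. Everything in it is hypothetical: `ToyRwSpinLock`, `ToyGetTabletInfos`, and the two `AbortTask...` functions are illustrative stand-ins, not YugabyteDB code. It assumes that `GetTabletInfos()` tries to take `lock_` in shared mode (the `nanosleep` frame suggests a spin-and-sleep wait, but the issue does not state this) and that the lock is not reentrant. The shared acquire is bounded here so the example terminates; an unbounded wait would hang in the same place.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical non-reentrant reader/writer spinlock.
// state_ == -1: held exclusively; state_ >= 0: number of shared holders.
class ToyRwSpinLock {
 public:
  void LockExclusive() {
    int expected = 0;
    while (!state_.compare_exchange_weak(expected, -1)) {
      expected = 0;  // a failed CAS overwrites expected; reset before retrying
      Backoff();
    }
  }
  void UnlockExclusive() { state_.store(0); }

  // Bounded shared acquire: gives up after max_attempts so the demo cannot hang.
  bool TryLockShared(int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
      int readers = state_.load();
      if (readers >= 0 && state_.compare_exchange_strong(readers, readers + 1)) {
        return true;
      }
      Backoff();
    }
    return false;
  }
  void UnlockShared() { state_.fetch_sub(1); }

 private:
  static void Backoff() {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  std::atomic<int> state_{0};
};

// Stand-in for the callback at the top of the stack: it needs the shared lock.
bool ToyGetTabletInfos(ToyRwSpinLock& lock, int max_attempts) {
  if (!lock.TryLockShared(max_attempts)) {
    return false;  // with an unbounded wait this thread would spin forever
  }
  lock.UnlockShared();
  return true;
}

// The pattern described in the issue: the task is aborted, and its done
// callback runs synchronously, while the same thread holds the exclusive lock.
bool AbortTaskUnderExclusiveLock(ToyRwSpinLock& lock) {
  lock.LockExclusive();
  bool callback_completed = ToyGetTabletInfos(lock, 5);
  lock.UnlockExclusive();
  return callback_completed;
}

// For contrast only: the same callback completes once the lock is released.
bool AbortTaskAfterReleasingLock(ToyRwSpinLock& lock) {
  lock.LockExclusive();
  lock.UnlockExclusive();
  return ToyGetTabletInfos(lock, 5);
}
```

In this sketch `AbortTaskUnderExclusiveLock` reports that the callback never got the lock, while `AbortTaskAfterReleasingLock` succeeds. The second function is there only to show that the lock ownership is what blocks the callback; it is not a proposed fix for the catalog manager.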
I'm not sure what the other issues are concretely. I noticed that all masters in the cluster were wedged while the master leader was deadlocked doing the sys catalog reload. I don't understand why no other master became leader. Judging from the stack traces I looked at, the followers didn't seem to be deadlocked.
Jira Link: DB-12281
Description
Version: 2.23.0.0-b361
Logs: Added in Jira
We have observed an issue where the system hangs during the abort process of a RESTORE_ON_TABLET task, leading to prolonged leader role changes and incomplete catalog loading. (Code link)
Since "master_snapshot_coordinator.cc:1813] Setting restore complete time" is the last log message we saw about aborting the restore, I’m guessing the thread hung somewhere before reaching here.
Test details:
G-flags:
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information