Open sanketkedia opened 2 years ago
I did some blackbox testing on pitr and create table DDLs. Script is here. The script runs against an existing cluster.
Steps:
restore_ts
)restore_ts
or after, and whether the table is present post-restore or notrestore_ts
but not present post-restore or vice versaIn this version of the test, I don't see any unexpected tables. However, the master is occasionally put into an unstable state after the test. CPU utilization of yb-master
is consistently near 100% and some yb-admin
commands take a long time. This persists for anywhere for a few minutes after the test is run to more than half an hour. The warning log is full of messages like:
W0630 11:07:57.519742 1877831680 long_operation_tracker.cc:117] ScopedRWOperation running for 1.003s in thread 0x1708af000:
@ 0x1ada9b4a3 _sigtramp
@ 0x1174244bb yb::ThreadRestrictions::AssertIOAllowed()
@ 0x1171aa9b7 yb::PosixSequentialFile::Read()
@ 0x1109d1b5f rocksdb::SequentialFileReader::Read()
@ 0x1109ca24f rocksdb::CopyFile()
@ 0x1109f4cdb rocksdb::checkpoint::CreateCheckpoint()
@ 0x107dbd047 yb::tablet::TabletSnapshots::CreateCheckpoint()
@ 0x107dbba47 yb::tablet::TabletSnapshots::Create()
@ 0x1041a6d3b yb::master::MasterSnapshotCoordinator::Impl::CreateReplicated()
@ 0x1041a6327 yb::master::MasterSnapshotCoordinator::CreateReplicated()
@ 0x1080055d7 yb::tablet::SnapshotOperation::Apply()
@ 0x108006313 yb::tablet::SnapshotOperation::DoReplicated()
@ 0x107fc0ba3 yb::tablet::Operation::Replicated()
@ 0x107fe9223 yb::tablet::OperationDriver::ApplyTask()
@ 0x107fe8e2f yb::tablet::OperationDriver::ReplicationFinished()
@ 0x106a32487 yb::consensus::ConsensusRound::NotifyReplicationFinished()
W0630 11:07:58.183712 1877831680 long_operation_tracker.cc:117] TSHeartbeat running for 1.001s in thread 0x2e6353000:
@ 0x1ada9b4a3 _sigtramp
@ 0x1ada81383 _pthread_mutex_firstfit_lock_wait
@ 0x1ada7ecf7 _pthread_mutex_firstfit_lock_slow
@ 0x1ad9d51a7 std::__1::mutex::lock()
@ 0x106af884f std::__1::unique_lock<>::unique_lock()
@ 0x106ae6fcb std::__1::unique_lock<>::unique_lock()
@ 0x106ae63c7 yb::consensus::ReplicaState::LockForRead()
@ 0x106aa3023 yb::consensus::RaftConsensus::ConsensusState()
@ 0x10402dde7 yb::master::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
@ 0x10402e7b7 yb::master::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
@ 0x10402e84f yb::master::ScopedLeaderSharedLock::ScopedLeaderSharedLock()
@ 0x104130c73 yb::master::(anonymous namespace)::MasterHeartbeatServiceImpl::TSHeartbeat()
@ 0x113ea9eeb _ZZZN2yb6master17MasterHeartbeatIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENK3$_0clENSt3__110shared_ptrINS_3rpc11InboundCallEEEENKUlPKNS0_20TSHeartbeatRequestPBEPNS0_21TSHeartbeatResponsePBENSA_10RpcContextEE_clESF_SH_SI_
@ 0x113ea9adf _ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_6master20TSHeartbeatRequestPBENS3_21TSHeartbeatResponsePBEEEZZNS3_17MasterHeartbeatIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENK3$_0clENSt3__110shared_ptrINS0_11InboundCallEEEEUlPKS4_PS5_NS0_10RpcContextEE_EEDaSH_T0_
@ 0x113ea976f yb::master::MasterHeartbeatIf::InitMethods()::$_0::operator()()
@ 0x113ea96ab _ZNSt3__1L8__invokeIRZN2yb6master17MasterHeartbeatIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEE3$_0JNS_10shared_ptrINS1_3rpc11InboundCallEEEEEEDTclscT_fp_spscT0_fp0_EEOSF_DpOSG_
W0630 11:08:01.521366 1877831680 long_operation_tracker.cc:117] Create snapshot running for 5.005s in thread 0x1708af000:
@ 0x1ada9b4a3 _sigtramp
@ 0x10912528b std::__1::__hash_table<>::find<>()
@ 0x10911dbeb std::__1::unordered_map<>::find()
@ 0x10911d86b yb::docdb::ConsensusFrontier::Update()
@ 0x110735877 rocksdb::UpdateUserFrontier()
@ 0x1107a380f rocksdb::VersionSet::Recover()
@ 0x1092306bf yb::docdb::RocksDBPatcher::Impl::Load()
@ 0x109230627 yb::docdb::RocksDBPatcher::Load()
@ 0x107dbbc37 yb::tablet::TabletSnapshots::Create()
@ 0x1041a6d3b yb::master::MasterSnapshotCoordinator::Impl::CreateReplicated()
@ 0x1041a6327 yb::master::MasterSnapshotCoordinator::CreateReplicated()
@ 0x1080055d7 yb::tablet::SnapshotOperation::Apply()
@ 0x108006313 yb::tablet::SnapshotOperation::DoReplicated()
@ 0x107fc0ba3 yb::tablet::Operation::Replicated()
@ 0x107fe9223 yb::tablet::OperationDriver::ApplyTask()
@ 0x107fe8e2f yb::tablet::OperationDriver::ReplicationFinished()
W0630 11:08:13.871166 1888153600 long_operation_tracker.cc:146] ScopedRWOperation took a long time: 17.355s
W0630 11:08:13.871187 1888153600 long_operation_tracker.cc:146] Create snapshot took a long time: 17.355s
W0630 11:08:13.876030 1874391040 yb_rpc.cc:392] Call yb.master.MasterService.GetTableSchema 127.0.0.1:57549 => 127.0.0.1:7100 (request call id 446183) took 6705ms (client timeout 4999ms).
W0630 11:08:13.876154 1945415680 scoped_leader_shared_lock.cc:174] Long lock of catalog manager (../../src/yb/master/master_cluster_service.cc:140, GetMasterRegistration): 1.702s
W0630 11:08:13.876133 3869126656 scoped_leader_shared_lock.cc:174] Long lock of catalog manager (../../src/yb/master/master_cluster_service.cc:140, GetMasterRegistration): 16.388s
@ 0x10402e87c yb::master::ScopedLeaderSharedLock::~ScopedLeaderSharedLock()
@ 0x10402ecd4 yb::master::ScopedLeaderSharedLock::~ScopedLeaderSharedLock()
@ 0x10410e964 yb::master::(anonymous namespace)::MasterClusterServiceImpl::GetMasterRegistration()
@ 0x113de1f70 _ZZZN2yb6master15MasterClusterIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENK3$_4clENSt3__110shared_ptrINS_3rpc11InboundCallEEEENKUlPKNS0_30GetMasterRegistrationRequestPBEPNS0_31GetMasterRegistrationResponsePBENSA_10RpcContextEE_clESF_SH_SI_
@ 0x113de1ce8 _ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_6master30GetMasterRegistrationRequestPBENS3_31GetMasterRegistrationResponsePBEEEZZNS3_15MasterClusterIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENK3$_4clENSt3__110shared_ptrINS0_11InboundCallEEEEUlPKS4_PS5_NS0_10RpcContextEE_EEDaSH_T0_
@ 0x113de19e0 yb::master::MasterClusterIf::InitMethods()::$_4::operator()()
@ 0x113de191c _ZNSt3__1L8__invokeIRZN2yb6master15MasterClusterIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEE3$_4JNS_10shared_ptrINS1_3rpc11InboundCallEEEEEEDTclscT_fp_spscT0_fp0_EEOSF_DpOSG_
@ 0x113de18b4 std::__1::__invoke_void_return_wrapper<>::__call<>()
@ 0x113de1874 std::__1::__function::__alloc_func<>::operator()()
@ 0x113de040c std::__1::__function::__func<>::operator()()
@ 0x103938120 std::__1::__function::__value_func<>::operator()()
@ 0x1039315c4 std::__1::function<>::operator()()
@ 0x113dd0964 yb::master::MasterClusterIf::Handle()
@ 0x11319be98 yb::rpc::ServicePoolImpl::Handle()
@ 0x11306f108 yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x1131d0dc4 yb::rpc::(anonymous namespace)::Worker::Execute()
@ 0x1131d2ec0 _ZNSt3__1L8__invokeIRMN2yb3rpc12_GLOBAL__N_16WorkerEFvvERPS4_JEvEEDTcldsdescT0_fp0_fp_spscT1_fp1_EEOT_OSA_DpOSB_
@ 0x1131d2e44 _ZNSt3__1L15__apply_functorIMN2yb3rpc12_GLOBAL__N_16WorkerEFvvENS_5tupleIJPS4_EEEJLm0EENS7_IJEEEEENS_13__bind_returnIT_T0_T2_Xsr22__is_valid_bind_returnISC_SD_SE_EE5valueEE4typeERSC_RSD_NS_15__tuple_indicesIJXspT1_EEEEOSE_
@ 0x1131d2dfc _ZNSt3__16__bindIRKMN2yb3rpc12_GLOBAL__N_16WorkerEFvvEJRKPS4_EEclIJEEENS_13__bind_returnIS6_NS_5tupleIJS9_EEENSF_IJDpOT_EEEXsr22__is_valid_bind_returnIS6_SG_SK_EE5valueEE4typeESJ_
@ 0x1131d2db8 _ZNSt3__1L8__invokeIRNS_6__bindIRKMN2yb3rpc12_GLOBAL__N_16WorkerEFvvEJRKPS5_EEEJEEEDTclscT_fp_spscT0_fp0_EEOSF_DpOSG_
@ 0x1131d2d70 std::__1::__invoke_void_return_wrapper<>::__call<>()
@ 0x1131d2d48 std::__1::__function::__alloc_func<>::operator()()
@ 0x1131d1ae4 std::__1::__function::__func<>::operator()()
@ 0x10443c3d8 std::__1::__function::__value_func<>::operator()()
@ 0x10440b51c std::__1::function<>::operator()()
@ 0x117406de0 yb::Thread::SuperviseThread()
@ 0x1ada8426c _pthread_start
@ 0x1ada7f08c thread_start
W0630 11:08:13.876113 1874391040 scoped_leader_shared_lock.cc:174] Long lock of catalog manager (../../src/yb/master/master_ddl_service.cc:60, GetTableSchema): 6.705s
@ 0x10402e87c yb::master::ScopedLeaderSharedLock::~ScopedLeaderSharedLock()
@ 0x10402ecd4 yb::master::ScopedLeaderSharedLock::~ScopedLeaderSharedLock()
@ 0x1041224b0 _ZN2yb6master17MasterServiceBase14HandleOnLeaderINS0_23GetTableSchemaRequestPBENS0_24GetTableSchemaResponsePBEZNS1_8HandleInINS0_14CatalogManagerES3_S4_EEvPKT0_PT1_PNS_3rpc10RpcContextEMT_FNS_6StatusES9_SB_EPKciSK_NS_17StronglyTypedBoolINS0_19HoldCatalogLock_TagEEEEUlvE_EEvPKSF_PS7_SE_SA_SK_iSK_SN_
@ 0x104122304 yb::master::MasterServiceBase::HandleIn<>()
@ 0x10411f26c yb::master::(anonymous namespace)::MasterDdlServiceImpl::GetTableSchema()
@ 0x113fc7a24 _ZZZN2yb6master11MasterDdlIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENK3$_3clENSt3__110shared_ptrINS_3rpc11InboundCallEEEENKUlPKNS0_23GetTableSchemaRequestPBEPNS0_24GetTableSchemaResponsePBENSA_10RpcContextEE_clESF_SH_SI_
@ 0x113fc779c _ZN2yb3rpc10HandleCallINS0_19RpcCallPBParamsImplINS_6master23GetTableSchemaRequestPBENS3_24GetTableSchemaResponsePBEEEZZNS3_11MasterDdlIf11InitMethodsERK13scoped_refptrINS_12MetricEntityEEENK3$_3clENSt3__110shared_ptrINS0_11InboundCallEEEEUlPKS4_PS5_NS0_10RpcContextEE_EEDaSH_T0_
@ 0x113fc7494 yb::master::MasterDdlIf::InitMethods()::$_3::operator()()
@ 0x113fc73d0 _ZNSt3__1L8__invokeIRZN2yb6master11MasterDdlIf11InitMethodsERK13scoped_refptrINS1_12MetricEntityEEE3$_3JNS_10shared_ptrINS1_3rpc11InboundCallEEEEEEDTclscT_fp_spscT0_fp0_EEOSF_DpOSG_
@ 0x113fc7368 std::__1::__invoke_void_return_wrapper<>::__call<>()
@ 0x113fc7328 std::__1::__function::__alloc_func<>::operator()()
@ 0x113fc5ec0 std::__1::__function::__func<>::operator()()
@ 0x103938120 std::__1::__function::__value_func<>::operator()()
@ 0x1039315c4 std::__1::function<>::operator()()
@ 0x113fb905c yb::master::MasterDdlIf::Handle()
@ 0x11319be98 yb::rpc::ServicePoolImpl::Handle()
@ 0x11306f108 yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x1131d0dc4 yb::rpc::(anonymous namespace)::Worker::Execute()
@ 0x1131d2ec0 _ZNSt3__1L8__invokeIRMN2yb3rpc12_GLOBAL__N_16WorkerEFvvERPS4_JEvEEDTcldsdescT0_fp0_fp_spscT1_fp1_EEOT_OSA_DpOSB_
@ 0x1131d2e44 _ZNSt3__1L15__apply_functorIMN2yb3rpc12_GLOBAL__N_16WorkerEFvvENS_5tupleIJPS4_EEEJLm0EENS7_IJEEEEENS_13__bind_returnIT_T0_T2_Xsr22__is_valid_bind_returnISC_SD_SE_EE5valueEE4typeERSC_RSD_NS_15__tuple_indicesIJXspT1_EEEEOSE_
@ 0x1131d2dfc _ZNSt3__16__bindIRKMN2yb3rpc12_GLOBAL__N_16WorkerEFvvEJRKPS4_EEclIJEEENS_13__bind_returnIS6_NS_5tupleIJS9_EEENSF_IJDpOT_EEEXsr22__is_valid_bind_returnIS6_SG_SK_EE5valueEE4typeESJ_
@ 0x1131d2db8 _ZNSt3__1L8__invokeIRNS_6__bindIRKMN2yb3rpc12_GLOBAL__N_16WorkerEFvvEJRKPS5_EEEJEEEDTclscT_fp_spscT0_fp0_EEOSF_DpOSG_
@ 0x1131d2d70 std::__1::__invoke_void_return_wrapper<>::__call<>()
@ 0x1131d2d48 std::__1::__function::__alloc_func<>::operator()()
@ 0x1131d1ae4 std::__1::__function::__func<>::operator()()
@ 0x10443c3d8 std::__1::__function::__value_func<>::operator()()
@ 0x10440b51c std::__1::function<>::operator()()
@ 0x117406de0 yb::Thread::SuperviseThread()
@ 0x1ada8426c _pthread_start
@ 0x1ada7f08c thread_start
A previous version of the test would drop all previously created tables to get a clean test state instead of dropping and recreating the test database. This version experienced missing tables after the restore (tables where CREATE TABLE was ACK'd to the client up to a hundred milliseconds before the restore_ts
), and more frequent instability. The master process was wedged and I needed to restart the cluster. I haven't seen this since changing the cleanup logic, so I suspect this was related to issuing many DROP TABLE DDLs in a short period of time.
Jira Link: DB-681
Description
We should harden our restore semantics and behavior in cases when the user restores to a time when a DDL was ongoing.