yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.86k stars 1.05k forks source link

[DocDB] SIGSEGV in yb-tserver 'NotifyReplicationFinished'. Multiple ReplicationFinished() callbacks triggered in retryable requests #23501

Open shamanthchandra-yb opened 4 weeks ago

shamanthchandra-yb commented 4 weeks ago

Jira Link: DB-12414

Description

Encountered a segmentation fault in the yb-tserver process within the NotifyReplicationFinished function. On bird view, from log, this crash appears to be because of handling of retryable requests, related to the ReplicationFinished() callback being invoked multiple times.

Here is the core backtrace:

thread #1, name = 'yb-tserver', stop reason = signal SIGSEGV
  * frame #0: 0x000055cecbcdd5e8 yb-tserver`yb::consensus::ConsensusRound::NotifyReplicationFinished(this=0x00001449eb5b32c0, status=0x00007f8d7a797620, leader_term=-1, applied_op_ids=0x0000000000000000) at consensus_round.cc:47:14
    frame #1: 0x000055cecbd39664 yb-tserver`yb::consensus::RetryableRequests::ReplicationFinished(yb::consensus::LWReplicateMsg const&, yb::Status const&, long) at retryable_requests.cc:490:18
    frame #2: 0x000055cecbd394c7 yb-tserver`yb::consensus::RetryableRequests::ReplicationFinished(this=<unavailable>, replicate_msg=0x0000144b43ebace0, status=0x00007f8d7a797750, leader_term=-1) at retryable_requests.cc:797:10
    frame #3: 0x000055cecbd2c105 yb-tserver`yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked(yb::OpId const&, yb::StronglyTypedBool<yb::consensus::CouldStop_Tag>) [inlined] yb::consensus::ReplicaState::NotifyReplicationFinishedUnlocked(this=0x00001449b22e5a40, round=0x00007f8d7a7976d8, status=0x00007f8d7a797750, leader_term=-1, applied_op_ids=0x00007f8d7a797730) at replica_state.cc:1463:23
    frame #4: 0x000055cecbd2c0d2 yb-tserver`yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked(this=0x00001449b22e5a40, committed_op_id=0x00007f8d7a797980, could_stop=(value_ = true)) at replica_state.cc:970:5
    frame #5: 0x000055cecbd2b459 yb-tserver`yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked(this=0x00001449b22e5a40, committed_op_id=0x00007f8d7a797980, could_stop=(value_ = true)) at replica_state.cc:882:17
    frame #6: 0x000055cecbd145d4 yb-tserver`yb::consensus::RaftConsensus::UpdateReplica(std::__1::shared_ptr<yb::consensus::LWConsensusRequestPB> const&, yb::consensus::LWConsensusResponsePB*) [inlined] yb::consensus::RaftConsensus::MarkOperationsAsCommittedUnlocked(this=0x00001449bc19f8d8, request=0x0000144a17cd0888, deduped_req=0x00007f8d7a7978f0, last_from_leader=<unavailable>) at raft_consensus.cc:2258:33
    frame #7: 0x000055cecbd13f94 yb-tserver`yb::consensus::RaftConsensus::UpdateReplica(this=0x00001449bc19f8d8, request_ptr=<unavailable>, response=0x0000144a17cd0978) at raft_consensus.cc:2015:3
    frame #8: 0x000055cecbcf4823 yb-tserver`yb::consensus::RaftConsensus::Update(this=0x00001449bc19f8d8, request_ptr=std::__1::shared_ptr<yb::consensus::LWConsensusRequestPB>::element_type @ 0x0000144a17cd0888, response=0x0000144a17cd0978, deadline=<unavailable>) at raft_consensus.cc:1581:14
    frame #9: 0x000055cecd01e0be yb-tserver`yb::tserver::ConsensusServiceImpl::UpdateConsensus(this=<unavailable>, req=0x0000144a17cd0888, resp=0x0000144a17cd0978, context=<unavailable>) at tablet_service.cc:2477:25
    frame #10: 0x000055cecbd8296e yb-tserver`std::__1::__function::__func<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::__1::allocator<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] yb::consensus::ConsensusServiceIf::InitMethods(this=<unavailable>, req=<unavailable>, resp=<unavailable>, rpc_context=RpcContext @ 0x00007f8d7a797dc0)::$_0::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>) const::'lambda'(yb::consensus::LWConsensusRequestPB const*, yb::consensus::LWConsensusResponsePB*, yb::rpc::RpcContext)::operator()(yb::consensus::LWConsensusRequestPB const*, yb::consensus::LWConsensusResponsePB*, yb::rpc::RpcContext) const at consensus.service.cc:288:9
    frame #11: 0x000055cecbd82930 yb-tserver`std::__1::__function::__func<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::__1::allocator<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) at local_call.h:126:7
    frame #12: 0x000055cecbd82354 yb-tserver`std::__1::__function::__func<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::__1::allocator<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] yb::consensus::ConsensusServiceIf::InitMethods(this=<unavailable>, call=<unavailable>)::$_0::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>) const at consensus.service.cc:286:7
    frame #13: 0x000055cecbd822d5 yb-tserver`std::__1::__function::__func<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::__1::allocator<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] decltype(__f=<unavailable>, __args=<unavailable>)::$_0&>()(std::declval<std::__1::shared_ptr<yb::rpc::InboundCall>>())) std::__1::__invoke[abi:ue170006]<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::__1::shared_ptr<yb::rpc::InboundCall>>(yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::__1::shared_ptr<yb::rpc::InboundCall>&&) at invoke.h:340:25
    frame #14: 0x000055cecbd822b4 yb-tserver`std::__1::__function::__func<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::__1::allocator<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<yb::consensus::ConsensusServiceIf::InitMethods(__args=<unavailable>, __args=<unavailable>)::$_0&, std::__1::shared_ptr<yb::rpc::InboundCall>>(yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0&, std::__1::shared_ptr<yb::rpc::InboundCall>&&) at invoke.h:415:5
    frame #15: 0x000055cecbd822b4 yb-tserver`std::__1::__function::__func<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::__1::allocator<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator()(std::__1::shared_ptr<yb::rpc::InboundCall>&&) [inlined] std::__1::__function::__alloc_func<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::__1::allocator<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator(this=<unavailable>, __arg=<unavailable>)[abi:ue170006](std::__1::shared_ptr<yb::rpc::InboundCall>&&) at function.h:192:16
    frame #16: 0x000055cecbd822b4 yb-tserver`std::__1::__function::__func<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0, std::__1::allocator<yb::consensus::ConsensusServiceIf::InitMethods(scoped_refptr<yb::MetricEntity> const&)::$_0>, void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator(this=<unavailable>, __arg=<unavailable>)(std::__1::shared_ptr<yb::rpc::InboundCall>&&) at function.h:363:12
    frame #17: 0x000055cecbd8359f yb-tserver`yb::consensus::ConsensusServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) [inlined] std::__1::__function::__value_func<void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator(this=<unavailable>, __args=nullptr)[abi:ue170006](std::__1::shared_ptr<yb::rpc::InboundCall>&&) const at function.h:517:16
    frame #18: 0x000055cecbd83580 yb-tserver`yb::consensus::ConsensusServiceIf::Handle(std::__1::shared_ptr<yb::rpc::InboundCall>) [inlined] std::__1::function<void (std::__1::shared_ptr<yb::rpc::InboundCall>)>::operator(this=<unavailable>, __arg=nullptr)(std::__1::shared_ptr<yb::rpc::InboundCall>) const at function.h:1168:12
    frame #19: 0x000055cecbd83580 yb-tserver`yb::consensus::ConsensusServiceIf::Handle(this=<unavailable>, call=<unavailable>) at consensus.service.cc:271:3
    frame #20: 0x000055ceccbf0130 yb-tserver`yb::rpc::ServicePoolImpl::Handle(this=0x00001449bf7e9440, incoming=<unavailable>) at service_pool.cc:269:19
    frame #21: 0x000055ceccb0c70f yb-tserver`yb::rpc::InboundCall::InboundCallTask::Run(this=<unavailable>) at inbound_call.cc:314:13
    frame #22: 0x000055ceccbff923 yb-tserver`yb::rpc::(anonymous namespace)::Worker::Execute(this=0x00001449bf5babd0) at thread_pool.cc:115:15
    frame #23: 0x000055cecd46e9a3 yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator(this=0x00001449be5262a0)[abi:ue170006]() const at function.h:517:16
    frame #24: 0x000055cecd46e98d yb-tserver`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator(this=0x00001449be5262a0)() const at function.h:1168:12
    frame #25: 0x000055cecd46e98d yb-tserver`yb::Thread::SuperviseThread(arg=0x00001449be526240) at thread.cc:866:3
    frame #26: 0x00007f8d963fc1ca libpthread.so.0`start_thread + 234
    frame #27: 0x00007f8d9664de73 libc.so.6`__clone + 67

These logs were around same timestamp, on that node:

W0814 07:14:46.142526  7637 ts_tablet_manager.cc:2051] T 32ce0133ec89495d86058db37c32376d P 2d428f411bdb446a8aefbeb752161cea: Failed to submit retryable requests task: Already present (yb/tablet/tablet_bootstrap_state_flusher.cc:140): Tablet 32ce0133ec89495d86058db37c32376d has a flush task still in progress

W0814 07:14:46.291299  1602 raft_consensus.cc:2100] T 32ce0133ec89495d86058db37c32376d P 2d428f411bdb446a8aefbeb752161cea [term 15 FOLLOWER]: StartReplicaOperationUnlocked failed: Illegal state (yb/consensus/replica_state.cc:733): Cannot register retryable request on follower

W0814 07:14:46.291330  1602 raft_consensus.cc:2117] T 32ce0133ec89495d86058db37c32376d P 2d428f411bdb446a8aefbeb752161cea [term 15 FOLLOWER]: Could not prepare operation for op: 15.182395. Suppressed 286 other warnings. Status for this op: Illegal state (yb/consensus/replica_state.cc:733): Cannot register retryable request on follower

E0814 07:14:46.334568  1602 consensus_round.cc:43] 0x1449eb5b32c0: callback(" << callback_ << ")->ReplicationFinished() has been already called for id { term: 15 index: 182395 } hybrid_time: 7059945302671347712 op_type: WRITE_OP write { unused_tablet_id: "" write_batch { transaction { transaction_id: "5DA39B2EC1E74DD493CA23C5EE97EACD" isolation: SERIALIZABLE_ISOLATION status_tablet: "ebdc7189283844cd8c022092daadf791" priority: 14326124569636204096 start_hybrid_time: 7059943152297328640 locality: GLOBAL pg_txn_start_us: 776934133665668 } deprecated_may_have_metadata: 1 read_pairs { key: "G\xfd<Sd3ed3167-b71b-47d7-b427-2f7401238072:271636:5i65iglmc3mg3cjm5mc3jhmch7mjfghinfmh6j76jhmnd4nh4dkn6nd4kindi8nkghijogni7k87kinoe5oi5elo7oe5ljoej9olhijkphoj:seed\x00\x00!!" value: "$" } table_schema_version { schema_version: 0 } } client_id1: 13782057669155686126 client_id2: 3038810002090317453 request_id: 2881287 min_running_request_id: 2881282 batch_idx: 0 start_time_micros: 1723618933728671 } committed_op_id { term: 15 index: 182394 } monotonic_counter: 0

Please find run details in JIRA.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

Huqicheng commented 4 weeks ago

For duplicate retryable requests, we call NotifyReplicationFinished for multiple times. I can see at least two of them: First, it's pushed back to the duplicate_rounds, and will be called when the first registered round is replicated

if (!emplace_result.second) {
      emplace_result.first->duplicate_rounds.push_back(round);
      return false;
    }
for (const auto& duplicate : running_it->duplicate_rounds) {
      duplicate->NotifyReplicationFinished(status_for_duplicate, leader_term,
                                           nullptr /* applied_op_ids */);
    }

The other place is

  auto result = state_->AddPendingOperation(round_ptr, OperationMode::kFollower);
  if (!result.ok()) {
    round_ptr->NotifyReplicationFinished(result, OpId::kUnknownTerm, /* applied_op_ids */ nullptr);
  }