yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Tserver FATAL *handle == calls_.end(), for failed geo promotions. #17113

Closed rthallamko3 closed 10 months ago

rthallamko3 commented 1 year ago

Jira Link: DB-6398

Description

If a transaction fails to be promoted, rolling back the transaction can crash the tserver with the following FATAL:

F20230501 22:15:22 ../../src/yb/rpc/rpc.cc:359] Check failed: *handle == calls_.end()
    @     0x55d980d047d7  google::LogMessage::SendToLog()
    @     0x55d980d0571d  google::LogMessage::Flush()
    @     0x55d980d05c29  google::LogMessageFatal::~LogMessageFatal()
    @     0x55d9817728d9  yb::rpc::Rpcs::RegisterAndStart()
    @     0x55d980f96727  yb::client::YBTransaction::Impl::SendAbortToOldStatusTabletIfNeeded()
    @     0x55d980f9889d  yb::client::YBTransaction::Impl::UpdateTransactionStatusLocationDone()
    @     0x55d980f98ee8  std::__1::__function::__func<>::operator()()
    @     0x55d980fb23bf  yb::client::(anonymous namespace)::TransactionRpcBase::Finished()
    @     0x55d980fb2790  std::__1::__function::__func<>::operator()()
    @     0x55d981752b13  yb::rpc::OutboundCall::InvokeCallbackSync()
    @     0x55d9817563ab  yb::rpc::OutboundCall::InvokeCallback()
    @     0x55d9817489af  yb::rpc::LocalYBInboundCall::Respond()
    @     0x55d981773d7e  yb::rpc::RpcContext::RespondSuccess()
    @     0x55d981ae8618  yb::tserver::TabletServiceImpl::UpdateTransactionStatusLocation()
    @     0x55d981bd7ede  std::__1::__function::__func<>::operator()()
    @     0x55d981be081f  yb::tserver::TabletServerServiceIf::Handle()
    @     0x55d9817fee9e  yb::rpc::ServicePoolImpl::Handle()
    @     0x55d981744faf  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x55d98180da03  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0x55d981e5461f  yb::thread::SuperviseThread()
    @     0x7fa9029da694  start_thread
    @     0x7fa902edc41d  __clone

Note that this was seen on a geo-partitioned cluster that did not have the fix for https://github.com/yugabyte/yugabyte-db/issues/16108.

Even with the fix in https://github.com/yugabyte/yugabyte-db/issues/16108, it would be good to avoid the FATAL in case the same failed check can be reached through other code paths.
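
For illustration only, here is a minimal sketch of what "avoiding the FATAL" could look like. It assumes the failed check `*handle == calls_.end()` guards against registering a call on a handle that is already in use. `ToyRpcs`, `TryRegisterAndStart`, and `DemoDoubleRegistration` are hypothetical names invented for this sketch; this is not the real `yb::rpc::Rpcs` class and not the fix that was actually applied.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <list>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an RPC registry. NOT the real yb::rpc::Rpcs.
class ToyRpcs {
 public:
  using Call = std::function<void()>;
  using Calls = std::list<Call>;
  using Handle = Calls::iterator;

  // Sentinel meaning "this handle does not refer to a registered call".
  Handle InvalidHandle() { return calls_.end(); }

  // Registers and runs `call` if `*handle` is free. When the handle is
  // already in use, returns false so the caller can decide what to do,
  // instead of aborting the process on a failed check.
  bool TryRegisterAndStart(Call call, Handle* handle) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (*handle != calls_.end()) {
        return false;  // Already registered: report it, do not crash.
      }
      calls_.push_back(std::move(call));
      *handle = std::prev(calls_.end());
    }
    (**handle)();  // Run the call outside the lock.
    return true;
  }

  // Releases the handle so that it can be reused for a later call.
  void Unregister(Handle* handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (*handle != calls_.end()) {
      calls_.erase(*handle);
      *handle = calls_.end();
    }
  }

 private:
  std::mutex mutex_;
  Calls calls_;
};

// Uses one handle for two registrations in a row and returns how many
// times the call body ran.
int DemoDoubleRegistration() {
  ToyRpcs rpcs;
  ToyRpcs::Handle handle = rpcs.InvalidHandle();
  int runs = 0;
  bool first = rpcs.TryRegisterAndStart([&runs] { ++runs; }, &handle);
  bool second = rpcs.TryRegisterAndStart([&runs] { ++runs; }, &handle);
  assert(first && !second);  // The second attempt is rejected, not fatal.
  rpcs.Unregister(&handle);
  return runs;
}
```

In this sketch the caller decides how to react to a `false` return (for example, log and skip the duplicate request), so a second attempt on the same handle becomes a recoverable condition rather than a process abort.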


rthallamko3 commented 10 months ago

Jepsen tests run with geo-partitioning hit this failure. To run Jepsen tests on the 2.18 and 2.20 branches, the fix for this issue needs to be backported to both of those branches as well.