yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] High memory usage in transaction locking and conflict resolution functions leading to potential OOM in tserver #24915

Open archit-rastogi opened 1 week ago

archit-rastogi commented 1 week ago

Jira Link: DB-14046

Description

The Slack thread is linked in the Jira description.

Observed on the following branches: 2024.1.3.0, 2024.1.3.1, 2024.1.4.0

High memory consumption has been observed in the transaction locking and conflict resolution functions within the tserver, putting it at risk of out-of-memory (OOM) errors. Heap snapshots show large allocations in these code paths.

[Screenshot 2024-11-14 at 4 31 06 PM]
[Screenshot 2024-11-14 at 3 27 16 PM]

Example:

Memory usage details of node n2, which went into OOM:

This is a jump of about 2.6 GB in 1.5 minutes, all of it untracked memory usage on the tserver.

Compared with a heap snapshot taken before the increase, the difference comes from the call stacks below.

Estimated 1,137,550,736 bytes = ~1.14 GB

tcmalloc::tcmalloc_internal::SampleifyAllocation<>()
slow_alloc<>()
TCMallocInternalNew
yb::tablet::TransactionParticipant::Impl::LockAndFind()
yb::tablet::TransactionParticipant::RequestStatusAt()
yb::docdb::(anonymous namespace)::ConflictResolver::DoResolveConflicts()
yb::docdb::(anonymous namespace)::ConflictResolver::Resolve()
yb::docdb::(anonymous namespace)::WaitOnConflictResolver::Run()
yb::tablet::TabletPeer::WriteAsync()
yb::tserver::PerformRead()
yb::tserver::TabletServiceImpl::Read()
std::__1::__function::__func<>::operator()()
yb::tserver::TabletServerServiceIf::Handle()
yb::rpc::ServicePoolImpl::Handle()
yb::rpc::InboundCall::InboundCallTask::Run()
yb::rpc::(anonymous namespace)::Worker::Execute()
yb::thread::SuperviseThread()
start_thread

Estimated 1,021,223,616 bytes = 1.02 GB

tcmalloc::tcmalloc_internal::SampleifyAllocation<>()
slow_alloc<>()
TCMallocInternalNew
yb::tablet::TransactionParticipant::Impl::LockAndFind()
yb::tablet::TransactionParticipant::RequestStatusAt()
yb::docdb::TransactionStatusCache::GetTransactionLocalState()
yb::docdb::IntentAwareIterator::ProcessIntent()
yb::docdb::IntentAwareIterator::SeekToSuitableIntent<>()
yb::docdb::IntentAwareIterator::Revalidate()
yb::docdb::DocRowwiseIterator::AdvanceIteratorToNextDesiredRow()
yb::docdb::DocRowwiseIterator::FetchNextImpl<>()
yb::docdb::DocRowwiseIterator::PgFetchNext()
yb::docdb::(anonymous namespace)::FilteringIterator::FetchNext()
yb::docdb::PgsqlReadOperation::Execute()
yb::tablet::Tablet::HandlePgsqlReadRequest()
yb::tserver::(anonymous namespace)::ReadQuery::Complete()
yb::tserver::(anonymous namespace)::ReadQuery::Run()
yb::rpc::(anonymous namespace)::Worker::Execute()
yb::thread::SuperviseThread()
start_thread

Estimated 360,168,816 bytes = 360 MB

tcmalloc::tcmalloc_internal::SampleifyAllocation<>()
slow_alloc<>()
TCMallocInternalNew
yb::tablet::TransactionParticipant::Impl::LockAndFind()
yb::tablet::TransactionParticipant::RequestStatusAt()
yb::docdb::(anonymous namespace)::ConflictResolver::DoResolveConflicts()
yb::docdb::(anonymous namespace)::ConflictResolver::Resolve()
yb::docdb::(anonymous namespace)::WaitOnConflictResolver::Run()
yb::tablet::TabletPeer::WriteAsync()
yb::tserver::TabletServiceImpl::PerformWrite()
yb::tserver::TabletServiceImpl::Write()
std::__1::__function::__func<>::operator()()
yb::tserver::TabletServerServiceIf::Handle()
yb::rpc::ServicePoolImpl::Handle()
yb::rpc::InboundCall::InboundCallTask::Run()
yb::rpc::(anonymous namespace)::Worker::Execute()
yb::thread::SuperviseThread()
start_thread

Together, the three stacks above sum to ~2.52 GB, which is close to the observed ~2.6 GB increase in memory.
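As a quick arithmetic check, the three estimates can be added up directly. This is only a sketch: the byte counts are copied from the stacks above, and the dictionary labels are shorthand chosen here for readability, not names from the heap profile.

```python
# Byte estimates copied from the three heap-profile stacks in this report.
# The labels are informal shorthand (an assumption), not profiler output.
estimates = {
    "ConflictResolver via TabletServiceImpl::Read": 1_137_550_736,
    "IntentAwareIterator via ReadQuery": 1_021_223_616,
    "ConflictResolver via TabletServiceImpl::Write": 360_168_816,
}

total_bytes = sum(estimates.values())
total_gb = total_bytes / 1e9  # decimal gigabytes, as used in the report

# Show each stack's share of the combined growth.
for label, size in estimates.items():
    share = size / total_bytes
    print(f"{label}: {size / 1e9:.2f} GB ({share:.0%} of the three stacks)")

print(f"total: {total_bytes:,} bytes = {total_gb:.2f} GB")
```

The exact total is 2,518,943,168 bytes, so the figure rounds to 2.52 GB in decimal units.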

The allocations seem to be coming from the wait-on-conflict path. More investigation is needed to determine where and why they occur.
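One possible starting point for that investigation is to check which frames all three stacks share. The sketch below compares the innermost frames copied from the stacks above (the sampling frames above `TCMallocInternalNew` and the outer RPC frames are left out for brevity). The helper `common_leading_frames` is a hypothetical name introduced here, not part of any YugabyteDB or tcmalloc tooling.

```python
def common_leading_frames(stacks):
    """Return the frames shared by every stack, starting from the innermost frame."""
    shared = []
    for frames in zip(*stacks):
        if len(set(frames)) != 1:
            break  # stacks diverge at this depth
        shared.append(frames[0])
    return shared


# Innermost frames of the three stacks, copied from this report.
read_conflict_stack = [
    "TCMallocInternalNew",
    "yb::tablet::TransactionParticipant::Impl::LockAndFind()",
    "yb::tablet::TransactionParticipant::RequestStatusAt()",
    "yb::docdb::(anonymous namespace)::ConflictResolver::DoResolveConflicts()",
]
read_iterator_stack = [
    "TCMallocInternalNew",
    "yb::tablet::TransactionParticipant::Impl::LockAndFind()",
    "yb::tablet::TransactionParticipant::RequestStatusAt()",
    "yb::docdb::TransactionStatusCache::GetTransactionLocalState()",
]
write_conflict_stack = list(read_conflict_stack)  # same innermost frames

shared = common_leading_frames(
    [read_conflict_stack, read_iterator_stack, write_conflict_stack]
)
for frame in shared:
    print(frame)
```

All three stacks allocate in `LockAndFind()` called from `RequestStatusAt()`, and the second stack reaches it through `IntentAwareIterator` on the read path rather than through `WaitOnConflictResolver`. This only narrows down where to look; it does not explain why the allocations grow.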

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information