mangesh-at-yb closed this issue 1 year ago.
@mangesh-at-yb it seems we are core dumping while trying to prepare the error response indicating that the Raft queue is full:
Service unavailable (yb/rpc/service_pool.cc:229): RequestConsensusVote request on yb.consensus.ConsensusService from 172.151.27.24:44954 dropped due to backpressure. The service queue is full
Was this universe on Portal, so we could inspect logs & metrics for it? The Raft queue getting full is pretty strange. That would suggest either the system is otherwise overloaded (CPU or disk bottlenecked), or there's some other issue in the system (e.g. some form of bottleneck / slowdown in Raft). cc @rthallamko3
At a higher level, why are we testing with 2.15.1 rather than the latest build?
Yes, these universes are on Portal. We are using this build because some of our earlier tests were conducted with it; we will move to the latest stable builds as we go. Please refer to the metrics page screenshot attached to DB-4145. We have not seen any major resource utilization, but please have a look and let us know if our understanding is wrong.
@hbhanawat, do you know if this issue reproduces on 2.17 builds? If not, can we close this?
Not reproducible with the latest builds. Closing this issue.
Jira Link: DB-4145
observed "TabletServer_re" killed with Seg Fault (SIGSEGV) while running "sysbench workload" with "oltp_read_only", during CREATE phase, number of tables set to "1500". Please note that, its "m6i.4xlarge" instance and never observed OOM.
Setup:
Three m6i.4xlarge nodes, RF=3 cluster running with YB version: 2.15.1.0-b175
Below gFlags set:
Steps:
Clone the sysbench repo and build sysbench on one of the client machines; we recommend c5.2xlarge due to the heavy workload.
Configure a 3-node, RF=3 YB cluster with the said gFlags.
Execute the below commands from the client machine:
Cleanup (if the cluster has been used previously):
CREATE phase, after which cores can be observed:
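For reference, the commands we ran look roughly like the following. This is a sketch only: the tserver IP, database name, and table size are placeholders, and the exact sysbench fork/options used in the test may differ slightly.

```shell
# Build sysbench on the client machine (c5.2xlarge recommended).
git clone https://github.com/akopytov/sysbench.git
cd sysbench && ./autogen.sh && ./configure --with-pgsql && make -j && sudo make install

# Common connection settings (placeholders; point these at one of the YB tservers).
# YSQL listens on port 5433 by default.
COMMON_ARGS="--db-driver=pgsql \
  --pgsql-host=<tserver-ip> --pgsql-port=5433 \
  --pgsql-user=yugabyte --pgsql-db=yugabyte \
  --tables=1500 --table-size=100000"

# Cleanup (only if the cluster was used previously).
sysbench oltp_read_only $COMMON_ARGS cleanup

# CREATE phase -- this is where the tserver cores were observed.
sysbench oltp_read_only $COMMON_ARGS prepare
```

The `prepare` step creates all 1500 tables sequentially, which is what drives the burst of DDL load during the CREATE phase.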
Sysbench reports the below error during `CREATE TABLE sbtest894`:
Observed:
PC: @ 0x0 (unknown)
SIGSEGV (@0x8) received by PID 3594 (TID 0x7fed36196700) from PID 8; stack trace:
    @ 0x3ae1786 yb::Status::CloneAndAddErrorCode()
    @ 0x3460e48 yb::rpc::OutboundCall::SetFailed()
    @ 0x344a3b2 yb::rpc::Connection::HandleCallResponse()
    @ 0x3524b94 yb::rpc::YBOutboundConnectionContext::HandleCall()
    @ 0x343bf7e yb::rpc::BinaryCallParser::Parse()
    @ 0x3526468 yb::rpc::YBOutboundConnectionContext::ProcessCalls()
    @ 0x3445390 yb::rpc::Connection::ProcessReceived()
    @ 0x34797ca yb::rpc::RefinedStream::ProcessReceived()
    @ 0x347a062 yb::rpc::RefinedStream::ProcessReceived()
    @ 0x3479c90 yb::rpc::RefinedStream::Read()
    @ 0x34797b4 yb::rpc::RefinedStream::ProcessReceived()
    @ 0x347a062 yb::rpc::RefinedStream::ProcessReceived()
    @ 0x35175f0 yb::rpc::TcpStream::TryProcessReceived()
    @ 0x3519b70 yb::rpc::TcpStream::Handler()
    @ 0x34360ec ev_invoke_pending
    @ 0x3439d0a ev_run
    @ 0x3471f2a yb::rpc::Reactor::RunThread()
    @ 0x3af3ec1 yb::Thread::SuperviseThread()
    @ 0x7fed3c3c4694 start_thread
    @ 0x7fed3c8c641d __clone