yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB][Perf][m6i.4xlarge][Sysbench][oltp_read_only] TabletServer_re killed with "SIGSEGV" and multiple cores observed while executing CREATE with the oltp_read_only workload, after which the cluster became non-responsive. #14848

Closed mangesh-at-yb closed 1 year ago

mangesh-at-yb commented 1 year ago

Jira Link: DB-4145

Observed "TabletServer_re" killed with a segmentation fault (SIGSEGV) while running the sysbench "oltp_read_only" workload, during the CREATE phase, with the number of tables set to "1500". Please note that this is an "m6i.4xlarge" instance and OOM was never observed.

Setup:

    sysbench oltp_read_only --table_size=500000 --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=1500 --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=90 --time=1800 --pgsql-password=Password321! --db-driver=pgsql --pgsql-db=yugabyte --pgsql-port=5433 --pgsql-host=172.151.20.20,172.151.31.105,172.151.27.24 --threads=1 cleanup;

    PQexec() failed: 7 Timed out: Timed out waiting for Create Table
    FATAL: failed query was: CREATE TABLE sbtest894(
      id SERIAL,
      k INTEGER DEFAULT '0' NOT NULL,
      c CHAR(120) DEFAULT '' NOT NULL,
      pad CHAR(60) DEFAULT '' NOT NULL,
      PRIMARY KEY (id ASC)
    )  SPLIT AT VALUES((20833),(41666),(62500),(83333),(104166),(125000),(145833),(166666),(187500),(208333),(229166),(250000),(270833),(291666),(312500),(333333),(354166),(375000),(395833),(416666),(437500),(458333),(479166))
    FATAL: `sysbench.cmdline.call_command' function failed: ./src/lua/oltp_common.lua:255: SQL error, errno = 0, state = 'XX000': Timed out: Timed out waiting for Create Table
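The `SPLIT AT VALUES` list in the failed statement is consistent with cutting the `--table_size=500000` key range into 24 equal tablets (23 split points, rounded down). The sketch below reproduces those values and estimates how many tablets the CREATE phase requests across `--tables=1500`. The figure of 24 tablets per table is inferred from the statement rather than taken from the sysbench source, and the replication factor is not stated in this issue, so replicas are not counted.

```python
def split_points(table_size, num_tablets):
    """Evenly spaced range-split values over [0, table_size), rounded down."""
    return [i * table_size // num_tablets for i in range(1, num_tablets)]


TABLE_SIZE = 500_000      # --table_size from the sysbench command
TABLES = 1_500            # --tables from the sysbench command
TABLETS_PER_TABLE = 24    # assumption: inferred from the 23 split values

points = split_points(TABLE_SIZE, TABLETS_PER_TABLE)
total_tablets = TABLES * TABLETS_PER_TABLE  # before replication

print("first split points:", points[:3])
print("last split point:", points[-1])
print("tablets requested (unreplicated):", total_tablets)
```

Under these assumptions the run asks for 36,000 tablets before replication, which gives a sense of how much Raft traffic the CREATE phase can generate.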

Observed:


- **tserver.err:** reported for PID 3594 on node "ip-172-151-27-24":

      PC: @ 0x0 (unknown)
      SIGSEGV (@0x8) received by PID 3594 (TID 0x7fed36196700) from PID 8; stack trace:
      @ 0x3ae1786 yb::Status::CloneAndAddErrorCode()
      @ 0x3460e48 yb::rpc::OutboundCall::SetFailed()
      @ 0x344a3b2 yb::rpc::Connection::HandleCallResponse()
      @ 0x3524b94 yb::rpc::YBOutboundConnectionContext::HandleCall()
      @ 0x343bf7e yb::rpc::BinaryCallParser::Parse()
      @ 0x3526468 yb::rpc::YBOutboundConnectionContext::ProcessCalls()
      @ 0x3445390 yb::rpc::Connection::ProcessReceived()
      @ 0x34797ca yb::rpc::RefinedStream::ProcessReceived()
      @ 0x347a062 yb::rpc::RefinedStream::ProcessReceived()
      @ 0x3479c90 yb::rpc::RefinedStream::Read()
      @ 0x34797b4 yb::rpc::RefinedStream::ProcessReceived()
      @ 0x347a062 yb::rpc::RefinedStream::ProcessReceived()
      @ 0x35175f0 yb::rpc::TcpStream::TryProcessReceived()
      @ 0x3519b70 yb::rpc::TcpStream::Handler()
      @ 0x34360ec ev_invoke_pending
      @ 0x3439d0a ev_run
      @ 0x3471f2a yb::rpc::Reactor::RunThread()
      @ 0x3af3ec1 yb::Thread::SuperviseThread()
      @ 0x7fed3c3c4694 start_thread
      @ 0x7fed3c8c641d __clone



- Multiple core dumps were also seen, and the universe became non-responsive.

### Expected 

- Need to analyse why the system went into such a state. **Please note that we never hit OOM on either of the machines, and these are "m6i.4xlarge" instances. Creating "1500" tables**

bmatican commented 1 year ago

@mangesh-at-yb it seems we are core dumping while trying to prepare an error that the raft queue is full

Service unavailable (yb/rpc/service_pool.cc:229): RequestConsensusVote request on yb.consensus.ConsensusService from 172.151.27.24:44954 dropped due to backpressure. The service queue is full

Was this universe on Portal, so we could inspect logs & metrics for it? The raft queue getting full is pretty strange. That would suggest either that the system is otherwise overloaded (CPU or disk bottlenecked), or that there's some other issue in the system (e.g., some form of bottleneck / slowdown in raft). cc @rthallamko3
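For readers unfamiliar with the "dropped due to backpressure" message above, here is a minimal sketch of the general idea, assuming a fixed-size request queue that rejects new work once it is full. `ServicePool`, `max_queue_len`, and `enqueue` are hypothetical names chosen for illustration; this is not the code in `yb/rpc/service_pool.cc`, and the real queue size and rejection path are not shown in this issue.

```python
import queue


class ServicePool:
    """Hypothetical bounded request queue that rejects work when full."""

    def __init__(self, max_queue_len):
        self._queue = queue.Queue(maxsize=max_queue_len)
        self.dropped = 0

    def enqueue(self, request):
        try:
            # Non-blocking put: raises queue.Full instead of waiting.
            self._queue.put_nowait(request)
            return "OK"
        except queue.Full:
            # Backpressure: fail fast so the caller can retry later.
            self.dropped += 1
            return ("Service unavailable: %s dropped due to backpressure. "
                    "The service queue is full" % request)


# No worker drains the queue here, so the third request is rejected.
pool = ServicePool(max_queue_len=2)
results = [pool.enqueue("RequestConsensusVote") for _ in range(3)]
for r in results:
    print(r)
```

The caller has to handle the rejection as an error response. In this issue, the crash appears on the path that builds that error on the receiving side (`OutboundCall::SetFailed()` in the stack trace), not in the queue itself.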

Higher level, why are we testing with 2.15.1, rather than the latest build?

mangesh-at-yb commented 1 year ago

Yes, these universes are on Portal. We are using this build because some of our tests were conducted with it, and we will be testing with the latest stable builds as we go on. Please refer to the metrics page screenshot attached to DB-4145. We have not seen any major resource utilisation, but have a look at it and let us know if our understanding is wrong.

rthallamko3 commented 1 year ago

@hbhanawat, do you know if this issue repros on 2.17 builds? If not, can we close this?

hbhanawat commented 1 year ago

Not reproducible with latest builds. Closing this issue.