[DocDB][Perf][m6i.4xlarge][Sysbench][oltp_read_only] TabletServer_re killed with "SIGSEGV" and multiple cores observed while executing CREATE with the oltp_read_only workload, after which the cluster became non-responsive.

mangesh-at-yb commented 1 year ago

Jira Link: DB-4145

Observed "TabletServer_re" killed with a segmentation fault (SIGSEGV) while running the sysbench "oltp_read_only" workload, during the CREATE phase, with the number of tables set to "1500". Please note that these are "m6i.4xlarge" instances and we never observed an OOM.

Setup:

Three m6i.4xlarge nodes, RF=3 cluster running with YB version: 2.15.1.0-b175

Below gFlags set:

    masterGFlags:
      - "name": "enable_automatic_tablet_splitting"
        "value": true
      - "name": "ysql_enable_packed_row"
        "value": true
      - "name": "default_memory_limit_to_ram_ratio"
        "value": 0.10
    tserverGFlags:
      - "name": "enable_automatic_tablet_splitting"
        "value": true
      - "name": "ysql_enable_packed_row"
        "value": true
      - "name": "default_memory_limit_to_ram_ratio"
        "value": 0.70
      - "name": "db_block_cache_size_percentage"
        "value": 50
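
For reference, the memory ratios above can be turned into rough absolute budgets. This is a minimal sketch, not a measurement from the cluster: it assumes an m6i.4xlarge has 64 GiB of RAM and that `db_block_cache_size_percentage` is applied to the tserver memory limit. The function name `memory_budget` is hypothetical.

```python
# Assumption: an m6i.4xlarge instance has 64 GiB of RAM.
TOTAL_RAM_GIB = 64

# Ratios copied from the gFlags above.
MASTER_RATIO = 0.10      # master default_memory_limit_to_ram_ratio
TSERVER_RATIO = 0.70     # tserver default_memory_limit_to_ram_ratio
BLOCK_CACHE_PCT = 50     # assumed to be a percentage of the tserver limit


def memory_budget(total_ram_gib):
    """Return approximate per-process memory budgets in GiB."""
    master_limit = total_ram_gib * MASTER_RATIO
    tserver_limit = total_ram_gib * TSERVER_RATIO
    block_cache = tserver_limit * BLOCK_CACHE_PCT / 100
    return {
        "master_limit_gib": round(master_limit, 2),
        "tserver_limit_gib": round(tserver_limit, 2),
        "block_cache_gib": round(block_cache, 2),
    }


if __name__ == "__main__":
    for name, gib in memory_budget(TOTAL_RAM_GIB).items():
        print(f"{name}: {gib}")
```

Under these assumptions the tserver is allowed well over half of the node's RAM, which is consistent with the report that no OOM was observed.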

Steps:

Clone the sysbench repo and build sysbench on one of the client machines; c5.2xlarge is recommended due to the heavy workload.
```
git clone --branch scanworkloads https://github.com/yugabyte/sysbench.git
```
Configure a 3-node, RF=3 YB cluster with the gFlags listed above.
Execute the commands below from the client machine:
Cleanup (if the cluster has been used previously):

sysbench oltp_read_only --table_size=500000 --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=1500 --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=90 --time=1800 --pgsql-password=Password321! --db-driver=pgsql --pgsql-db=yugabyte --pgsql-port=5433 --pgsql-host=172.151.20.20,172.151.31.105,172.151.27.24 --threads=1 cleanup;

CREATE phase, after which cores can be observed:

sysbench oltp_read_only --table_size=500000 --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=1500 --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=90 --time=1800 --pgsql-password=Password321! --db-driver=pgsql --pgsql-db=yugabyte --pgsql-port=5433 --pgsql-host=172.151.20.20,172.151.31.105,172.151.27.24 --threads=1 create;

Sysbench reports the error below during "CREATE TABLE sbtest894":

PQexec() failed: 7 Timed out: Timed out waiting for Create Table", "FATAL: failed query was: CREATE TABLE sbtest894(", "  id SERIAL,", "  k INTEGER DEFAULT '0' NOT NULL,", "  c CHAR(120) DEFAULT '' NOT NULL,", "  pad CHAR(60) DEFAULT '' NOT NULL,", "  PRIMARY KEY (id ASC)", ")  SPLIT AT VALUES((20833),(41666),(62500),(83333),(104166),(125000),(145833),(166666),(187500),(208333),(229166),(250000),(270833),(291666),(312500),(333333),(354166),(375000),(395833),(416666),(437500),(458333),(479166))", "FATAL: `sysbench.cmdline.call_command' function failed: ./src/lua/oltp_common.lua:255: SQL error, errno = 0, state = 'XX000': Timed out: Timed out waiting for Create Table"]}
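
The failed statement carries 23 split values, so each table is pre-split into 24 tablets. The sketch below reproduces those values and estimates how many tablet replicas the workload asks each node to host. It rests on assumptions: the even-split formula is inferred from the values in the error and is not taken from the sysbench source, each table is assumed to contribute only its 24 primary-key tablets (`--create_secondary=false`), and the helper names are hypothetical.

```python
def split_points(table_size, num_tablets):
    """Evenly spaced split values, using integer (floor) division."""
    return [i * table_size // num_tablets for i in range(1, num_tablets)]


def tablet_peers_per_node(tables, tablets_per_table, rf, nodes):
    """Tablet replicas each node hosts if replicas are spread evenly."""
    return tables * tablets_per_table * rf // nodes


if __name__ == "__main__":
    points = split_points(500_000, 24)
    print(points[:3], "...", points[-1])

    # Full workload: 1500 tables, RF=3, 3 nodes.
    print(tablet_peers_per_node(1500, 24, 3, 3))

    # Tables that already existed when sbtest894 timed out,
    # assuming tables are created in order starting from sbtest1.
    print(tablet_peers_per_node(893, 24, 3, 3))
```

With RF=3 on three nodes, every node holds a replica of every tablet, so under these assumptions the per-node count equals the total number of tablets created so far.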

Observed:

core_TabletServer_re.3594 on node "ip-172-151-27-24": Cause "Segmentation fault".

gdb -q -nh -iex "set auto-load safe-path /" /home/yugabyte/tserver/bin/yb-tserver ./core_TabletServer_re.3594


[Thread debugging using libthread_db enabled]
Using host libthread_db library "/home/yugabyte/yb-software/yugabyte-2.15.1.0-b175-centos-x86_64/linuxbrew/lib/libthread_db.so.1".
Core was generated by `/home/yugabyte/tserver/bin/yb-tserver --flagfile /home/yugabyte/tserver/conf/se'.
Program terminated with signal 11, Segmentation fault.
#0  Create<yb::Slice> (file_name=<optimized out>, code=<optimized out>, line_number=<optimized out>, msg=..., msg2=..., errors=..., dup_file_name=...)
at ../../src/yb/util/status.cc:339
339    ../../src/yb/util/status.cc: No such file or directory.
#0  Create<yb::Slice> (file_name=<optimized out>, code=<optimized out>, line_number=<optimized out>, msg=..., msg2=..., errors=..., dup_file_name=...)
at ../../src/yb/util/status.cc:339
339    ../../src/yb/util/status.cc: No such file or directory.
(gdb) bt full
#0  Create<yb::Slice> (file_name=<optimized out>, code=<optimized out>, line_number=<optimized out>, msg=..., msg2=..., errors=..., dup_file_name=...)
at ../../src/yb/util/status.cc:339
    size = 211
    len1 = 211
    file_name_size = 0
    len2 = 0
    errors_start = <optimized out>
    out = <optimized out>
#1  yb::Status::CloneAndAddErrorCode(yb::StatusErrorCode const&) const (this=0x13dd87770, error_code=...) at ../../src/yb/util/status.cc:592
    errors_slice = {
      begin_ = 0x13dd8769c "Service unavailable (yb/rpc/service_pool.cc:229): RequestConsensusVote request on yb.consensus.ConsensusService from 172.151.27.24:44954 dropped due to backpressure. The service queue is full, it has "..., end_ = 0x10000019a ""}
    new_errors_size = <optimized out>
    out = <optimized out>
    buffer = 0x7fed3618d810 "\017\004"
    inserted = false
    encoded_size = <optimized out>
#2  0x0000000003460e48 in yb::rpc::OutboundCall::SetFailed(yb::Status const&, std::__1::unique_ptr<yb::rpc::ErrorStatusPB, std::__1::default_delete<yb::rpc::ErrorStatusPB> >) (this=0xf412d420, status=..., err_pb=...) at ../../src/yb/rpc/outbound_call.cc:460
    invoke_callback = <optimized out>
#3  0x000000000344a3b2 in yb::rpc::Connection::HandleCallResponse(yb::rpc::CallData*) (this=<optimized out>, call_data=<optimized out>)
at ../../src/yb/rpc/outbound_call.cc:411
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    resp = {parsed_ = true, header_ = warning: RTTI symbol not found for class 'yb::rpc::ResponseHeader'
{<Message> = {<No data fields>}, static kIndexInFileMessages = 0, static kSidecarOffsetsFieldNumber = 3,
        static kCallIdFieldNumber = 1, static kIsErrorFieldNumber = 2,
        _internal_metadata_ = {<InternalMetadataWithArenaBase<google::protobuf::UnknownFieldSet, google::protobuf::internal::InternalMetadataWithArena>> = {
            ptr_ = 0x0, static kPtrTagMask = 1, static kPtrValueMask = -2}, <No data fields>}, _has_bits_ = {has_bits_ = {0}}, _cached_size_ = 0, sidecar_offsets_ = {
          static kInitialSize = 0, current_size_ = 0, total_size_ = 0, static kRepHeaderSize = 8, rep_ = 0x0}, call_id_ = 0, is_error_ = false},
      serialized_response_ = {
        begin_ = 0x13db73f19 "\n\323\001Service unavailable (yb/rpc/service_pool.cc:229): RequestConsensusVote request on yb.consensus.ConsensusService from 172.151.27.24:44954 dropped due to backpressure. The service queue is full, it h"..., end_ = 0x13db73ff1 ""},
      sidecar_bounds_ = {<small_vector_base<const unsigned char *, void, void>> = {<vector<const unsigned char *, boost::container::small_vector_allocator<const unsigned char *, boost::container::new_allocator<void>, void>, void>> = {
            m_holder = {<small_vector_allocator<const unsigned char *, boost::container::new_allocator<void>, void>> = {<new_allocator<const unsigned char *>> = {<No data fields>}, <No data fields>}, m_start = 0x7fed3618d970, m_size = 0, m_capacity = 16}}, static final_alignment = 8, m_storage_start = {aligner = {
              data = "\000\000\000\000\000\000\000"}, data = "\000\000\000\000\000\000\000"}}, <> = {m_rest_of_storage = {{aligner = {
                data = "\000\000\000\000\000\000\000"}, data = "\000\000\000\000\000\000\000"} <repeats 11 times>, {aligner = {data = "@\332\030\066\355\177\000"},
              data = "@\332\030\066\355\177\000"}, {aligner = {data = "M\241\022\003\000\000\000"}, data = "M\241\022\003\000\000\000"}, {aligner = {
                data = "\a.\202YS\035", <incomplete sequence \360>}, data = "\a.\202YS\035", <incomplete sequence \360>}, {aligner = {data = "\230G\267<?\"\365\006"},
              data = "\230G\267<?\"\365\006"}}}, static needed_extra_storages = 1, static needed_bytes = 48, static header_bytes = 24, static s_start = 24,
        static static_capacity = 2}, response_data_ = {buffer_ = {data_ = 0x0}}}
    awaiting = <optimized out>
    call = {__ptr_ = 0xf412d420, __cntrl_ = 0xf412d400}
#4  0x0000000003524b94 in non-virtual thunk to yb::rpc::YBOutboundConnectionContext::HandleCall(std::__1::shared_ptr<yb::rpc::Connection> const&, yb::rpc::CallData*) ()
at ../../src/yb/rpc/yb_rpc.cc:507
    boost::optional_ns::in_place_init_if = {<No data fields>}
    boost::optional_ns::in_place_init = {<No data fields>}
    yb::rpc::(anonymous namespace)::kConnectionHeaderBytes = "YB\001"
    fLB::FLAGS_noenable_rpc_keepalive = true
    yb::StronglyTypedBool<yb::rpc::SkipEmptyMessages_Tag>::kTrue = {static kTrue = {static kTrue = <same as static member of an already seen type>,
        static kFalse = <same as static member of an already seen type>, static kValues = <optimized out>, value_ = 84},
      static kFalse = <same as static member of an already seen type>, static kValues = <optimized out>, value_ = true}
    yb::StronglyTypedBool<yb::rpc::IncludeHeader_Tag>::kFalse = {static kTrue = {static kTrue = <same as static member of an already seen type>,
        static kFalse = <same as static member of an already seen type>, static kValues = <optimized out>, value_ = 84},
      static kFalse = <same as static member of an already seen type>, static kValues = <optimized out>, value_ = false}
    fLU64::FLAGS_nomin_sidecar_buffer_size = 16384
    fLU64::FLAGS_min_sidecar_buffer_size = 16384
    fLU64::FLAGS_noTEST_yb_inbound_big_calls_parse_delay_ms = 0
    fLB::FLAGS_enable_rpc_keepalive = true
    fLU64::FLAGS_TEST_yb_inbound_big_calls_parse_delay_ms = 0
#5  0x000000000343bf7e in yb::rpc::BinaryCallParser::Parse(std::__1::shared_ptr<yb::rpc::Connection> const&, boost::container::small_vector<iovec, 4ul, void, void> const&, yb::StronglyTypedBool<yb::rpc::ReadBufferFull_Tag>, std::__1::shared_ptr<yb::MemTracker> const*) (this=0x228a7288, connection=..., data=..., read_buffer_full=...,
tracker_for_throttle=0x0) at ../../src/yb/rpc/binary_call_parser.cc:168
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    vlocal__ = 0x42fa1f0 <google::kLogSiteUninitialized>
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    LOG_THROTTLER_139 = {num_suppressed_ = 0, last_ts_ = 0}
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    full_input_size = 229
    consumed = <optimized out>
    header_size = 4
    body_offset = 4
#6  0x0000000003526468 in yb::rpc::YBOutboundConnectionContext::ProcessCalls(std::__1::shared_ptr<yb::rpc::Connection> const&, boost::container::small_vector<iovec, 4ul, void, void> const&, yb::StronglyTypedBool<yb::rpc::ReadBufferFull_Tag>) (this=<optimized out>, connection=..., data=..., read_buffer_full=...)
at ../../src/yb/rpc/yb_rpc.cc:528
No locals.
#7  0x0000000003445390 in yb::rpc::Connection::ProcessReceived(yb::StronglyTypedBool<yb::rpc::ReadBufferFull_Tag>) (this=0x229e05b8, read_buffer_full=...)
at ../../src/yb/rpc/connection.cc:316
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    result = {success_ = 120, {status_ = {state_ = {px = 0x7fed3618ddb8}}, value_ = {consumed = 140656791575992, buffer = {begin_ = 0xffffffff "",
            end_ = 0x7fed3618dd10 "(\335\030\066\355\177"}, bytes_to_skip = 55165097}}}
#8  0x00000000034797ca in yb::rpc::RefinedStream::ProcessReceived(yb::StronglyTypedBool<yb::rpc::ReadBufferFull_Tag>) (this=0x225bda40, read_buffer_full=...)
at ../../src/yb/rpc/refined_stream.cc:151
No locals.
#9  0x000000000347a062 in non-virtual thunk to yb::rpc::RefinedStream::ProcessReceived(yb::StronglyTypedBool<yb::rpc::ReadBufferFull_Tag>) ()
No locals.
#10 0x0000000003479c90 in yb::rpc::RefinedStream::Read() (this=0x225bd960) at ../../src/yb/rpc/refined_stream.cc:316
    vlocal__ = 0x42fa1f0 <google::kLogSiteUninitialized>
    vlocal__ = 0x42fa1f0 <google::kLogSiteUninitialized>
#11 0x00000000034797b4 in yb::rpc::RefinedStream::ProcessReceived(yb::StronglyTypedBool<yb::rpc::ReadBufferFull_Tag>) (this=0x225bd960, read_buffer_full=...)
at ../../src/yb/rpc/refined_stream.cc:157
No locals.
#12 0x000000000347a062 in non-virtual thunk to yb::rpc::RefinedStream::ProcessReceived(yb::StronglyTypedBool<yb::rpc::ReadBufferFull_Tag>) ()
No locals.
#13 0x00000000035175f0 in yb::rpc::TcpStream::TryProcessReceived() (this=0x2273f8c0) at ../../src/yb/rpc/tcp_stream.cc:406
    read_buffer = @0x225bd9d0: warning: RTTI symbol not found for class 'yb::rpc::CircularReadBuffer'
{_vptr$StreamReadBuffer = 0x2187f90 <vtable for yb::rpc::CircularReadBuffer+16>}
#14 0x0000000003519b70 in yb::rpc::TcpStream::Handler(ev::io&, int) (this=0x2273f8c0, watcher=..., revents=1) at ../../src/yb/rpc/tcp_stream.cc:332
    vlocal__ = 0x42fa1f0 <google::kLogSiteUninitialized>
    vlocal__ = 0x4600a68 <fLI::FLAGS_v>
    vlocal__ = 0x42fa1f0 <google::kLogSiteUninitialized>
    status = {state_ = {px = 0x0}}
#15 0x00000000034360ec in ev_invoke_pending ()
    boost::optional_ns::in_place_init = {<No data fields>}
    boost::optional_ns::in_place_init_if = {<No data fields>}
    yb::StronglyTypedBool<yb::AlreadyConsumed_Tag>::kTrue = {static kTrue = {static kTrue = <same as static member of an already seen type>,
        static kFalse = <same as static member of an already seen type>, static kValues = <optimized out>, value_ = 84},
      static kFalse = <same as static member of an already seen type>, static kValues = <optimized out>, value_ = true}
    fLI64::FLAGS_norpc_throttle_threshold_bytes = 1048576
    fLB::FLAGS_nobinary_call_parser_reject_on_mem_tracker_hard_limit = true
    fLI64::FLAGS_rpc_throttle_threshold_bytes = 1048576
    fLB::FLAGS_binary_call_parser_reject_on_mem_tracker_hard_limit = true
#16 0x0000000003439d0a in ev_run ()
    boost::optional_ns::in_place_init = {<No data fields>}
    boost::optional_ns::in_place_init_if = {<No data fields>}
    yb::StronglyTypedBool<yb::AlreadyConsumed_Tag>::kTrue = {static kTrue = {static kTrue = <same as static member of an already seen type>,
        static kFalse = <same as static member of an already seen type>, static kValues = <optimized out>, value_ = 84},
      static kFalse = <same as static member of an already seen type>, static kValues = <optimized out>, value_ = true}
    fLI64::FLAGS_norpc_throttle_threshold_bytes = 1048576
    fLB::FLAGS_nobinary_call_parser_reject_on_mem_tracker_hard_limit = true
    fLI64::FLAGS_rpc_throttle_threshold_bytes = 1048576
    fLB::FLAGS_binary_call_parser_reject_on_mem_tracker_hard_limit = true
#17 0x0000000003471f2a in run (this=0x5c02e08, flags=0)
at /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20220630123344-af96d73e39-almalinux8-x86_64-clang13-linuxbrew-full-lto/installed/common/include/ev++.h:211
No locals.
#18 yb::rpc::Reactor::RunThread() (this=0x5c02d80) at ../../src/yb/rpc/reactor.cc:498
    vlocal__ = 0x42fa1f0 <google::kLogSiteUninitialized>
#19 0x0000000003af3ec1 in yb::Thread::SuperviseThread(void*) (arg=<optimized out>)
at /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20220630123344-af96d73e39-almalinux8-x86_64-clang13-linuxbrew-full-lto/installed/uninstrumented/libcxx/include/c++/v1/__functional/function.h:498
    occurrences_740 = 0
    occurrences_mod_n_740 = 0
    name = {<__basic_string_common<true>> = {<No data fields>}, static __short_mask = 1, static __long_mask = 1, __r_ = {<> = {__value_ = {{__l = {__cap_ = 49,
                __size_ = 25, __data_ = 0x5e6d530 "TabletServer_reactor-3612"}, __s = {{__size_ = 49 '1', __lx = 49 '1'},
                __data_ = "\000\000\000\000\000\000\000\031\000\000\000\000\000\000\000\060\325\346\005\000\000\000"}, __r = {__words = {49, 25,
                  99013936}}}}}, <> = {<allocator<char>> = {<> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>},
      static npos = 18446744073709551615}
    thread_mgr_ref = {__ptr_ = 0x5bea118, __cntrl_ = 0x5bea100}
    system_tid = <optimized out>
    thread_ref = {ptr_ = 0x5ea2900}
    loop_count = <optimized out>
#20 0x00007fed3c3c4694 in start_thread (arg=0x7fed36196700) at pthread_create.c:333
    __res = <optimized out>
    pd = 0x7fed36196700
    now = <optimized out>
    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140656791611136, -7473150436250656469, 0, 140734271704847, 99012928, 140656791611136, 7462576131535364395,
            7462589090475761963}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
    not_first_call = <optimized out>
    pagesize_m1 = <optimized out>
    sp = <optimized out>
    freesize = <optimized out>
    __PRETTY_FUNCTION__ = "start_thread"
#21 0x00007fed3c8c641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.


- **tserver.err:** reported for 3594 on node "ip-172-151-27-24":

PC: @ 0x0 (unknown) SIGSEGV (@0x8) received by PID 3594 (TID 0x7fed36196700) from PID 8; stack trace: @ 0x3ae1786 yb::Status::CloneAndAddErrorCode() @ 0x3460e48 yb::rpc::OutboundCall::SetFailed() @ 0x344a3b2 yb::rpc::Connection::HandleCallResponse() @ 0x3524b94 yb::rpc::YBOutboundConnectionContext::HandleCall() @ 0x343bf7e yb::rpc::BinaryCallParser::Parse() @ 0x3526468 yb::rpc::YBOutboundConnectionContext::ProcessCalls() @ 0x3445390 yb::rpc::Connection::ProcessReceived() @ 0x34797ca yb::rpc::RefinedStream::ProcessReceived() @ 0x347a062 yb::rpc::RefinedStream::ProcessReceived() @ 0x3479c90 yb::rpc::RefinedStream::Read() @ 0x34797b4 yb::rpc::RefinedStream::ProcessReceived() @ 0x347a062 yb::rpc::RefinedStream::ProcessReceived() @ 0x35175f0 yb::rpc::TcpStream::TryProcessReceived() @ 0x3519b70 yb::rpc::TcpStream::Handler() @ 0x34360ec ev_invoke_pending @ 0x3439d0a ev_run @ 0x3471f2a yb::rpc::Reactor::RunThread() @ 0x3af3ec1 yb::Thread::SuperviseThread() @ 0x7fed3c3c4694 start_thread @ 0x7fed3c8c641d __clone



- Multiple cores were seen as well, and the universe became non-responsive.
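
The single-line stack trace from tserver.err is easier to compare against the gdb backtrace once it is split into frames. The following is a minimal sketch that assumes the format seen in this report (`@ <address> <symbol>` entries after a `stack trace:` marker); the `parse_frames` helper is hypothetical and the sample line is a trimmed excerpt of the log line above.

```python
import re

# One frame looks like "@ 0x3ae1786 yb::Status::CloneAndAddErrorCode()".
# The symbol runs until the next "@ 0x..." entry or the end of the line.
FRAME_RE = re.compile(r"@\s+(0x[0-9a-fA-F]+)\s+(.+?)(?=\s+@\s+0x|$)")


def parse_frames(line):
    """Return a list of (address, symbol) tuples from a one-line trace."""
    # Only look at the text after the "stack trace:" marker so the
    # "PC: @ 0x0 (unknown)" header is not mistaken for a frame.
    _, _, trace = line.partition("stack trace:")
    return [(addr, sym.strip()) for addr, sym in FRAME_RE.findall(trace)]


if __name__ == "__main__":
    sample = (
        "PC: @ 0x0 (unknown) SIGSEGV (@0x8) received by PID 3594 "
        "(TID 0x7fed36196700) from PID 8; stack trace: "
        "@ 0x3ae1786 yb::Status::CloneAndAddErrorCode() "
        "@ 0x3460e48 yb::rpc::OutboundCall::SetFailed() "
        "@ 0x344a3b2 yb::rpc::Connection::HandleCallResponse()"
    )
    for index, (addr, sym) in enumerate(parse_frames(sample)):
        print(f"#{index} {addr} {sym}")
```

A line without the `stack trace:` marker yields an empty list rather than a guess.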

### Expected 

- Need to analyse why the system went into such a state. **Please note that we never hit OOM on the machines and these are "m6i.4xlarge" instances.** The failure occurred while creating "1500" tables.

bmatican commented 1 year ago

@mangesh-at-yb it seems we are core dumping while trying to prepare an error that the raft queue is full

Service unavailable (yb/rpc/service_pool.cc:229): RequestConsensusVote request on yb.consensus.ConsensusService from 172.151.27.24:44954 dropped due to backpressure. The service queue is full

Was this universe on Portal, so we could inspect logs & metrics for it? The raft queue getting full is pretty strange. That would suggest either the system is otherwise overloaded (CPU or disk bottlenecked), or there's some other issue in the system (eg: some form of bottleneck / slowdown in raft). cc @rthallamko3
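
To illustrate the kind of backpressure rejection quoted above, here is a minimal sketch of a bounded service queue that refuses new requests once it is full. It is not YugabyteDB's `service_pool.cc` implementation; the class names, the queue length, and the message format are assumptions chosen only to mirror the wording of the error.

```python
import queue


class ServiceUnavailable(Exception):
    """Raised when a request is dropped because the queue is full."""


class BoundedServiceQueue:
    """Hypothetical fixed-capacity queue that rejects instead of blocking."""

    def __init__(self, max_len):
        self._q = queue.Queue(maxsize=max_len)

    def submit(self, request):
        try:
            # put_nowait raises queue.Full instead of waiting for space.
            self._q.put_nowait(request)
        except queue.Full:
            raise ServiceUnavailable(
                f"{request} dropped due to backpressure. "
                "The service queue is full"
            ) from None

    def take(self):
        """Remove and return the oldest queued request."""
        return self._q.get_nowait()


if __name__ == "__main__":
    q = BoundedServiceQueue(max_len=2)
    q.submit("RequestConsensusVote #1")
    q.submit("RequestConsensusVote #2")
    try:
        q.submit("RequestConsensusVote #3")
    except ServiceUnavailable as err:
        print(err)
```

In a design like this the queue only fills when requests arrive faster than workers drain them, which is why a full queue usually points to an overloaded or stalled consumer.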

Higher level, why are we testing with 2.15.1, rather than the latest build?

mangesh-at-yb commented 1 year ago

Yes, these universes are on Portal. We are using this build because some of our tests were conducted with it, and we will be testing with the latest stable builds as we go on. Please refer to the metrics page screenshot attached to DB-4145; we have not seen any major resource utilisation, but please have a look and let us know if our understanding is wrong.

rthallamko3 commented 1 year ago

@hbhanawat, do you know if this issue repros on 2.17 builds? If not, can we close this?

hbhanawat commented 1 year ago

Not reproducible with latest builds. Closing this issue.
