yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] TSAN issues around PG signal handling #15950

Open AnjanaRatnayake opened 1 year ago

AnjanaRatnayake commented 1 year ago

Jira Link: DB-5361

Description

I think this is a data-race bug. I hit it by compiling the master branch with TSAN and creating a cluster with 3 yb-masters and 3 yb-tservers.

WARNING: ThreadSanitizer: data race (pid=2)'
  Read of size 8 at 0x7b5000100030 by thread T25:'
    #0 cds::gc::dhp::retired_array::init() ??:? (libcds.so.2.3.3+0xd768)'
    #1 cds::gc::dhp::smr::alloc_thread_data() ??:? (libcds.so.2.3.3+0xc192)'
    #2 cds::gc::dhp::smr::attach_thread() ??:? (libcds.so.2.3.3+0xbe80)'
    #3 cds::threading::ThreadData::init() ??:? (libcds.so.2.3.3+0xdc4b)'
    #4 yb::Thread::SuperviseThread(void*) ??:? (libyb_util.so+0x354e4d)'
'

  Previous write of size 8 at 0x7b5000100030 by thread T16:'
    #0 memset ??:? (yb-tserver+0x90e8d)'
    #1 cds::gc::dhp::retired_array::fini() ??:? (libcds.so.2.3.3+0xd4a4)'
    #2 cds::gc::dhp::smr::free_thread_data(cds::gc::dhp::smr::thread_record*, bool) ??:? (libcds.so.2.3.3+0xc48a)'
    #3 cds::gc::dhp::smr::detach_thread() ??:? (libcds.so.2.3.3+0xc23c)'
    #4 cds::threading::ThreadData::fini() ??:? (libcds.so.2.3.3+0xe089)'
    #5 yb::Thread::FinishThread(void*) ??:? (libyb_util.so+0x355193)'
    #6 yb::Thread::SuperviseThread(void*) ??:? (libyb_util.so+0x354e7c)'
'
  As if synchronized via sleep:'
    #0 nanosleep ??:? (yb-tserver+0x8480d)'
    #1 yb::Thread::SuperviseThread(void*) ??:? (libyb_util.so+0x3544ad)'
'
  Location is heap block of size 488 at 0x7b5000100000 allocated by thread T16:'
    #0 operator new[](unsigned long) ??:? (yb-tserver+0x1068d6)'
    #1 cds::gc::dhp::(anonymous namespace)::default_alloc_memory(unsigned long) dhp.cpp:? (libcds.so.2.3.3+0xd281)'
    #2 cds::gc::dhp::smr::alloc_thread_data() ??:? (libcds.so.2.3.3+0xbfb6)'
    #3 cds::gc::dhp::smr::attach_thread() ??:? (libcds.so.2.3.3+0xbe80)'
    #4 cds::threading::ThreadData::init() ??:? (libcds.so.2.3.3+0xdc4b)'
    #5 yb::Thread::SuperviseThread(void*) ??:? (libyb_util.so+0x354e4d)'
'
  Thread T25 'iotp_TabletServ' (tid=34, running) created by main thread at:'
    #0 pthread_create ??:? (yb-tserver+0x8701d)'
    #1 yb::Thread::StartThread(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<void ()>, scoped_refptr<yb::Thread>*) ??:? (libyb_util.so+0x3535ea)'
    #2 yb::Result<scoped_refptr<yb::Thread> > yb::Thread::Make<std::__1::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*> >(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::__bind<void (yb::rpc::IoThreadPool::Impl::*)(), yb::rpc::IoThreadPool::Impl*>&&) ??:? (libyrpc.so+0xe93cf)'
    #3 yb::rpc::IoThreadPool::Impl::Impl(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long) ??:? (libyrpc.so+0xe87dd)'
    #4 yb::rpc::IoThreadPool::IoThreadPool(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long) ??:? (libyrpc.so+0xe8407)'
    #5 yb::rpc::Messenger::Messenger(yb::rpc::MessengerBuilder const&) ??:? (libyrpc.so+0xf638c)'
    #6 yb::rpc::MessengerBuilder::Build() ??:? (libyrpc.so+0xf0eb7)'
    #7 yb::server::RpcServerBase::Init() ??:? (libserver_process.so+0x8b8cc)'
    #8 yb::server::RpcAndWebServerBase::Init() ??:? (libserver_process.so+0x8e9b2)'
    #9 yb::tserver::DbServerBase::Init() ??:? (libtserver.so+0x144a2b)'
    #10 yb::tserver::TabletServer::Init() ??:? (libtserver.so+0x2011db)'
    #11 yb::tserver::TabletServerMain(int, char**) ??:? (libtserver_main_impl.so+0x12f4d)'
    #12 main ??:? (yb-tserver+0x1077de)'
'
  Thread T16 'ybclientcb [wor' (tid=19, finished) created by main thread at:'
    #0 pthread_create ??:? (yb-tserver+0x8701d)'
    #1 yb::Thread::StartThread(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<void ()>, scoped_refptr<yb::Thread>*) ??:? (libyb_util.so+0x3535ea)'
    #2 yb::ThreadPool::CreateThreadUnlocked() ??:? (libyb_util.so+0x365a6a)'
    #3 yb::ThreadPool::Init() ??:? (libyb_util.so+0x361d4b)'
    #4 yb::ThreadPoolBuilder::Build(std::__1::unique_ptr<yb::ThreadPool, std::__1::default_delete<yb::ThreadPool> >*) const ??:? (libyb_util.so+0x361c11)'
    #5 yb::client::YBClientBuilder::DoBuild(yb::rpc::Messenger*, std::__1::unique_ptr<yb::client::YBClient, std::__1::default_delete<yb::client::YBClient> >*) ??:? (libyb_client.so+0x2cf30b)'
    #6 yb::client::YBClientBuilder::Build(yb::rpc::Messenger*) ??:? (libyb_client.so+0x2cf9d9)'
    #7 yb::AutoFlagsManager::LoadFromMaster(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::vector<yb::HostPort, std::__1::allocator<yb::HostPort> >, std::__1::allocator<std::__1::vector<yb::HostPort, std::__1::allocator<yb::HostPort> > > > const&, yb::StronglyTypedBool<yb::ApplyNonRuntimeAutoFlags_Tag>) ??:? (libyb_client.so+0x2b57a6)'
    #8 yb::tserver::TabletServer::InitAutoFlags() ??:? (libtserver.so+0x201fdb)'
    #9 yb::server::RpcAndWebServerBase::Init() ??:? (libserver_process.so+0x8e994)'
    #10 yb::tserver::DbServerBase::Init() ??:? (libtserver.so+0x144a2b)'
    #11 yb::tserver::TabletServer::Init() ??:? (libtserver.so+0x2011db)'
    #12 yb::tserver::TabletServerMain(int, char**) ??:? (libtserver_main_impl.so+0x12f4d)'
    #13 main ??:? (yb-tserver+0x1077de)'
'
SUMMARY: ThreadSanitizer: data race ??:? in cds::gc::dhp::retired_array::init()'

AnjanaRatnayake commented 1 year ago

This is the commit I was using to produce the bug: 45ca13c368cfaa1f1995656290c0eb74ce5cb931

bmatican commented 1 year ago

@AnjanaRatnayake How are you building the code with TSAN? In our regular build, we use a suppression file (build-support/tsan-suppressions.txt), which includes race:cds::gc::dhp::retired_array, so this should not show up.
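
As a quick way to confirm that a given entry is present in the suppression file, something like the following sketch could be used. `has_suppression` is a hypothetical helper, not part of the YugabyteDB tree, and the demo writes a throwaway file rather than reading the real build-support/tsan-suppressions.txt. It only checks for an exact `race:` line; it does not reproduce how TSAN itself matches patterns against stack frames.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if FILE contains the exact line "race:PATTERN".
has_suppression() {
  local file="$1" pattern="$2"
  # -F: treat the pattern as a fixed string, -x: match the whole line,
  # -q: no output, only an exit status. "--" ends option parsing.
  grep -Fxq -- "race:${pattern}" "$file"
}

# Throwaway file standing in for build-support/tsan-suppressions.txt.
demo_file="$(mktemp)"
printf '%s\n' '# demo suppressions' 'race:cds::gc::dhp::retired_array' > "$demo_file"

if has_suppression "$demo_file" 'cds::gc::dhp::retired_array'; then
  echo "suppressed"
else
  echo "not suppressed"
fi
```

Note that a matching entry only has an effect if the running process is actually told about the file.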

bmatican commented 1 year ago

Closing for now, since the suppressions should keep this from surfacing. We can look into it further if they are not working for some reason.

AnjanaRatnayake commented 1 year ago

This is how I am building yugabyte with tsan. I am not touching any tsan-suppressions.txt.

export PATH=/usr/local/bin:$PATH && ./yb_build.sh tsan \
    --download-thirdparty \
    --ninja \
    --skip-tests \
    --clang15 \
    --no-linuxbrew

bmatican commented 1 year ago

If there was a data race in yb::ThreadSafeObjectPool<yb::internal::ArenaBase>, would this be more likely to be an actual issue, since it is not in tsan-suppressions.txt?

@AnjanaRatnayake That might be so! Could you share a stack with that?

davidsearle-antithesis commented 1 year ago

@bmatican I'm part of the team working on this. It might be easier for us to show you our output on a Zoom call. That way we can also show you how we hit the Yugabyte stack and force this to occur. Are you available any time next week?

bmatican commented 1 year ago

@davidsearle-antithesis I checked internally with @mbautin, and he clarified that you need to start the actual binary process with a TSAN_OPTIONS env var that specifies the suppression file. Here is a snippet from our test script prep work:

  # Don't add a hyphen after the regex so we can handle both tsan and tsan_slow.
  if [[ $build_root_basename =~ ^tsan ]]; then
    # Configure TSAN (ignored if this isn't a TSAN build).
    #
    # Deadlock detection (new in clang 3.5) is disabled because:
    # 1. The clang 3.5 deadlock detector crashes in some YB unit tests. It
    #    needs compiler-rt commits c4c3dfd, 9a8efe3, and possibly others.
    # 2. Many unit tests report lock-order-inversion warnings; they should be
    #    fixed before reenabling the detector.
    TSAN_OPTIONS="detect_deadlocks=0"
    TSAN_OPTIONS+=" suppressions=$YB_SRC_ROOT/build-support/tsan-suppressions.txt"
    TSAN_OPTIONS+=" history_size=7"
    TSAN_OPTIONS+=" external_symbolizer_path=$ASAN_SYMBOLIZER_PATH"
    if [[ ${YB_SANITIZERS_ENABLE_COREDUMP:-0} == "1" ]]; then
      TSAN_OPTIONS+=" disable_coredump=false"
    fi
    export TSAN_OPTIONS
  fi

Could you try that out and see if it works as expected?
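
For a manually started process, the same idea can be sketched as below. `build_tsan_options` is a hypothetical helper and the paths are placeholders; the option names are the ones from the snippet above, and the symbolizer and coredump settings are left out for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble a TSAN_OPTIONS string for a given source root.
build_tsan_options() {
  local src_root="$1"
  local opts="detect_deadlocks=0"
  opts+=" suppressions=${src_root}/build-support/tsan-suppressions.txt"
  opts+=" history_size=7"
  printf '%s\n' "$opts"
}

# YB_SRC_ROOT is assumed to point at the yugabyte-db checkout;
# the fallback path is only a placeholder.
TSAN_OPTIONS="$(build_tsan_options "${YB_SRC_ROOT:-/path/to/yugabyte-db}")"
export TSAN_OPTIONS
echo "TSAN_OPTIONS=${TSAN_OPTIONS}"

# Start the server from this same shell so it inherits the variable, e.g.:
#   /path/to/build/bin/yb-tserver <your usual flags>
```

The variable has to be present in the environment of the process being started; setting it only at build time has no effect on which reports are suppressed at run time.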

davidsearle-antithesis commented 1 year ago

Sure thing. We're running this now and will let you know.

AnjanaRatnayake commented 1 year ago

@bmatican After running your code with the TSAN suppressions, these are some of the thread-related issues we were able to find. Could any of these be real threading bugs? It would be a lot easier to hop on a Zoom call to show you all the issues we came across and noted.

ThreadSanitizer: data race thread_pool.cc:? in yb::rpc::(anonymous namespace)::Worker::Execute()
ThreadSanitizer: signal-unsafe call inside of a signal ??:? in __interceptor_calloc
ThreadSanitizer: data race ??:? in SwitchBackToLocalLatch
ThreadSanitizer: data race xact.c:? in CommitTransaction
ThreadSanitizer: data race ??:? in disable_timeout
ThreadSanitizer: data race ??:? in WaitEventSetWait
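
A list like the one above can be produced by tallying the SUMMARY lines that TSAN prints at the end of each report. The sketch below assumes the reports were captured in plain-text log files; `summarize_tsan` is a hypothetical helper and the demo log is fabricated purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count distinct ThreadSanitizer SUMMARY lines in log files.
summarize_tsan() {
  # grep -h: print matching lines without file names.
  # sed: drop everything up to and including the SUMMARY marker.
  # sort | uniq -c: count identical summaries; sort -rn: most frequent first.
  grep -h 'SUMMARY: ThreadSanitizer:' "$@" \
    | sed 's/^.*SUMMARY: ThreadSanitizer: //' \
    | sort | uniq -c | sort -rn
}

# Illustrative input only; real input would be wherever the TSAN reports were written.
demo_log="$(mktemp)"
cat > "$demo_log" <<'EOF'
SUMMARY: ThreadSanitizer: data race xact.c:? in CommitTransaction
unrelated log line
SUMMARY: ThreadSanitizer: data race ??:? in disable_timeout
SUMMARY: ThreadSanitizer: data race xact.c:? in CommitTransaction
EOF

summarize_tsan "$demo_log"
```

Grouping by summary line is coarse, since two different races can end in the same function, so the full stacks are still needed for triage.
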
bmatican commented 1 year ago

@AnjanaRatnayake Sure, I would be happy to connect! To make it easier to coordinate, can you join our public Slack workspace at yugabyte.com/slack and drop a note in #yb-users, so we can discuss further?

In the meantime, if you could upload the respective stack traces, for the 6 issues above, that would already help kickstart some internal investigations. Some might be happening rarely even in our test suite, but we just have not triaged / observed them yet.

AnjanaRatnayake commented 1 year ago

Hey @bmatican, I have sent you the full stack traces on the Yugabyte Slack. Let me know what you think and if you need anything else.

bmatican commented 1 year ago

Uploading the stacks here, for easier collaboration: https://gist.github.com/bmatican/6962120c6a84ac8ec61a9ec6c52f84a6

deeps1991 commented 1 year ago

@AnjanaRatnayake @davidsearle-antithesis We recently fixed some issues with signal handling here: #15925. Some of the TSAN stacks you reported are similar, so it looks like the same root cause.

To be sure, could you please run the same test with the latest master (or any commit after 56922a80abca531c6e1fddf5d5f513100a208ef2) and let me know if you still see the issue?

davidsearle-antithesis commented 1 year ago

Hi Deepthi,

Can we arrange a call to discuss please?

-Dave

On Fri, Jun 2, 2023 at 3:39 PM Deepthi @.***> wrote:

@AnjanaRatnayake https://github.com/AnjanaRatnayake @davidsearle-antithesis https://github.com/davidsearle-antithesis We recently fixed some issues with signal handling here: #15925 https://github.com/yugabyte/yugabyte-db/issues/15925 Please could you run the same test with the latest master (or any commit after this https://github.com/yugabyte/yugabyte-db/commit/56922a80abca531c6e1fddf5d5f513100a208ef2) and let me know if you still see the issue?
