[DocDB] Segmentation Fault in yb-master: CatalogManager::GetClusterConfig(), mutex issue #21775

Closed: shishir2001-yb closed this issue 5 months ago

shishir2001-yb commented 6 months ago

Jira Link: DB-10650

Description

Version: 2.23.0.0-b91
Logs: https://drive.google.com/file/d/1iRVJhFbKwnfpUP_I3vhLgLSEF5QLD3RS/view?usp=sharing

Test Description:
        1. Create a cluster with required g-flags
        2. Start the cross DB DDL workload, which executes DDLs and DMLs across databases concurrently (50 colocated
           databases and 100 non-colocated databases); run this for 20-30 mins
        3. Create a PITR schedule on 10 random databases
        4. Start a while loop and run it for 120 mins
          a. Note down the time for PITR(0)
          b. Create a backup of 1 random database
          c. Start the cross DB DDL workload and stop it after 10 mins
          d. Note down the time for PITR(1)
          e. Start the cross DB DDL workload and run it for 10 mins
          f. Execute PITR on all 10 databases at random times (between 1-9 seconds ago)
          g. Restore to PITR(1)
          h. Validate data
          i. Restore to PITR(0) with a probability of 0.6 and validate data
          j. Delete the PITR schedule for the backup db 
          k. Drop the database 
          l. Restore the backup
          m. Create the snapshot schedule for this new DB

Full Coredump: https://drive.google.com/file/d/19SkYXdNq4lMJZRUAAsGUwKadX_con4dQ/view?usp=sharing

(lldb) target create "/home/yugabyte/master/bin/yb-master" --core "/home/yugabyte/cores/core_31280_1711999503_!home!yugabyte!yb-software!yugabyte-2.23.0.0-b91-almalinux8-aarch64!bin!yb-server"
Core file '/home/yugabyte/cores/core_31280_1711999503_!home!yugabyte!yb-software!yugabyte-2.23.0.0-b91-almalinux8-aarch64!bin!yb-server' (aarch64) was loaded.
(lldb) bt all
* thread #1, name = 'yb-master', stop reason = signal SIGSEGV
  * frame #0: 0x0000ffffa1729f60 libpthread.so.0`__pthread_mutex_lock + 16
    frame #1: 0x0000aaaae5ad57f4 yb-master`yb::master::CatalogManager::GetClusterConfig(yb::master::SysClusterConfigEntryPB*) [inlined] yb::Mutex::Acquire(this=0x0000000000000008) at mutex.cc:85:12
    frame #2: 0x0000aaaae5ad57f0 yb-master`yb::master::CatalogManager::GetClusterConfig(yb::master::SysClusterConfigEntryPB*) [inlined] yb::MutexLock::MutexLock(this=<unavailable>, lock=0x0000000000000008) at mutex.h:107:12
    frame #3: 0x0000aaaae5ad57f0 yb-master`yb::master::CatalogManager::GetClusterConfig(yb::master::SysClusterConfigEntryPB*) [inlined] yb::RWCLock::ReadLock(this=0x0000000000000008) at rwc_lock.cc:74:13
    frame #4: 0x0000aaaae5ad57f0 yb-master`yb::master::CatalogManager::GetClusterConfig(yb::master::SysClusterConfigEntryPB*) [inlined] yb::CowObject<yb::master::PersistentClusterConfigInfo>::ReadLock(this=0x0000000000000008) const at cow_object.h:59:11
    frame #5: 0x0000aaaae5ad57f0 yb-master`yb::master::CatalogManager::GetClusterConfig(yb::master::SysClusterConfigEntryPB*) [inlined] yb::CowReadLock<yb::master::PersistentClusterConfigInfo>::CowReadLock(this=0x0000ffff88976738, cow=0x0000000000000008) at cow_object.h:172:11
    frame #6: 0x0000aaaae5ad57e8 yb-master`yb::master::CatalogManager::GetClusterConfig(yb::master::SysClusterConfigEntryPB*) [inlined] yb::master::MetadataCowWrapper<yb::master::PersistentClusterConfigInfo>::LockForRead(this=0x0000000000000000) const at catalog_entity_base.h:84:41
    frame #7: 0x0000aaaae5ad57e4 yb-master`yb::master::CatalogManager::GetClusterConfig(this=<unavailable>, config=0x0000ffff88976798) at catalog_manager.cc:12480:28
    frame #8: 0x0000aaaae5c80900 yb-master`yb::master::BasicCollector::Collect(this=0x000024677f8d4440, collection_level=MEDIUM) at master_call_home.cc:37:48
    frame #9: 0x0000aaaae63c4354 yb-master`yb::rpc::ScheduledTask<yb::CallHome::ScheduleCallHome(int)::$_0>::Run(yb::Status const&) [inlined] yb::Collector::Run(this=0x000024677f8d4440, level=MEDIUM) at call_home.cc:75:5
    frame #10: 0x0000aaaae63c4278 yb-master`yb::rpc::ScheduledTask<yb::CallHome::ScheduleCallHome(int)::$_0>::Run(yb::Status const&) at call_home.cc:214:20
    frame #11: 0x0000aaaae63c41b0 yb-master`yb::rpc::ScheduledTask<yb::CallHome::ScheduleCallHome(int)::$_0>::Run(yb::Status const&) [inlined] yb::CallHome::BuildJsonAndSend(this=<unavailable>) at call_home.cc:242:15
    frame #12: 0x0000aaaae63c41b0 yb-master`yb::rpc::ScheduledTask<yb::CallHome::ScheduleCallHome(int)::$_0>::Run(yb::Status const&) [inlined] yb::CallHome::DoCallHome(this=0x000024677f88b2d0) at call_home.cc:249:7
    frame #13: 0x0000aaaae63c4194 yb-master`yb::rpc::ScheduledTask<yb::CallHome::ScheduleCallHome(int)::$_0>::Run(yb::Status const&) [inlined] yb::CallHome::ScheduleCallHome(int)::$_0::operator()(this=<unavailable>, status=<unavailable>) const at call_home.cc:273:9
    frame #14: 0x0000aaaae63c4194 yb-master`yb::rpc::ScheduledTask<yb::CallHome::ScheduleCallHome(int)::$_0>::Run(this=<unavailable>, status=<unavailable>) at scheduler.h:55:5
    frame #15: 0x0000aaaae631fc38 yb-master`boost::asio::detail::completion_handler<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::__1::allocator<void>, 0ul>>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) [inlined] yb::rpc::Scheduler::Impl::HandleTimer(this=0x0000ffff88976a00)::'lambda'()::operator()() const at scheduler.cc:140:57
    frame #16: 0x0000aaaae631fc20 yb-master`boost::asio::detail::completion_handler<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::__1::allocator<void>, 0ul>>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) [inlined] void boost::asio::asio_handler_invoke<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(function=0x0000ffff88976a00) at handler_invoke_hook.hpp:88:3
    frame #17: 0x0000aaaae631fc20 yb-master`boost::asio::detail::completion_handler<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::__1::allocator<void>, 0ul>>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) [inlined] void boost_asio_handler_invoke_helpers::invoke<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(function=0x0000ffff88976a00, context=0x0000ffff88976a00) at handler_invoke_helpers.hpp:54:3
    frame #18: 0x0000aaaae631fc20 yb-master`boost::asio::detail::completion_handler<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::__1::allocator<void>, 0ul>>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) [inlined] void boost::asio::detail::handler_work<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::__1::allocator<void>, 0ul>, void>::complete<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'()>(this=<unavailable>, function=0x0000ffff88976a00, handler=0x0000ffff88976a00) at handler_work.hpp:520:7
    frame #19: 0x0000aaaae631fc20 yb-master`boost::asio::detail::completion_handler<yb::rpc::Scheduler::Impl::HandleTimer(boost::system::error_code const&)::'lambda'(), boost::asio::io_context::basic_executor_type<std::__1::allocator<void>, 0ul>>::do_complete(owner=0x000024677fc34220, base=0x000024677fc3ba40, (null)=<unavailable>, (null)=<unavailable>) at completion_handler.hpp:74:9
    frame #20: 0x0000aaaae62da52c yb-master`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::scheduler_operation::complete(this=0x000024677fc3ba40, owner=0x000024677fc34220, ec=0x0000ffff88976ba0, bytes_transferred=<unavailable>) at scheduler_operation.hpp:40:5
    frame #21: 0x0000aaaae62da518 yb-master`boost::asio::detail::scheduler::run(boost::system::error_code&) at scheduler.ipp:492:12
    frame #22: 0x0000aaaae62da254 yb-master`boost::asio::detail::scheduler::run(this=0x000024677fc34220, ec=0x0000ffff88976ba0) at scheduler.ipp:210:10
    frame #23: 0x0000aaaae62d9ab4 yb-master`yb::rpc::IoThreadPool::Impl::Execute() [inlined] boost::asio::io_context::run(this=<unavailable>, ec=0x0000ffff88976ba0) at io_context.ipp:71:16
    frame #24: 0x0000aaaae62d9ab0 yb-master`yb::rpc::IoThreadPool::Impl::Execute(this=<unavailable>) at io_thread_pool.cc:76:17
    frame #25: 0x0000aaaae6b99898 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:v170002](this=0x000024677dde0520) const at function.h:517:16
    frame #26: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000024677dde0520) const at function.h:1168:12
    frame #27: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(arg=0x000024677dde04c0) at thread.cc:853:3
    frame #28: 0x0000ffffa17278b8 libpthread.so.0`start_thread + 392
    frame #29: 0x0000ffffa1783afc libc.so.6`thread_start + 12
  thread #2, stop reason = signal 0
    frame #0: 0x0000ffffa172dc58 libpthread.so.0`pthread_cond_wait@@GLIBC_2.17 + 528
    frame #1: 0x0000aaaae6b9cd44 yb-master`yb::ThreadPool::DispatchThread(bool) [inlined] yb::ConditionVariable::Wait(this=0x000024677fdf55d8) const at condition_variable.cc:80:12
    frame #2: 0x0000aaaae6b9cd38 yb-master`yb::ThreadPool::DispatchThread(this=0x000024677fdf5500, permanent=true) at threadpool.cc:561:20
    frame #3: 0x0000aaaae6b99898 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:v170002](this=0x000024677fdb69e0) const at function.h:517:16
    frame #4: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000024677fdb69e0) const at function.h:1168:12
    frame #5: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(arg=0x000024677fdb6980) at thread.cc:853:3
    frame #6: 0x0000ffffa17278b8 libpthread.so.0`start_thread + 392
    frame #7: 0x0000ffffa1783afc libc.so.6`thread_start + 12
  thread #3, stop reason = signal 0
    frame #0: 0x0000ffffa172dc58 libpthread.so.0`pthread_cond_wait@@GLIBC_2.17 + 528
    frame #1: 0x0000aaaae62da338 yb-master`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] void boost::asio::detail::posix_event::wait<boost::asio::detail::conditionally_enabled_mutex::scoped_lock>(this=0x000024677fc34188, lock=0x0000ffff9dc16a68) at posix_event.hpp:119:7
    frame #2: 0x0000aaaae62da320 yb-master`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::conditionally_enabled_event::wait(this=0x000024677fc34180, lock=0x0000ffff9dc16a68) at conditionally_enabled_event.hpp:97:14
    frame #3: 0x0000aaaae62da318 yb-master`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::scheduler::do_run_one(this=0x000024677fc34110, lock=0x0000ffff9dc16a68, this_thread=0x0000ffff9dc16a90, ec=0x0000ffff9dc16ba0) at scheduler.ipp:501:21
    frame #4: 0x0000aaaae62da254 yb-master`boost::asio::detail::scheduler::run(this=0x000024677fc34110, ec=0x0000ffff9dc16ba0) at scheduler.ipp:210:10
    frame #5: 0x0000aaaae62d9ab4 yb-master`yb::rpc::IoThreadPool::Impl::Execute() [inlined] boost::asio::io_context::run(this=<unavailable>, ec=0x0000ffff9dc16ba0) at io_context.ipp:71:16
    frame #6: 0x0000aaaae62d9ab0 yb-master`yb::rpc::IoThreadPool::Impl::Execute(this=<unavailable>) at io_thread_pool.cc:76:17
    frame #7: 0x0000aaaae6b99898 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:v170002](this=0x000024677fdb6060) const at function.h:517:16
    frame #8: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000024677fdb6060) const at function.h:1168:12
    frame #9: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(arg=0x000024677fdb6000) at thread.cc:853:3
    frame #10: 0x0000ffffa17278b8 libpthread.so.0`start_thread + 392
    frame #11: 0x0000ffffa1783afc libc.so.6`thread_start + 12
  thread #4, stop reason = signal 0
    frame #0: 0x0000ffffa18341e8 libc.so.6`epoll_pwait + 136
    frame #1: 0x0000aaaae62dd384 yb-master`boost::asio::detail::epoll_reactor::run(this=0x000024677fd06960, usec=<unavailable>, ops=0x0000ffff9ec36af0) at epoll_reactor.ipp:501:20
    frame #2: 0x0000aaaae62da390 yb-master`boost::asio::detail::scheduler::run(boost::system::error_code&) at scheduler.ipp:476:16
    frame #3: 0x0000aaaae62da254 yb-master`boost::asio::detail::scheduler::run(this=0x000024677fc34110, ec=0x0000ffff9ec36ba0) at scheduler.ipp:210:10
    frame #4: 0x0000aaaae62d9ab4 yb-master`yb::rpc::IoThreadPool::Impl::Execute() [inlined] boost::asio::io_context::run(this=<unavailable>, ec=0x0000ffff9ec36ba0) at io_context.ipp:71:16
    frame #5: 0x0000aaaae62d9ab0 yb-master`yb::rpc::IoThreadPool::Impl::Execute(this=<unavailable>) at io_thread_pool.cc:76:17
    frame #6: 0x0000aaaae6b99898 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:v170002](this=0x000024677fdb63f0) const at function.h:517:16
    frame #7: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000024677fdb63f0) const at function.h:1168:12
    frame #8: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(arg=0x000024677fdb6390) at thread.cc:853:3
    frame #9: 0x0000ffffa17278b8 libpthread.so.0`start_thread + 392
    frame #10: 0x0000ffffa1783afc libc.so.6`thread_start + 12
  thread #5, stop reason = signal 0
    frame #0: 0x0000aaaae574f7d8 yb-master`yb::docdb::(anonymous namespace)::DocDBCompactionFeed::Feed(yb::Slice const&, yb::Slice const&) [inlined] yb::EncodedDocHybridTime::AsSlice(this=0x0000ffff80954910) const at doc_hybrid_time.h:56:34
    frame #1: 0x0000aaaae574f7d8 yb-master`yb::docdb::(anonymous namespace)::DocDBCompactionFeed::Feed(yb::Slice const&, yb::Slice const&) [inlined] yb::operator<=>(lhs=0x0000ffff80954910, rhs=0x0000ffff809548f0) at doc_hybrid_time.h:93:32
    frame #2: 0x0000aaaae574f7d4 yb-master`yb::docdb::(anonymous namespace)::DocDBCompactionFeed::Feed(this=0x000024677502c000, internal_key=0x000024676716d270, value=<unavailable>) at docdb_compaction_context.cc:871:30
    frame #3: 0x0000aaaae6160a94 yb-master`rocksdb::CompactionJob::ProcessKeyValueCompaction(this=<unavailable>, holder=0x0000ffff80955700, sub_compact=0x0000246772ca81a0) at compaction_job.cc:726:35
    frame #4: 0x0000aaaae615ea8c yb-master`rocksdb::CompactionJob::Run(this=<unavailable>) at compaction_job.cc:482:3
    frame #5: 0x0000aaaae618bfb0 yb-master`rocksdb::DBImpl::BackgroundCompaction(this=0x0000246763f37f80, made_progress=<unavailable>, job_context=0x0000ffff809568e8, log_buffer=<unavailable>, manual_compaction=<unavailable>, compaction=<unavailable>) at db_impl.cc:3784:29
    frame #6: 0x0000aaaae618a6c0 yb-master`rocksdb::DBImpl::BackgroundCallCompaction(this=0x0000246763f37f80, m=0x0000000000000000, compaction=nullptr, compaction_task=<unavailable>) at db_impl.cc:3510:32
    frame #7: 0x0000aaaae619b1f0 yb-master`rocksdb::DBImpl::CompactionTask::DoRun(this=<unavailable>, suspender=<unavailable>) at db_impl.cc:319:15
    frame #8: 0x0000aaaae6b73460 yb-master`yb::(anonymous namespace)::PriorityThreadPoolWorker::Run(this=0x000024677978c128) at priority_thread_pool.cc:286:26
    frame #9: 0x0000aaaae6b99898 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:v170002](this=0x0000246778358b10) const at function.h:517:16
    frame #10: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x0000246778358b10) const at function.h:1168:12
    frame #11: 0x0000aaaae6b99884 yb-master`yb::Thread::SuperviseThread(arg=0x0000246778358ab0) at thread.cc:853:3
    frame #12: 0x0000ffffa17278b8 libpthread.so.0`start_thread + 392
    frame #13: 0x0000ffffa1783afc libc.so.6`thread_start + 12
  thread #6, stop reason = signal 0

G-flags:

tserver_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "ysql_max_connections": "500",
    "client_read_write_timeout_ms": str(30 * 60 * 1000),
    "yb_client_admin_operation_timeout_sec": str(30 * 60),
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
    "tablet_replicas_per_gib_limit": 0,
    "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true"
},
master_gflags={
    "ysql_enable_packed_row": "true",
    "ysql_enable_packed_row_for_colocated_table": "true",
    "enable_automatic_tablet_splitting": "true",
    "tablet_split_high_phase_shard_count_per_node": 20000,
    "tablet_split_high_phase_size_threshold_bytes": 2097152,  # 2 MB
    "tablet_split_low_phase_size_threshold_bytes": 102400,  # 100 KB
    "tablet_split_low_phase_shard_count_per_node": 10000,
    "consistent_restore": "true",
    "ysql_enable_db_catalog_version_mode": "true",
    "allowed_preview_flags_csv": "ysql_enable_db_catalog_version_mode",
    "tablet_replicas_per_gib_limit": 0,
    "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true"
}

Issue Type

kind/bug


shishir2001-yb commented 6 months ago

cc: @druzac

druzac commented 6 months ago

The collector's call to GetClusterConfig gets a nullptr when it tries to read CatalogManager::cluster_config_ [0]. According to a comment [1], code that calls GetClusterConfig is supposed to hold a scoped shared leader lock while doing so, but the collector code doesn't. The catalog reload code explicitly nulls out this value [2], so there's a race here. Any test that repeatedly does a point-in-time restore is more likely to trigger it, but as far as I can tell the race has been in the code for years. A minimal sketch of the racy vs. guarded pattern is shown after the references below.

[0]

    frame #6: 0x0000aaaae5ad57e8 yb-master`yb::master::CatalogManager::GetClusterConfig(yb::master::SysClusterConfigEntryPB*) [inlined] yb::master::MetadataCowWrapper<yb::master::PersistentClusterConfigInfo>::LockForRead(this=0x0000000000000000) const at catalog_entity_base.h:84:41
    frame #7: 0x0000aaaae5ad57e4 yb-master`yb::master::CatalogManager::GetClusterConfig(this=<unavailable>, config=0x0000ffff88976798) at catalog_manager.cc:12480:28
    frame #8: 0x0000aaaae5c80900 yb-master`yb::master::BasicCollector::Collect(this=0x000024677f8d4440, collection_level=MEDIUM) at master_call_home.cc:37:48

[1] https://github.com/yugabyte/yugabyte-db/blob/7ba576df60d0a39068c5c949ba9300430127b4aa/src/yb/master/catalog_manager.h#L2306-L2318

[2] https://github.com/yugabyte/yugabyte-db/blob/7ba576df60d0a39068c5c949ba9300430127b4aa/src/yb/master/catalog_manager.cc#L1324
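To make the race concrete, here is a minimal, self-contained C++ sketch of the pattern described above. The names used (MiniCatalogManager, ReloadSysCatalog, GetVersionUnsafe, GetVersionGuarded, and the std::shared_mutex standing in for the scoped leader shared lock) are hypothetical, not YugabyteDB's actual classes; the sketch only models the idea that the reload path resets the shared cluster config object while a reader dereferences it without holding the leader shared lock. The near-null addresses in the backtrace (this=0x0 in frame #6, this=0x8 in the mutex frames) are consistent with LockForRead() being invoked on a null object.

    // Hypothetical model of the race; not YugabyteDB source code.
    #include <iostream>
    #include <memory>
    #include <mutex>
    #include <shared_mutex>

    struct ClusterConfigPB { int version = 0; };

    class MiniCatalogManager {
     public:
      // Reload path: analogous to the sys catalog reload resetting
      // cluster_config_ before rebuilding it.
      void ReloadSysCatalog() {
        std::unique_lock<std::shared_mutex> l(leader_lock_);
        cluster_config_.reset();  // window where the pointer is null
        cluster_config_ = std::make_shared<ClusterConfigPB>();
      }

      // Racy pattern: reading the config without the leader shared lock.
      // If a reload is in flight, this dereferences a null pointer (SIGSEGV).
      int GetVersionUnsafe() const {
        return cluster_config_->version;
      }

      // Guarded pattern: hold the shared lock for the duration of the read,
      // as the comment in catalog_manager.h asks callers to do, and fail
      // gracefully if the config is not loaded.
      bool GetVersionGuarded(int* out) const {
        std::shared_lock<std::shared_mutex> l(leader_lock_);
        if (!cluster_config_) return false;  // mid-reload / not leader
        *out = cluster_config_->version;
        return true;
      }

     private:
      mutable std::shared_mutex leader_lock_;  // stand-in for the scoped leader shared lock
      std::shared_ptr<ClusterConfigPB> cluster_config_;
    };

    int main() {
      MiniCatalogManager cm;
      cm.ReloadSysCatalog();
      int version = 0;
      if (cm.GetVersionGuarded(&version)) {
        std::cout << "cluster config version: " << version << std::endl;
      } else {
        std::cout << "cluster config not loaded" << std::endl;
      }
      return 0;
    }

Under this model, either the collector needs to acquire the scoped leader shared lock before calling GetClusterConfig, or GetClusterConfig itself needs to tolerate a null cluster_config_ and return an error rather than dereferencing it.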