zettadb / cluster_mgr

Cluster_mgr is an important component of KunlunBase. It provides an HTTP API for cluster management, provisioning, and monitoring, so that users can install a cluster, a kunlun-server node, a storage shard, or a kunlun-storage node by calling these APIs. This lets users integrate KunlunBase management and provisioning into their existing applications or GUIs. Cluster_mgr also performs other important background maintenance work to ensure that the KunlunBase clusters it serves run efficiently and reliably.
http://www.kunlunbase.com
Apache License 2.0

cluster_mgr crash during the call to get_cluster_detail #33

Open jd-zhang opened 2 years ago

jd-zhang commented 2 years ago

Issue migrated from trac ticket # 798

component: cluster manager | priority: major

2022-06-06 17:19:11: zhangjindong@zettadb.com created the issue


stack:

(gdb) p this
$1 = (kunlun::PGConnection * const) 0x0
(gdb) bt
#0  0x000000000066c458 in kunlun::PGConnection::CheckIsConnected (this=0x0)
    at /home/kunlun/zettalib/src/op_pg/op_pg.cc:60
#1  0x0000000000500f92 in Computer_node::connect_status (this=0x38cd230)
    at /home/kunlun/debugbuild/cluster_mgr/src/kl_mentain/kl_cluster.h:174
#2  0x00000000004fb270 in System::get_cluster_detail (this=0x382f5b0, paras=..., attachment=...)
    at /home/kunlun/debugbuild/cluster_mgr/src/kl_mentain/sys.cc:1926
#3  0x000000000061ee25 in kunlun::SyncMission::GetClusterDetail (this=0x7f09e0049f50)
    at /home/kunlun/debugbuild/cluster_mgr/src/sync_mission/sync_mission.cc:190
#4  0x000000000052a7e1 in kunlun::SyncMission::SyncTaskImpl (this=0x7f09e0049f50)
    at /home/kunlun/debugbuild/cluster_mgr/src/sync_mission/sync_mission.h:52
#5  0x0000000000529f4d in MissionRequest::SetUpSyncTaskImpl (this=0x7f09e0049f50)
    at /home/kunlun/debugbuild/cluster_mgr/src/request_framework/missionRequest.h:24
#6  0x000000000052681c in HttpServiceImpl::Emit (this=0x387bf30, cntl_base=0x7f09e0049ba0,
    request=0x7f09e0046570, response=0x7f09e0047df0, done=0x7f09dc053380)

This happens on the first call to get_cluster_detail right after a create_cluster operation. After create_cluster, the entries in cluster->computer_nodes have not yet connected to the pg nodes, so gpsqlconn is still a null pointer.
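
A minimal sketch of the kind of guard that would avoid this crash, assuming the connection is held through a pointer that stays null until the first successful connect. `Connection`, `ComputerNode` and their members are hypothetical stand-ins for illustration, not the actual cluster_mgr or zettalib classes.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a PostgreSQL connection wrapper.
struct Connection {
    bool connected = false;
    bool CheckIsConnected() const { return connected; }
};

// Hypothetical stand-in for a computer node whose connection is created lazily.
class ComputerNode {
public:
    // Simulates a successful connect to the pg node.
    void connect() {
        conn_ = std::make_unique<Connection>();
        conn_->connected = true;
    }

    // Reports "offline" instead of calling through a null connection pointer.
    std::string connect_status() const {
        if (!conn_) {
            return "offline";
        }
        return conn_->CheckIsConnected() ? "online" : "offline";
    }

private:
    std::unique_ptr<Connection> conn_;  // null until connect() succeeds
};
```

With this shape, a freshly created node reports "offline" from `connect_status()` and only reports "online" after `connect()` has run, so a status query issued right after cluster creation has nothing null to dereference.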

jd-zhang commented 2 years ago

2022-06-08 14:01:06: zhangjindong@zettadb.com commented


Another stack:

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
[Current thread is 1 (Thread 0x7f1af57fa700 (LWP 14969))]
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f1b5563e859 in __GI_abort () at abort.c:79
#2  0x00007f1b556a926e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f1b557d3298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f1b556b12fc in malloc_printerr (str=str@entry=0x7f1b557d55d0 "free(): double free detected in tcache 2") at malloc.c:5347
#4  0x00007f1b556b2f6d in _int_free (av=0x7f1aec000020, p=0x7f1aec001b90, have_lock=0) at malloc.c:4201
#5  0x0000000000adc5b9 in mysql_free_result (result=0x7f1abc002c40) at /home/kunlun/zettalib/src/vendor/mariadb-10.6.7/libmariadb/libmariadb/mariadb_lib.c:597
#6  0x000000000054a7e1 in kunlun_rbr::CAsyncMysqlManager::MysqlSocketResult (this=0x7f1af0000b60, cmysql=0x7f1af001b560, event=1) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:843
#7  0x000000000054b525 in operator() (__closure=0x7f1aec0022f8) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:1087
#8  0x0000000000552a13 in std::__invoke_impl<void, kunlun_rbr::CAsyncMysqlManager::run()::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:60
#9  0x0000000000552614 in std::__invoke<kunlun_rbr::CAsyncMysqlManager::run()::<lambda()>&>(struct {...} &) (__fn=...) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:95
#10 0x0000000000552098 in std::_Bind<kunlun_rbr::CAsyncMysqlManager::run()::<lambda()>()>::__call<void>(std::tuple<> &&, std::_Index_tuple<>) (this=0x7f1aec0022f8, __args=...) at /opt/rh/devtoolset-10/root/usr/include/c++/10/functional:416

And

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f1b5563e859 in __GI_abort () at abort.c:79
#2  0x00007f1b556a926e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f1b557d3298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f1b556b12fc in malloc_printerr (str=str@entry=0x7f1b557d5628 "double free or corruption (fasttop)") at malloc.c:5347
#4  0x00007f1b556b2c65 in _int_free (av=0x7f1ae8000020, p=0x7f1ae8001b90, have_lock=0) at malloc.c:4266
#5  0x0000000000adc5b9 in mysql_free_result (result=0x7f1abc002f50) at /home/kunlun/zettalib/src/vendor/mariadb-10.6.7/libmariadb/libmariadb/mariadb_lib.c:597
#6  0x000000000054a445 in kunlun_rbr::CAsyncMysqlManager::RetryFetchRow (this=0x7f1af4000b60, cmysql=0x7f1af4032950) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:792
#7  0x0000000000549bc4 in kunlun_rbr::CAsyncMysqlManager::ParseMysqlResult (this=0x7f1af4000b60, cmysql=0x7f1af4032950) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:671
#8  0x000000000054a91e in kunlun_rbr::CAsyncMysqlManager::MysqlSocketResult (this=0x7f1af4000b60, cmysql=0x7f1af4032950, event=1) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:857
#9  0x000000000054b525 in operator() (__closure=0x7f1ae8002398) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:1087
#10 0x0000000000552a13 in std::__invoke_impl<void, kunlun_rbr::CAsyncMysqlManager::run()::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/invoke.h:60

And

[Current thread is 1 (Thread 0x7f1af17fa700 (LWP 15399))]
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f1b5563e859 in __GI_abort () at abort.c:79
#2  0x00007f1b556a926e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f1b557d3298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f1b556b12fc in malloc_printerr (str=str@entry=0x7f1b557d55d0 "free(): double free detected in tcache 2") at malloc.c:5347
#4  0x00007f1b556b2f6d in _int_free (av=0x7f1aec000020, p=0x7f1aec001ac0, have_lock=0) at malloc.c:4201
#5  0x000000000055c298 in std::__future_base::_Result<void>::~_Result (this=0x7f1aec001ad0, __in_chrg=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/future:658
#6  0x000000000055339a in std::__future_base::_Result<void>::_M_destroy (this=0x7f1aec001ad0) at /opt/rh/devtoolset-10/root/usr/include/c++/10/future:663
#7  0x0000000000552dc7 in std::__future_base::_Result_base::_Deleter::operator() (this=0x7f1aec002388, __fr=0x7f1aec001ad0) at /opt/rh/devtoolset-10/root/usr/include/c++/10/future:213
#8  0x0000000000553e78 in std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>::~unique_ptr (this=0x7f1aec002388, __in_chrg=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/unique_ptr.h:361
#9  0x000000000055c068 in std::__future_base::_State_baseV2::~_State_baseV2 (this=0x7f1aec002380, __in_chrg=<optimized out>) at /opt/rh/devtoolset-10/root/usr/include/c++/10/future:328

And

[Current thread is 1 (Thread 0x7f1af27fc700 (LWP 15604))]
#0  tcache_get (tc_idx=<optimized out>) at malloc.c:2937
#1  __GI___libc_malloc (bytes=8) at malloc.c:3051
#2  0x00007f1b55a24b39 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x000000000055a6ae in __gnu_cxx::new_allocator<kunlun_rbr::CRowResult*>::allocate (this=0x7f1af8032a20, __n=1) at /opt/rh/devtoolset-10/root/usr/include/c++/10/ext/new_allocator.h:115
#4  0x0000000000559dbc in std::allocator_traits<std::allocator<kunlun_rbr::CRowResult*> >::allocate (__a=..., __n=1) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/alloc_traits.h:460
#5  0x00000000005588a0 in std::_Vector_base<kunlun_rbr::CRowResult*, std::allocator<kunlun_rbr::CRowResult*> >::_M_allocate (this=0x7f1af8032a20, __n=1) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/stl_vector.h:346
#6  0x0000000000556648 in std::vector<kunlun_rbr::CRowResult*, std::allocator<kunlun_rbr::CRowResult*> >::_M_realloc_insert<kunlun_rbr::CRowResult*&> (this=0x7f1af8032a20, __position=non-dereferenceable iterator for std::vector) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/vector.tcc:440
#7  0x0000000000554fba in std::vector<kunlun_rbr::CRowResult*, std::allocator<kunlun_rbr::CRowResult*> >::emplace_back<kunlun_rbr::CRowResult*&> (this=0x7f1af8032a20) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/vector.tcc:121
#8  0x0000000000549731 in kunlun_rbr::AsyncMysqlResult::ParseResult (this=0x7f1af8032a10, result=0x7f1ac8002cc0, row=@0x7f1af8032978: 0x7f1aec001af0) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:594
#9  0x0000000000549bab in kunlun_rbr::CAsyncMysqlManager::ParseMysqlResult (this=0x7f1af8000b60, cmysql=0x7f1af8032950) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:668
#10 0x000000000054a91e in kunlun_rbr::CAsyncMysqlManager::MysqlSocketResult (this=0x7f1af8000b60, cmysql=0x7f1af8032950, event=1) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:857

And

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
[Current thread is 1 (Thread 0x7f1af2ffd700 (LWP 15739))]
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f1b5563e859 in __GI_abort () at abort.c:79
#2  0x00007f1b556a926e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f1b557d3298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f1b556b12fc in malloc_printerr (str=str@entry=0x7f1b557d5ad8 "malloc(): unsorted double linked list corrupted") at malloc.c:5347
#4  0x00007f1b556b42ec in _int_malloc (av=av@entry=0x7f1abc000020, bytes=bytes@entry=69) at malloc.c:3744
#5  0x00007f1b556b6299 in __GI___libc_malloc (bytes=69) at malloc.c:3066
#6  0x00007f1b55a24b39 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6

jd-zhang commented 2 years ago

2022-06-08 14:45:40: snow@zettadb.com changed owner from snow to barney

jd-zhang commented 2 years ago

2022-06-09 15:51:16: zhangjindong@zettadb.com commented


New stack:

(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x100000053) at ../nptl/pthread_mutex_lock.c:67
#1  0x00000000004416b5 in __gthread_mutex_lock (__mutex=0x100000053) at /opt/rh/devtoolset-10/root/usr/include/c++/10/x86_64-redhat-linux/bits/gthr-default.h:749
#2  0x0000000000442e82 in std::mutex::lock (this=0x100000053) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:100
#3  0x0000000000450cd4 in std::lock_guard<std::mutex>::lock_guard (this=0x7ffe95b79908, __m=...) at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:159
#4  0x00000000004b5257 in Shard_node::send_stmt (this=0x100000003, Python Exception <class 'gdb.error'> No type named class std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep.:
stmt=, result=0x7ffe95b7a5d0, nretries=2) at /home/kunlun/debugbuild/cluster_mgr/src/kl_mentain/shard.cc:515
#5  0x000000000052f113 in GlobalNodeChannelManager::initNodeChannelMap (this=0x24e6520) at /home/kunlun/debugbuild/cluster_mgr/src/http_server/node_channel.cc:76
#6  0x000000000052f8fe in GlobalNodeChannelManager::Init (this=0x24e6520) at /home/kunlun/debugbuild/cluster_mgr/src/http_server/node_channel.cc:139
#7  0x0000000000441c56 in main (argc=2, argv=0x7ffe95b7c008) at /home/kunlun/debugbuild/cluster_mgr/src/main.cc:102
(gdb) frame 4
#4  0x00000000004b5257 in Shard_node::send_stmt (this=0x100000003, stmt="SELECT * FROM `kunlun_metadata_db`.`server_nodes`", result=0x7ffe95b7a5d0, nretries=2)
    at /home/kunlun/debugbuild/cluster_mgr/src/kl_mentain/shard.cc:515
515     /home/kunlun/debugbuild/cluster_mgr/src/kl_mentain/shard.cc: No such file or directory.
(gdb) p sql_mux
Cannot access memory at address 0x100000053

Judging from shard.h and shard.cc, sql_mux does not seem to be initialized.
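
The addresses in the trace allow a second reading: the mutex address 0x100000053 is exactly `this` (0x100000003) plus 0x50, which is what a member access would compute if `send_stmt` were called through an invalid `Shard_node` pointer. A small sketch of that arithmetic follows. `Node` and its layout are hypothetical, with the padding chosen only so that the mutex lands at offset 0x50 on a typical x86_64 Linux build; the real `Shard_node` layout is not shown in this ticket.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical layout: 0x50 bytes of other members, then the mutex.
struct Node {
    alignas(8) unsigned char other_members[0x50];
    std::mutex sql_mux;
};

// Address that a member access would compute for sql_mux, given the raw
// value of `this`. Pure integer arithmetic; nothing is dereferenced.
inline std::uint64_t mutex_address(std::uint64_t this_value) {
    return this_value + offsetof(Node, sql_mux);
}
```

Under these assumptions, `mutex_address(0x100000003)` yields 0x100000053, the address gdb could not read, and a null `this` would yield 0x50. If that reading is right, the place to check is the caller that obtained the node pointer, in addition to the mutex itself.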

jd-zhang commented 2 years ago

2022-06-14 17:43:19: zhangjindong@zettadb.com commented


Another stack showing a double free:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fbd7c54e859 in __GI_abort () at abort.c:79
#2  0x00007fbd7c5b926e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fbd7c6e3298 "%s\n")
    at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007fbd7c5c12fc in malloc_printerr (str=str@entry=0x7fbd7c6e5628 "double free or corruption (fasttop)")
    at malloc.c:5347
#4  0x00007fbd7c5c2c65 in _int_free (av=0x7fbcf0000020, p=0x7fbcf0002b80, have_lock=0) at malloc.c:4266
#5  0x0000000000aeb519 in mysql_free_result (result=0x7fbcf00036c0)
    at /home/kunlun/zettalib/src/vendor/mariadb-10.6.7/libmariadb/libmariadb/mariadb_lib.c:597
#6  0x0000000000549183 in kunlun_rbr::SendNextPendingSql (cmysql=0x7fbd18060920)
    at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:210
#7  0x000000000054b935 in kunlun_rbr::CAsyncMysqlManager::RetryFetchRow (this=0x7fbd18000b60,
    cmysql=0x7fbd18060920) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:793
#8  0x000000000054b0b8 in kunlun_rbr::CAsyncMysqlManager::ParseMysqlResult (this=0x7fbd18000b60,
    cmysql=0x7fbd18060920) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:671
#9  0x000000000054bdef in kunlun_rbr::CAsyncMysqlManager::MysqlSocketResult (this=0x7fbd18000b60,
    cmysql=0x7fbd18060920, event=1) at /home/kunlun/debugbuild/cluster_mgr/src/cluster_rbr/async_mysql.cc:857
#10 0x000000000054c9e3 in operator() (__closure=0x7fbd10002548)
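
Across the traces in this ticket, `mysql_free_result` is reached from three different call sites (`MysqlSocketResult`, `RetryFetchRow` and `SendNextPendingSql`). One common way to rule out a result being released twice along such paths is to give the handle a single owner. The sketch below shows the idea with hypothetical stand-ins: `Result`, `free_result` and `AsyncConn` are illustrative only and do not reflect the real `MYSQL_RES` API or the code in async_mysql.cc.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a C result handle such as MYSQL_RES.
struct Result {
    int rows;
};

// Counts releases so the behaviour can be checked.
inline int& free_count() {
    static int n = 0;
    return n;
}

// Hypothetical stand-in for the C library's free function.
inline void free_result(Result* r) {
    ++free_count();
    delete r;
}

// Deleter that routes unique_ptr destruction through the free function.
struct ResultDeleter {
    void operator()(Result* r) const { free_result(r); }
};
using ResultPtr = std::unique_ptr<Result, ResultDeleter>;

// Hypothetical per-connection state with one owner for the current result.
struct AsyncConn {
    ResultPtr result;

    // Takes ownership of a new result; any previous result is freed once.
    void store(Result* r) { result.reset(r); }

    // Frees the current result if there is one; repeated calls are no-ops.
    void release() { result.reset(); }
};
```

Because `release()` nulls the owner after freeing, calling it from several code paths frees each result at most once. This does not by itself address concurrent access to the same connection object from several threads, which would still need its own synchronization.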

jd-zhang commented 2 years ago

2022-06-17 11:10:38: zhangjindong@zettadb.com commented


A new stack:

(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x50) at ../nptl/pthread_mutex_lock.c:67
#1  0x00000000004416b5 in __gthread_mutex_lock (__mutex=0x50)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/x86_64-redhat-linux/bits/gthr-default.h:749
#2  0x0000000000442e82 in std::mutex::lock (this=0x50)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:100
#3  0x0000000000450cd4 in std::lock_guard<std::mutex>::lock_guard (this=0x7fffb332e298, __m=...)
    at /opt/rh/devtoolset-10/root/usr/include/c++/10/bits/std_mutex.h:159
#4  0x00000000004b525d in Shard_node::send_stmt (this=0x0, Python Exception <class 'gdb.error'> No type named class std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep.:
stmt=, result=0x7fffb332e430, nretries=2)
    at /home/kunlun/debugbuild/cluster_mgr/src/kl_mentain/shard.cc:515
#5  0x00000000004bb611 in MetadataShard::refresh_shards (this=0x258ed90,
    kl_clusters=std::vector of length 0, capacity 0)
    at /home/kunlun/debugbuild/cluster_mgr/src/kl_mentain/shard.cc:1692
#6  0x00000000004f2e81 in System::refresh_shards_from_metadata_server (this=0x258ed90)
    at /home/kunlun/debugbuild/cluster_mgr/src/kl_mentain/sys.cc:335
#7  0x0000000000441e41 in main (argc=2, argv=0x7fffb332fd78)
    at /home/kunlun/debugbuild/cluster_mgr/src/main.cc:122