With no cluster, cluster_mgr and node_mgr left idle overnight: cluster_mgr has no heartbeat, node_mgr has no log output

zettadb / cluster_mgr

Cluster_mgr is an important component of KunlunBase. It provides an HTTP API for KunlunBase users to do cluster management, provisioning, and monitoring work, so that users can install a cluster, a kunlun-server node, a storage shard, or a kunlun-storage node by calling these APIs. This capability lets users integrate KunlunBase management and provisioning into their existing applications or GUIs. Cluster_mgr also performs other important background cluster maintenance work to make sure the KunlunBase clusters it serves work efficiently and reliably.

http://www.kunlunbase.com

Apache License 2.0


With no cluster, cluster_mgr and node_mgr left idle overnight: cluster_mgr has no heartbeat, node_mgr has no log output #25

Open jd-zhang opened 2 years ago

jd-zhang commented 2 years ago

Issue migrated from trac ticket # 705

component: cluster manager | priority: major

2022-05-18 11:13:00: hellen@zettadb.com created the issue

1. Log of cluster_mgr:

Wed May 18 10:50:40 2022 tid:0x5e6b [INFO] [/home/kunlun/program_binaries/test_rbr/cluster_mgr_0513/src/http_server/http_server.cc:300 GenerateRequest]: Http post: { "version":"1.0", "job_id":"", "job_type":"create_cluster", "user_name":"kunlun_test", "timestamp":"202205131532", "paras":{ "nick_name":"rbrcluster001", "ha_mode":"rbr", "shards":"2", "nodes":"3", "comps":"1", "max_storage_size":"20", "max_connections":"6", "cpu_cores":"8", "innodb_size":"1", "dbcfg":"1", "machinelist": [ {"hostaddr":"192.168.0.129"} ] } }

2. At that time, node_mgr produced no log output.

jd-zhang commented 2 years ago

2022-05-18 11:14:15: hellen@zettadb.com commented

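Both mgrs were running and the metadata tables contained no cluster; they were left idle overnight. The next morning a create-cluster command was sent. cluster_mgr received it but did not write it to the database table cluster_general_job_log. At the time, the developers suspected that a fault in the metadata table was preventing the write, but after logging in to the metadata primary database, writes succeeded. When the create-cluster command was then sent again, it was written normally and the create-cluster action started.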

jd-zhang commented 2 years ago

2022-05-18 11:15:36: hellen@zettadb.com commented

The create-cluster request data looks like this:

{ "version":"1.0", "job_id":"", "job_type":"create_cluster", "user_name":"kunlun_test", "timestamp":"202205131532", "paras":{ "nick_name":"rbrcluster001", "ha_mode":"rbr", "shards":"2", "nodes":"3", "comps":"1", "max_storage_size":"20", "max_connections":"6", "cpu_cores":"8", "innodb_size":"1", "dbcfg":"1", "machinelist": [ {"hostaddr":"${node_mgr.1}"} ] } }
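For anyone trying to reproduce this, the request body above can be generated programmatically. The following is a minimal sketch in Python, assuming the field values quoted in this ticket; `build_create_cluster_request` is a hypothetical helper name, and the cluster_mgr HTTP address is not given in this thread, so the sketch only builds the JSON body and does not send it.

```python
import json


def build_create_cluster_request(nick_name, hostaddrs, ha_mode="rbr"):
    """Build a create_cluster body shaped like the one quoted in this ticket.

    All fixed values are copied from the ticket; the function name and its
    parameters are illustrative assumptions, not part of cluster_mgr.
    """
    return {
        "version": "1.0",
        "job_id": "",
        "job_type": "create_cluster",
        "user_name": "kunlun_test",
        "timestamp": "202205131532",
        "paras": {
            "nick_name": nick_name,
            "ha_mode": ha_mode,
            "shards": "2",
            "nodes": "3",
            "comps": "1",
            "max_storage_size": "20",
            "max_connections": "6",
            "cpu_cores": "8",
            "innodb_size": "1",
            "dbcfg": "1",
            # One entry per machine, matching the machinelist layout above.
            "machinelist": [{"hostaddr": h} for h in hostaddrs],
        },
    }


# Serialize the body; posting it to cluster_mgr is left to the caller.
body = json.dumps(build_create_cluster_request("rbrcluster001", ["192.168.0.129"]))
```

Note that every numeric parameter is passed as a string, as in the original request.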

jd-zhang commented 2 years ago

2022-05-19 09:41:04: hellen@zettadb.com commented

Reproduced this problem on the night of the 18th. After sending a create rbr cluster request, the API returned:

{"attachment":null,"error_code":"1","error_info":"execute query failed [this lead to connection closed]: , error number: 2006, sql: begin","status":"failed","version":"1.0"}
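Error number 2006 is the MySQL client error CR_SERVER_GONE_ERROR ("MySQL server has gone away"), which commonly shows up when the server has closed a connection that sat idle. As a rough sketch, a test client could recognize this case from the API reply as follows; the helper names are hypothetical, and the regular expression assumes `error_info` keeps the "error number: N" wording shown above.

```python
import json
import re

CR_SERVER_GONE_ERROR = 2006  # MySQL client error: "MySQL server has gone away"


def parse_error_number(response_text):
    """Return the integer after 'error number:' in error_info, or None."""
    reply = json.loads(response_text)
    if reply.get("status") != "failed":
        return None
    match = re.search(r"error number:\s*(\d+)", reply.get("error_info") or "")
    return int(match.group(1)) if match else None


def is_connection_lost(response_text):
    """True when the reply reports a lost metadata database connection."""
    return parse_error_number(response_text) == CR_SERVER_GONE_ERROR


# The reply quoted in this comment, verbatim.
sample = (
    '{"attachment":null,"error_code":"1","error_info":"execute query failed '
    '[this lead to connection closed]: , error number: 2006, sql: begin",'
    '"status":"failed","version":"1.0"}'
)
print(is_connection_lost(sample))  # True
```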

At this point cluster_mgr had only this log output:

Thu May 19 09:36:40 2022 tid:0x5e63 [INFO] [/home/kunlun/program_binaries/test_rbr/cluster_mgr_0513/src/http_server/http_server.cc:300 GenerateRequest]: Http post: { "version":"1.0", "job_id":"", "job_type":"create_cluster", "user_name":"kunlun_test", "timestamp":"202205131532", "paras":{ "nick_name":"rbrcluster002", "ha_mode":"rbr", "shards":"2", "nodes":"3", "comps":"1", "max_storage_size":"20", "max_connections":"6", "cpu_cores":"8", "innodb_size":"1", "dbcfg":"1", "machinelist": [ {"hostaddr":"192.168.0.129"} ] } }

At this point node_mgr had no log output.

jd-zhang commented 2 years ago

2022-05-19 10:10:56: @chaojie1979 commented

The connection used for writing to the database was probably dropped; a retry mechanism will be added later.

jd-zhang commented 2 years ago

2022-05-20 10:16:59: @chaojie1979 commented

Add a retry mechanism inside zettalib; the previous interface configures the number of retries through statement_retries.
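The idea of retrying a statement after a lost connection can be illustrated with a short sketch. This is not zettalib's code: it is a generic reconnect-and-retry loop in Python, where `ConnectionLost`, `execute_with_retry`, `connect`, and `FakeConnection` are all hypothetical names, and only the parameter name `statement_retries` is taken from this thread.

```python
class ConnectionLost(Exception):
    """Stand-in for a driver error such as MySQL client error 2006."""


def execute_with_retry(connect, sql, statement_retries=3):
    """Run sql, reconnecting and retrying when the connection is lost.

    connect: hypothetical zero-argument factory returning an object that
             has an execute(sql) method.
    statement_retries: number of retries allowed after the first attempt.
    """
    conn = connect()
    for attempt in range(statement_retries + 1):
        try:
            return conn.execute(sql)
        except ConnectionLost:
            if attempt == statement_retries:
                raise  # retries exhausted, surface the error to the caller
            conn = connect()  # drop the stale handle and open a new one


class FakeConnection:
    """Test double: the first connection created is stale, later ones work."""

    created = 0

    def __init__(self):
        FakeConnection.created += 1
        self.stale = FakeConnection.created == 1

    def execute(self, sql):
        if self.stale:
            raise ConnectionLost("error number: 2006")
        return "ok: " + sql


print(execute_with_retry(FakeConnection, "begin"))  # ok: begin
```

Reconnecting before the retry matters here: re-running `begin` on the same closed handle would fail again in the same way.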

jd-zhang commented 2 years ago

2022-05-20 10:16:59: @chaojie1979 changed owner from chaojie to snow

jd-zhang commented 2 years ago

2022-05-20 10:19:16: hellen@zettadb.com commented

Third reproduction; cluster_mgr printed the following:

Fri May 20 09:46:35 2022 tid:0xf1f61 [ERROR] [/home/kunlun/program_binaries/test_rbr/cluster_mgr_0513/src/http_server/http_server.cc:341 GenerateRequestUniqueId]: execute query failed [this lead to connection closed]: , error number: 2006, sql: begin