wind-c / comqtt

A lightweight, high-performance Go MQTT server (v3.0 | v3.1.1 | v5.0) supporting distributed clustering
MIT License
877 stars 50 forks

Cross-machine clustering: two machines with public IPs form a cluster and both report "joined", but then "Suspect co-001 has failed, no acks received" is logged #9

Closed skygp closed 1 year ago

skygp commented 1 year ago

I currently have just two machines with public IPs and am trying to build a cluster across them.

conf.yml on the first machine:

cluster:
  node-name: co-009 # The name of this node. This must be unique in the cluster. If nodename is not set, use the local hostname.
  bind-port: 1886 # The port is used for both UDP and TCP gossip. Used for member discovery and communication.
  members: localhost:1886,172.17.224.50:1886,172.17.224.51:1886 # seeds member list, format such as 192.168.0.103:7946,192.168.0.104:7946
  queue-depth: 1024000 # size of Memberlist's internal channel which handles UDP messages.
  raft-port: 1887
  raft-dir: ./raft/node1

conf.yml on the other machine:

cluster:
  node-name: co-001 # The name of this node. This must be unique in the cluster. If nodename is not set, use the local hostname.
  bind-port: 1886 # The port is used for both UDP and TCP gossip. Used for member discovery and communication.
  members: localhost:1886,172.17.224.50:1886,172.17.224.51:1886 # seeds member list, format such as 192.168.0.103:7946,192.168.0.104:7946
  queue-depth: 1024000 # size of Memberlist's internal channel which handles UDP messages.
  raft-port: 1887
  raft-dir: ./raft/node1

In other words, the members setting is identical on both machines and lists the same IPs.

Error output:

A node has joined: co-009
A node has joined: co-001
Local member 172.17.224.50:1886
gogo BootstrapRaft
2022-10-11T10:20:51.812+0800 [INFO] raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:co-009 Address:172.17.224.50:1887}]"
2022-10-11T10:20:51.812+0800 [INFO] raft: entering follower state: follower="Node at 172.17.224.50:1887 [Follower]" leader-address= leader-id=
Cluster Node Created!
Mqtt Server Started!
2022-10-11T10:20:53.139+0800 [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2022-10-11T10:20:53.140+0800 [INFO] raft: entering candidate state: node="Node at 172.17.224.50:1887 [Candidate]" term=7
2022-10-11T10:20:53.146+0800 [INFO] raft: election won: tally=1
2022-10-11T10:20:53.146+0800 [INFO] raft: entering leader state: leader="Node at 172.17.224.50:1887 [Leader]"
2022/10/11 10:20:53 [INFO] memberlist: Suspect co-001 has failed, no acks received
2022/10/11 10:20:56 [INFO] memberlist: Suspect co-001 has failed, no acks received
2022/10/11 10:20:57 [INFO] memberlist: Marking co-001 as failed, suspect timeout reached (0 peer confirmations)
A node has left: co-001
2022/10/11 10:21:00 [INFO] memberlist: Suspect co-001 has failed, no acks received

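The nodes do report "joined", but there is no response afterwards. I'm not sure whether my config file is wrong; the error appears to come from internal code. If my conf.yml is misconfigured, I'd appreciate it if the author could show the correct way to configure it. Thanks.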

skygp commented 1 year ago

I configured members with public IPs, but the log prints a private IP.

wind-c commented 1 year ago

Try removing localhost:1886 from members in the config file.
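The suggestion above can be sketched as a small pre-flight check on the seed list. This is only an illustration: `filterSeeds` is a hypothetical helper and not part of comqtt, and it assumes the members value is a comma-separated list of host:port entries as in the conf.yml shown earlier.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// filterSeeds is a hypothetical helper (not part of comqtt). It splits a
// comma-separated members string and drops entries whose host is
// "localhost" or a loopback IP, keeping the remaining host:port seeds.
func filterSeeds(members string) []string {
	var kept []string
	for _, m := range strings.Split(members, ",") {
		m = strings.TrimSpace(m)
		if m == "" {
			continue
		}
		host, _, err := net.SplitHostPort(m)
		if err != nil {
			// Not in host:port form; skip it rather than guess.
			continue
		}
		if strings.EqualFold(host, "localhost") {
			continue
		}
		if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() {
			continue
		}
		kept = append(kept, m)
	}
	return kept
}

func main() {
	seeds := filterSeeds("localhost:1886,172.17.224.50:1886,172.17.224.51:1886")
	fmt.Println(strings.Join(seeds, ","))
}
```

With the members value from this thread, only the two 172.17.224.x seeds remain, which is the list that would go back into conf.yml.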

skygp commented 1 year ago

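I'll try the cross-machine cluster again, thanks. I also ran into another problem in recent testing. In a single-machine cluster, right after deployment, I subscribed to a topic on port :1883 and published on port :1884, and at first the messages were received. Then, for reasons I can't identify, :1883 stopped receiving messages. The cluster works fine when deployed on my Mac, but not on the Linux server; rebuilding and replacing both the binary and the config made no difference. As far as I can tell the message does reach port :1884, but :1883 never receives it.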

skygp commented 1 year ago

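Specifically, deploying the cluster on Linux triggers a panic, and data is not synchronized between the cluster nodes: the subscriber on :1883 receives nothing, even though the message was definitely published to port 1884.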

skygp commented 1 year ago

2022/10/13 11:03:38 worker exits from a panic: runtime error: invalid memory address or nil pointer dereference
2022/10/13 11:03:38 worker exits from panic: goroutine 42 [running]:
github.com/panjf2000/ants/v2.(*goWorker).run.func1.1()
    /usr/local/bin/pkg/mod/github.com/panjf2000/ants/v2@v2.4.8/worker.go:58 +0x10c

The panic comes mainly from the code below, which ends up calling a method on a nil value. This loop must only run after BootstrapRaft() has been called:

    for i := 0; i < gps; i++ {
        c.inPool.Submit(c.processInboundMsg)
    }
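The ordering problem described here (workers running before BootstrapRaft() has set up the state they use) can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `cluster`, `store`, `bootstrap`, and `startWorkers` are stand-ins invented for the example, not comqtt's actual types, and plain goroutines replace the ants pool.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// store stands in for the Raft-backed state the workers depend on.
type store struct {
	mu      sync.Mutex
	applied int
}

func (s *store) apply() {
	s.mu.Lock()
	s.applied++
	s.mu.Unlock()
}

// cluster is a hypothetical stand-in; raft stays nil until bootstrap runs.
type cluster struct {
	raft *store
}

func (c *cluster) bootstrap() { c.raft = &store{} }

// startWorkers refuses to start before bootstrap, returning an error
// instead of letting the workers dereference a nil pointer.
func (c *cluster) startWorkers(n int) error {
	if c.raft == nil {
		return errors.New("startWorkers called before bootstrap")
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.raft.apply()
		}()
	}
	wg.Wait()
	return nil
}

func main() {
	c := &cluster{}
	fmt.Println("before bootstrap:", c.startWorkers(4))
	c.bootstrap()
	fmt.Println("after bootstrap:", c.startWorkers(4))
}
```

An explicit nil check turns the crash into an error at the call site. Whether comqtt should guard the call or simply reorder its startup is a design decision for the maintainer.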

wind-c commented 1 year ago

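Pull the latest code. Yesterday I added a cluster parameter, bind-addr. Set it to the node's private (internal) IP; localhost cannot be used. Use private IPs in members as well, so that the cluster nodes communicate over the private network. I tested this on three Linux CentOS machines in the cloud and it runs normally. Reference configs: 1.jpg, 2.jpg, 3.jpg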
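A quick way to sanity-check a candidate bind-addr value against this advice is sketched below. `checkBindAddr` is a hypothetical validator, not comqtt's own code. Requiring a literal IP and treating "private" as the RFC 1918 / RFC 4193 ranges reported by `net.IP.IsPrivate` (available since Go 1.17) are assumptions of this sketch; the thread itself only states that localhost must not be used and that the private IP works.

```go
package main

import (
	"fmt"
	"net"
)

// checkBindAddr is a hypothetical validator (not part of comqtt).
// It accepts only literal, non-loopback, private IP addresses.
func checkBindAddr(addr string) error {
	ip := net.ParseIP(addr)
	if ip == nil {
		// Covers "localhost" and any other hostname.
		return fmt.Errorf("bind-addr %q is not a literal IP address", addr)
	}
	if ip.IsLoopback() {
		return fmt.Errorf("bind-addr %q is a loopback address", addr)
	}
	if !ip.IsPrivate() {
		// Assumption of this sketch: nodes talk over the private network.
		return fmt.Errorf("bind-addr %q is not a private address", addr)
	}
	return nil
}

func main() {
	for _, addr := range []string{"localhost", "127.0.0.1", "172.17.224.50"} {
		if err := checkBindAddr(addr); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", addr)
	}
}
```

Of the three candidates, only 172.17.224.50 passes, since it falls inside the 172.16.0.0/12 private range.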

skygp commented 1 year ago

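Thanks a lot. I haven't got the cross-machine cluster working yet, but the single-machine cluster works fine. One question: have you ever seen "Error: Connection reset by peer" when connecting to Redis? The cross-machine cluster uses a single Redis service, and my secondary node keeps getting "Connection reset by peer" when it connects to the Redis service on the primary machine. I've been stuck on this for a long time, and it feels like the last missing step.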