fvigotti opened this issue 5 years ago
Another element that may help...
When, from the non-working node, I stop pinging the non-reachable VIP, the arp table drops the (incomplete)
entry and shows the correct HWaddress again .... :sob:
root@bca-1:~# arp | grep "40.31"
10.34.40.31 ether a2:31:cb:dc:3e:98 C weave
root@bca-1:~# arp | grep "40.31"
10.34.40.31 ether a2:31:cb:dc:3e:98 C weave
# --- now I start again to ping 10.34.40.31 .... and ...
root@bca-1:~# arp | grep "40.31"
10.34.40.31 (incomplete) weave
root@bca-1:~# arp | grep "40.31"
10.34.40.31 (incomplete) weave
root@bca-1:~# arp | grep "40.31"
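For completeness, a quick way to watch the entry flip while a ping runs in another terminal (just a sketch; the IP and the weave interface name are taken from the output above):

VIP=10.34.40.31   # the pod VIP that goes (incomplete) while being pinged
while true; do
  # print the time plus the current neighbor entry for the VIP on the weave interface
  echo -n "$(date +%T)  "
  ip neigh show "$VIP" dev weave
  sleep 1
done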
2 nodes in the cluster cannot reach the cluster dns VIP, I can ping the service VIP or the pod VIP other nodes, but those two cannot
@fvigotti Are you able to access the pod IP of the DNS service directly from the nodes where you are seeing the problem, or does the problem only show up when going through the service VIP?
From the problem nodes, does accessing any other service work fine?
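For example, something along these lines should tell the two cases apart (only a sketch; the label and the pod/service IP placeholders depend on your actual coredns deployment):

# find the DNS pod IPs (k8s-app=kube-dns is the usual label, adjust if yours differs)
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# query a DNS pod IP directly, bypassing the service VIP
dig @<dns-pod-ip> kubernetes.default.svc.cluster.local +short

# query through the service VIP (the clusterIP of the kube-dns/coredns service)
dig @<dns-service-ip> kubernetes.default.svc.cluster.local +short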
@murali-reddy From the nodes with the problem, and from the pods on those nodes, I cannot access the pod IP and I cannot access the service VIP either (maybe the problem is just that the service redirects to that pod; I don't know how to check the connection to the service IP before the redirection takes place). I can access the public IP of the node where the coredns pod is deployed from every node, including the ones with the VIP problem.
IMPORTANT anyway: further testing inspired by your message showed that ALL the pod VIPs deployed on ONE specific node (the one hosting coredns) are unreachable from those two problematic nodes, and from all the pods deployed there (the ones with the incomplete arp entries).
Because there are two problematic nodes and only one "unreachable" node (unreachable meaning the pod VIPs hosted on that node), I'll restart weave on one of the two problematic nodes and post whether the issue goes away on that node, to tell whether the problem lies with the "weave server" or the "weave client". But before that I need help investigating what the cause may be. As for detecting the issue, I can just ping the VIPs of a globally distributed pod/daemonset that doesn't use hostNetwork, and from the ping failures I can discover which weave instances are failing...
I also want to add that new pods deployed on the "unreachable" node are unreachable on their VIPs too, from those two problematic nodes.
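To confirm the pattern, a rough check I can run from one of the problematic nodes (the node name is a placeholder, and the exact kubectl incantation just illustrates the idea):

NODE=<node-hosting-coredns>   # hypothetical placeholder
# collect the pod IPs of everything scheduled on that node
ips=$(kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE" \
      -o jsonpath='{.items[*].status.podIP}')
for ip in $ips; do
  # one ping per pod IP is enough to spot the unreachable ones
  if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
    echo "ok    $ip"
  else
    echo "FAIL  $ip"
  fi
done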
I have had to restart those nodes to bring some services back up (this is a production cluster). Anyway, I've restarted the weave pod on one of the nodes that cannot ping, and after ~20 seconds it was able to ping the previously unreachable VIPs.
Then I've restarted weave on the node whose VIPs were not reachable... after ~30 seconds the non-restarted problematic node started working again. So the problem was fixed by restarting either the "src" OR the "dst" weave of those unreachable VIPs...
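For reference, by "restarting weave on a node" I just mean deleting that node's daemonset pod so it gets recreated, roughly like this (assuming the stock name=weave-net label on the pods):

NODE=<problematic-node>   # hypothetical placeholder
# find the weave pod scheduled on that node, then delete it; the daemonset recreates it
pod=$(kubectl -n kube-system get pods -l name=weave-net -o wide | grep "$NODE" | awk '{print $1}')
kubectl -n kube-system delete pod "$pod"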
@fvigotti I assume your cluster is stable now, but if you run into the issue again, for troubleshooting please check whether the nodes are able to establish connections by running weave status connections
and weave status peers
on both of the nodes which cannot talk to each other.
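If the weave script is not installed on the hosts, the same information should be available from the router's local status endpoint, e.g. (assuming the default port):

# run on the node itself; weave's status API listens on localhost:6784 by default
curl -s http://127.0.0.1:6784/status/connections
curl -s http://127.0.0.1:6784/status/peers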
@murali-reddy I wouldn't call it stable, because this problem has happened twice (in the 3 months those nodes have been up) and I don't know what the cause was. Also, at the moment the auto-healing is a hacked crontab job on all the nodes, with something like this:
# pod IPs of a monitor daemonset spread across all nodes (no hostNetwork)
ips=`kubectl get pods --namespace=kube-system -o=jsonpath="{..status.podIP}" -l name=weave-monitor-pong`
for ip in $ips; do
  echo "ip : $ip"
  # discard both stdout and stderr (redirect stdout first, then stderr)
  ping "$ip" -c 1 >/dev/null 2>&1
  if [ $? = 0 ]; then
    echo ok
  else
    echo ERROR
  fi
done
to find problems, and then there is some restart logic...
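The restart part is roughly this shape (illustrative only, not the exact cron job; restart_local_weave_pod is a hypothetical helper wrapping the kubectl delete shown a few comments above):

failures=0
for ip in $ips; do
  ping "$ip" -c 1 >/dev/null 2>&1 || failures=$((failures + 1))
done
# above zero failures, bounce this node's weave pod via the hypothetical helper
[ "$failures" -gt 0 ] && restart_local_weave_pod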
As I said in my first post, those commands show 0 errors/issues (their output is the same as it is now):
all connections were established, encrypted, fastdp,
even to/from the node whose VIPs were unreachable...
... just one small difference: with weave status peers, the arrows from the problematic node used to point partly right and partly left; now they all point to the right.
weave 2.5.0
kube 1.12.2
Docker version 18.03.1-ce, build 9ee9f40
ubuntu 18.04, kernel 4.15.0-43-generic
cluster of ~15 nodes
After a couple of months of runtime, the whole production cluster went down some weeks ago. I was more worried about bringing everything back up than about debugging, but a quick inspection showed that the cluster dns service was unreachable; inspection from pods (on the weave network) and from hosts showed the same problem: nodes could ping each other, but some VIPs were unreachable from some nodes.
I restarted all weave pods and everything started working again. Now, while inspecting issues on some zookeeper nodes, I've found it again: 2 nodes in the cluster cannot reach the cluster dns VIP. I can ping the service VIP or the pod VIP from other nodes, but those two cannot.
weave logs show nothing strange/different between hosts, nor do those commands..
iptables seems fine:
iptables -t nat -nvL | grep '10.34.40.31'
    0     0 KUBE-MARK-MASQ  all  --  10.34.40.31   0.0.0.0/0
    0     0 DNAT            tcp  --  0.0.0.0/0     0.0.0.0/0     tcp to:10.34.40.31:53
    0     0 KUBE-MARK-MASQ  all  --  10.34.40.31   0.0.0.0/0
    0     0 DNAT            udp  --  0.0.0.0/0     0.0.0.0/0     udp to:10.34.40.31:53
10.34.40.31 (incomplete) weave
10.34.40.31 ether a2:31:cb:dc:3e:98 C weave
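Since the working entry resolves to a2:31:cb:dc:3e:98, one more comparison between a broken and a working node is whether that MAC is known on the weave bridge at all (just a sketch; this assumes the default weave bridge/interface name):

# neighbor table entry for the coredns pod IP on the weave interface
ip neigh show 10.34.40.31 dev weave
# forwarding-database entry for that MAC on the weave bridge
bridge fdb show br weave | grep -i a2:31:cb:dc:3e:98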