weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

Weave down with: '[boltDB] Unable to open /weavedb/weave-netdata.db: timeout' #3246

Open · opened 6 years ago by dogopupper

dogopupper commented 6 years ago

What you expected to happen?

Weave to be working 🌮

What happened?

The cluster's Docker version was downgraded from 17.06 to 1.12.6 a few days ago and everything seemed to be working normally, but weave appears to have broken over the weekend.

The logs of the weave container consist solely of: [boltDB] Unable to open /weavedb/weave-netdata.db: timeout

How to reproduce it?

no idea

Anything else we need to know?

k8s 1.8.7 on OpenStack, Docker 1.12.6, weave 2.2.0, CoreOS stable 1632.2.1

Versions:

see above ^^

Logs:

weave logs: [boltDB] Unable to open /weavedb/weave-netdata.db: timeout

kubelet logs: bash[4519]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/499c90a7945d984024acf1fbc87fde11d9b0ee496c4ffe7d98c7509a7af2c001: dial tcp 127.0.0.1:6784: i/o timeout

Network:

Taken from a k8s node:

$ ip route

default via 10.118.0.1 dev eth0 proto dhcp src 10.118.10.110 metric 1024
10.118.0.0/18 dev eth0 proto kernel scope link src 10.118.10.110
10.118.0.1 dev eth0 proto dhcp scope link src 10.118.10.110 metric 1024
172.31.0.0/17 dev weave proto kernel scope link src 172.31.0.1
172.31.255.0/24 dev docker0 proto kernel scope link src 172.31.255.1 linkdown

$ ip -4 -o addr

1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 10.118.10.110/18 brd 10.118.63.255 scope global dynamic eth0\       valid_lft 85496sec preferred_lft 85496sec
3: docker0    inet 172.31.255.1/24 scope global docker0\       valid_lft forever preferred_lft forever
20: weave    inet 172.31.0.1/17 brd 172.31.127.255 scope global weave\       valid_lft forever preferred_lft forever

$ sudo iptables-save

# Generated by iptables-save v1.4.21 on Mon Feb 26 10:48:08 2018
*mangle
:PREROUTING ACCEPT [102640:151203753]
:INPUT ACCEPT [58344:140882171]
:FORWARD ACCEPT [44296:10321582]
:OUTPUT ACCEPT [59065:96962763]
:POSTROUTING ACCEPT [103143:107271265]
:WEAVE-IPSEC-IN - [0:0]
:WEAVE-IPSEC-IN-MARK - [0:0]
:WEAVE-IPSEC-OUT - [0:0]
:WEAVE-IPSEC-OUT-MARK - [0:0]
-A INPUT -j WEAVE-IPSEC-IN
-A OUTPUT -j WEAVE-IPSEC-OUT
-A WEAVE-IPSEC-IN-MARK -j MARK --set-xmark 0x20000/0x20000
-A WEAVE-IPSEC-OUT-MARK -j MARK --set-xmark 0x20000/0x20000
COMMIT
# Completed on Mon Feb 26 10:48:08 2018
# Generated by iptables-save v1.4.21 on Mon Feb 26 10:48:08 2018
*nat
:PREROUTING ACCEPT [16:960]
:INPUT ACCEPT [16:960]
:OUTPUT ACCEPT [17:1138]
:POSTROUTING ACCEPT [17:1138]
:DOCKER - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-NODEPORTS - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SEP-2KXSWJ3G64R65RSP - [0:0]
:KUBE-SEP-3J6NXYWQBOXIKDXB - [0:0]
:KUBE-SEP-3MOV7KWFT6UOYARV - [0:0]
:KUBE-SEP-5H7MKIBPE32CQD3K - [0:0]
:KUBE-SEP-ABDII4QYMLMHPYA5 - [0:0]
:KUBE-SEP-DV6SJO4IQHTDYRZ5 - [0:0]
:KUBE-SEP-EH66SQ2S7KAF5AVN - [0:0]
:KUBE-SEP-EM4A7OIKMC6H2HNS - [0:0]
:KUBE-SEP-GJD4Q2SPT7EL6UD2 - [0:0]
:KUBE-SEP-HEYMHQVMDGZCYJZM - [0:0]
:KUBE-SEP-HNAQAA5N5P44RFQM - [0:0]
:KUBE-SEP-I63WX7TZ7WRRCAO3 - [0:0]
:KUBE-SEP-IFO4224EIEA4JSJJ - [0:0]
:KUBE-SEP-JMFPT52EVMG2AWSX - [0:0]
:KUBE-SEP-KX23O7ZWXBQIDYQQ - [0:0]
:KUBE-SEP-LWPDFQBPELW5WK2N - [0:0]
:KUBE-SEP-N4KWIQINASDGWGIJ - [0:0]
:KUBE-SEP-QFD7WIJ65TV7YG4S - [0:0]
:KUBE-SEP-QY5RVKWGFSN5VJ6B - [0:0]
:KUBE-SEP-R2RKL262VYLJJJ6A - [0:0]
:KUBE-SEP-RF5ORNUSDLCPBDFL - [0:0]
:KUBE-SEP-SBUZYW6REI5IHNXW - [0:0]
:KUBE-SEP-U23MNYABAPTDSZ3Z - [0:0]
:KUBE-SEP-V7HFTMXQJ4MBLXA6 - [0:0]
:KUBE-SEP-WQ3TYKXP7F26PYTE - [0:0]
:KUBE-SEP-WQCEDX6RSXGUJRIH - [0:0]
:KUBE-SEP-XJLLKZMUATQQRDBE - [0:0]
:KUBE-SEP-YFMETSEPEPL5MZTR - [0:0]
:KUBE-SERVICES - [0:0]
:KUBE-SVC-2H5WWL56CE74R6US - [0:0]
:KUBE-SVC-3YSE6J6DKVCBUJMI - [0:0]
:KUBE-SVC-4KXUCK74MO3GQCXU - [0:0]
:KUBE-SVC-4L43JJRJOXWXYSYV - [0:0]
:KUBE-SVC-4M4DNSF4I6KEKCAI - [0:0]
:KUBE-SVC-6G7NRX456DTKJAZT - [0:0]
:KUBE-SVC-6HTUXGVEB5TGYK3K - [0:0]
:KUBE-SVC-6SHOD27LJ4JHFMBU - [0:0]
:KUBE-SVC-AT4GZAEQMF4HLK53 - [0:0]
:KUBE-SVC-B3QNTUNNAMROEJGW - [0:0]
:KUBE-SVC-BAGAGJF3VCWDN7J4 - [0:0]
:KUBE-SVC-BJM46V3U5RZHCFRZ - [0:0]
:KUBE-SVC-DHUDXRTLNJWBORS6 - [0:0]
:KUBE-SVC-EABB5ZD3QVIYK33K - [0:0]
:KUBE-SVC-ERIFXISQEP7F7OF4 - [0:0]
:KUBE-SVC-F7IWZEEUI5ZAH7RS - [0:0]
:KUBE-SVC-FCXHW3UZM24LF5S3 - [0:0]
:KUBE-SVC-FPVLEE3QB77AOS4G - [0:0]
:KUBE-SVC-GJFMYARFU4V4XKG3 - [0:0]
:KUBE-SVC-HTZZPU24M3GCLVIG - [0:0]
:KUBE-SVC-I256EYRIZRR4EWBV - [0:0]
:KUBE-SVC-IB7EXPLWLWRICQP2 - [0:0]
:KUBE-SVC-IDSLTWPEKDYIGO53 - [0:0]
:KUBE-SVC-JFYOZBC5JJM6SW3E - [0:0]
:KUBE-SVC-KPEG4KXA2VRTBHTG - [0:0]
:KUBE-SVC-KV2GSBGFJ4SJNZ5U - [0:0]
:KUBE-SVC-KZ56ELFAUCIGRFV6 - [0:0]
:KUBE-SVC-KZIWI6ZSF2FW4XYS - [0:0]
:KUBE-SVC-NADYEGAMELJXYB22 - [0:0]
:KUBE-SVC-NPX46M4PTMTKRN6Y - [0:0]
:KUBE-SVC-NVTFCNULI3Y3YUAZ - [0:0]
:KUBE-SVC-OZQJFAV7V22GSG3X - [0:0]
:KUBE-SVC-POWXKCKDC3QGLZYQ - [0:0]
:KUBE-SVC-PQQ3MNWQ6ABTL4V3 - [0:0]
:KUBE-SVC-PXI6OZZMUNBHO43Q - [0:0]
:KUBE-SVC-RFEWFH5KXN3S5IUZ - [0:0]
:KUBE-SVC-TCOU7JCQXEZGVUNU - [0:0]
:KUBE-SVC-U3722Z4PISVTHSIP - [0:0]
:KUBE-SVC-X5DJKYMEOW7KWC7G - [0:0]
:KUBE-SVC-XHIIIAPK4PYSZYSE - [0:0]
:KUBE-SVC-XIG6MK4FMRNPQBKQ - [0:0]
:KUBE-SVC-YEKPHTHQKQIELXVJ - [0:0]
:KUBE-SVC-YN4PYMRMOF3OES5E - [0:0]
:KUBE-SVC-YNN6W32JE36JQY6C - [0:0]
:KUBE-SVC-YZZTCQRXVLATHIAK - [0:0]
:KUBE-SVC-ZI22QPPC6BY5KODT - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 172.31.255.0/24 ! -o docker0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A DOCKER -i docker0 -j RETURN
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-gcr:https" -m tcp --dport 30397 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-gcr:https" -m tcp --dport 30397 -j KUBE-SVC-YNN6W32JE36JQY6C
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-oceanreleases:https" -m tcp --dport 32515 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-oceanreleases:https" -m tcp --dport 32515 -j KUBE-SVC-YZZTCQRXVLATHIAK
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:scanningrig-freezer-wcs" -m tcp --dport 30000 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:scanningrig-freezer-wcs" -m tcp --dport 30000 -j KUBE-SVC-OZQJFAV7V22GSG3X
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-hub:https" -m tcp --dport 30266 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-hub:https" -m tcp --dport 30266 -j KUBE-SVC-4KXUCK74MO3GQCXU
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:https" -m tcp --dport 32443 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:https" -m tcp --dport 32443 -j KUBE-SVC-BAGAGJF3VCWDN7J4
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-quay:https" -m tcp --dport 30287 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-quay:https" -m tcp --dport 30287 -j KUBE-SVC-KZIWI6ZSF2FW4XYS
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-internal:https" -m tcp --dport 30560 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-internal:https" -m tcp --dport 30560 -j KUBE-SVC-F7IWZEEUI5ZAH7RS
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-ospcfc:https" -m tcp --dport 30090 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-ospcfc:https" -m tcp --dport 30090 -j KUBE-SVC-ZI22QPPC6BY5KODT
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:http" -m tcp --dport 32080 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:http" -m tcp --dport 32080 -j KUBE-SVC-6SHOD27LJ4JHFMBU
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-ocean:https" -m tcp --dport 32141 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "kube-extra/registry-mirror-ocean:https" -m tcp --dport 32141 -j KUBE-SVC-4L43JJRJOXWXYSYV
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-SEP-2KXSWJ3G64R65RSP -s 172.31.24.16/32 -m comment --comment "kube-extra/registry-mirror-ocean:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-2KXSWJ3G64R65RSP -p tcp -m comment --comment "kube-extra/registry-mirror-ocean:https" -m tcp -j DNAT --to-destination 172.31.24.16:5000
-A KUBE-SEP-3J6NXYWQBOXIKDXB -s 172.31.56.5/32 -m comment --comment "gandalf/gandalf:" -j KUBE-MARK-MASQ
-A KUBE-SEP-3J6NXYWQBOXIKDXB -p tcp -m comment --comment "gandalf/gandalf:" -m tcp -j DNAT --to-destination 172.31.56.5:80
-A KUBE-SEP-3MOV7KWFT6UOYARV -s 172.31.16.18/32 -m comment --comment "gandalf/smtp:" -j KUBE-MARK-MASQ
-A KUBE-SEP-3MOV7KWFT6UOYARV -p tcp -m comment --comment "gandalf/smtp:" -m tcp -j DNAT --to-destination 172.31.16.18:25
-A KUBE-SEP-5H7MKIBPE32CQD3K -s 172.31.56.13/32 -m comment --comment "kube-extra/nginx-ingress-controller:scanningrig-freezer-wcs" -j KUBE-MARK-MASQ
-A KUBE-SEP-5H7MKIBPE32CQD3K -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:scanningrig-freezer-wcs" -m tcp -j DNAT --to-destination 172.31.56.13:30000
-A KUBE-SEP-ABDII4QYMLMHPYA5 -s 172.31.16.11/32 -m comment --comment "kube-system/heapster:" -j KUBE-MARK-MASQ
-A KUBE-SEP-ABDII4QYMLMHPYA5 -p tcp -m comment --comment "kube-system/heapster:" -m tcp -j DNAT --to-destination 172.31.16.11:8082
-A KUBE-SEP-DV6SJO4IQHTDYRZ5 -s 172.31.24.14/32 -m comment --comment "kufi/prometheus:web" -j KUBE-MARK-MASQ
-A KUBE-SEP-DV6SJO4IQHTDYRZ5 -p tcp -m comment --comment "kufi/prometheus:web" -m tcp -j DNAT --to-destination 172.31.24.14:9090
-A KUBE-SEP-EH66SQ2S7KAF5AVN -s 172.31.24.1/32 -m comment --comment "kube-extra/registry-mirror-gcr:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-EH66SQ2S7KAF5AVN -p tcp -m comment --comment "kube-extra/registry-mirror-gcr:https" -m tcp -j DNAT --to-destination 172.31.24.1:5000
-A KUBE-SEP-EM4A7OIKMC6H2HNS -s 172.31.24.11/32 -m comment --comment "jetbrainslicense/prometheus:web" -j KUBE-MARK-MASQ
-A KUBE-SEP-EM4A7OIKMC6H2HNS -p tcp -m comment --comment "jetbrainslicense/prometheus:web" -m tcp -j DNAT --to-destination 172.31.24.11:9090
-A KUBE-SEP-GJD4Q2SPT7EL6UD2 -s 172.31.28.11/32 -m comment --comment "gandalf/gandalf:" -j KUBE-MARK-MASQ
-A KUBE-SEP-GJD4Q2SPT7EL6UD2 -p tcp -m comment --comment "gandalf/gandalf:" -m tcp -j DNAT --to-destination 172.31.28.11:80
-A KUBE-SEP-HEYMHQVMDGZCYJZM -s 172.31.24.3/32 -m comment --comment "kube-extra/registry-mirror-hub:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-HEYMHQVMDGZCYJZM -p tcp -m comment --comment "kube-extra/registry-mirror-hub:https" -m tcp -j DNAT --to-destination 172.31.24.3:5000
-A KUBE-SEP-HNAQAA5N5P44RFQM -s 172.31.16.9/32 -m comment --comment "kube-e2etests-deployment-scale-service/kubee2etests:" -j KUBE-MARK-MASQ
-A KUBE-SEP-HNAQAA5N5P44RFQM -p tcp -m comment --comment "kube-e2etests-deployment-scale-service/kubee2etests:" -m tcp -j DNAT --to-destination 172.31.16.9:80
-A KUBE-SEP-I63WX7TZ7WRRCAO3 -s 172.31.16.15/32 -m comment --comment "ospcfccloudplatform/slack-proxy:web" -j KUBE-MARK-MASQ
-A KUBE-SEP-I63WX7TZ7WRRCAO3 -p tcp -m comment --comment "ospcfccloudplatform/slack-proxy:web" -m tcp -j DNAT --to-destination 172.31.16.15:8000
-A KUBE-SEP-IFO4224EIEA4JSJJ -s 172.31.28.12/32 -m comment --comment "kube-extra/default-http-backend:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-IFO4224EIEA4JSJJ -p tcp -m comment --comment "kube-extra/default-http-backend:http" -m tcp -j DNAT --to-destination 172.31.28.12:8080
-A KUBE-SEP-JMFPT52EVMG2AWSX -s 172.31.28.9/32 -m comment --comment "kube-monitoring/kube-state-metrics:http-metrics" -j KUBE-MARK-MASQ
-A KUBE-SEP-JMFPT52EVMG2AWSX -p tcp -m comment --comment "kube-monitoring/kube-state-metrics:http-metrics" -m tcp -j DNAT --to-destination 172.31.28.9:8080
-A KUBE-SEP-KX23O7ZWXBQIDYQQ -s 172.31.24.2/32 -m comment --comment "ospcfccloudplatform/alertmanager:http-metrics" -j KUBE-MARK-MASQ
-A KUBE-SEP-KX23O7ZWXBQIDYQQ -p tcp -m comment --comment "ospcfccloudplatform/alertmanager:http-metrics" -m tcp -j DNAT --to-destination 172.31.24.2:9093
-A KUBE-SEP-LWPDFQBPELW5WK2N -s 172.31.24.19/32 -m comment --comment "kube-extra/registry-mirror-gcr:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-LWPDFQBPELW5WK2N -p tcp -m comment --comment "kube-extra/registry-mirror-gcr:https" -m tcp -j DNAT --to-destination 172.31.24.19:5000
-A KUBE-SEP-N4KWIQINASDGWGIJ -s 172.31.56.18/32 -m comment --comment "screenful/postgres:postgres" -j KUBE-MARK-MASQ
-A KUBE-SEP-N4KWIQINASDGWGIJ -p tcp -m comment --comment "screenful/postgres:postgres" -m tcp -j DNAT --to-destination 172.31.56.18:5432
-A KUBE-SEP-QFD7WIJ65TV7YG4S -s 172.31.28.14/32 -m comment --comment "ospcfccloudplatform/pagerduty-proxy:web" -j KUBE-MARK-MASQ
-A KUBE-SEP-QFD7WIJ65TV7YG4S -p tcp -m comment --comment "ospcfccloudplatform/pagerduty-proxy:web" -m tcp -j DNAT --to-destination 172.31.28.14:8000
-A KUBE-SEP-QY5RVKWGFSN5VJ6B -s 172.31.16.12/32 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
-A KUBE-SEP-QY5RVKWGFSN5VJ6B -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination 172.31.16.12:53
-A KUBE-SEP-R2RKL262VYLJJJ6A -s 172.31.16.13/32 -m comment --comment "kube-extra/kubernetes-dashboard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-R2RKL262VYLJJJ6A -p tcp -m comment --comment "kube-extra/kubernetes-dashboard:" -m tcp -j DNAT --to-destination 172.31.16.13:9090
-A KUBE-SEP-RF5ORNUSDLCPBDFL -s 172.31.56.8/32 -m comment --comment "cfcvisualizer/cfc1-visualizer:web" -j KUBE-MARK-MASQ
-A KUBE-SEP-RF5ORNUSDLCPBDFL -p tcp -m comment --comment "cfcvisualizer/cfc1-visualizer:web" -m tcp -j DNAT --to-destination 172.31.56.8:8080
-A KUBE-SEP-SBUZYW6REI5IHNXW -s 172.31.56.13/32 -m comment --comment "kube-extra/nginx-ingress-controller:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-SBUZYW6REI5IHNXW -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:https" -m tcp -j DNAT --to-destination 172.31.56.13:443
-A KUBE-SEP-U23MNYABAPTDSZ3Z -s 172.31.56.13/32 -m comment --comment "kube-extra/nginx-ingress-controller:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-U23MNYABAPTDSZ3Z -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:http" -m tcp -j DNAT --to-destination 172.31.56.13:80
-A KUBE-SEP-V7HFTMXQJ4MBLXA6 -s 172.31.16.17/32 -m comment --comment "kube-e2etests-http-update/kubee2etests:" -j KUBE-MARK-MASQ
-A KUBE-SEP-V7HFTMXQJ4MBLXA6 -p tcp -m comment --comment "kube-e2etests-http-update/kubee2etests:" -m tcp -j DNAT --to-destination 172.31.16.17:80
-A KUBE-SEP-WQ3TYKXP7F26PYTE -s 172.31.24.17/32 -m comment --comment "kube-extra/registry-mirror-ospcfc:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-WQ3TYKXP7F26PYTE -p tcp -m comment --comment "kube-extra/registry-mirror-ospcfc:https" -m tcp -j DNAT --to-destination 172.31.24.17:5000
-A KUBE-SEP-WQCEDX6RSXGUJRIH -s 10.118.10.109/32 -m comment --comment "default/kubernetes:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-WQCEDX6RSXGUJRIH -p tcp -m comment --comment "default/kubernetes:https" -m recent --set --name KUBE-SEP-WQCEDX6RSXGUJRIH --mask 255.255.255.255 --rsource -m tcp -j DNAT --to-destination 10.118.10.109:443
-A KUBE-SEP-XJLLKZMUATQQRDBE -s 172.31.24.7/32 -m comment --comment "kube-extra/registry-mirror-oceanreleases:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-XJLLKZMUATQQRDBE -p tcp -m comment --comment "kube-extra/registry-mirror-oceanreleases:https" -m tcp -j DNAT --to-destination 172.31.24.7:5000
-A KUBE-SEP-YFMETSEPEPL5MZTR -s 172.31.16.12/32 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-YFMETSEPEPL5MZTR -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination 172.31.16.12:53
-A KUBE-SERVICES -d 172.31.129.114/32 -p tcp -m comment --comment "cfcoutbounddocs/template-hook:template-hook cluster IP" -m tcp --dport 80 -j KUBE-SVC-XIG6MK4FMRNPQBKQ
-A KUBE-SERVICES -d 172.31.129.209/32 -p tcp -m comment --comment "kube-monitoring/e2etests-status:http cluster IP" -m tcp --dport 80 -j KUBE-SVC-FCXHW3UZM24LF5S3
-A KUBE-SERVICES -d 172.31.129.5/32 -p tcp -m comment --comment "kube-extra/registry-mirror-gcr:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-YNN6W32JE36JQY6C
-A KUBE-SERVICES -d 172.31.130.83/32 -p tcp -m comment --comment "kube-extra/prometheus-operator:http cluster IP" -m tcp --dport 8080 -j KUBE-SVC-JFYOZBC5JJM6SW3E
-A KUBE-SERVICES -d 172.31.129.52/32 -p tcp -m comment --comment "isup/isup: cluster IP" -m tcp --dport 80 -j KUBE-SVC-IDSLTWPEKDYIGO53
-A KUBE-SERVICES -d 172.31.129.45/32 -p tcp -m comment --comment "kube-extra/registry-mirror-oceanreleases:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-YZZTCQRXVLATHIAK
-A KUBE-SERVICES -d 172.31.128.26/32 -p tcp -m comment --comment "kube-e2etests-http/kubee2etests: cluster IP" -m tcp --dport 10 -j KUBE-SVC-DHUDXRTLNJWBORS6
-A KUBE-SERVICES -d 172.31.131.105/32 -p tcp -m comment --comment "screenful/screenful:api cluster IP" -m tcp --dport 4000 -j KUBE-SVC-4M4DNSF4I6KEKCAI
-A KUBE-SERVICES -d 172.31.129.233/32 -p tcp -m comment --comment "kube-extra/default-http-backend:http cluster IP" -m tcp --dport 80 -j KUBE-SVC-3YSE6J6DKVCBUJMI
-A KUBE-SERVICES -d 172.31.128.150/32 -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:scanningrig-freezer-wcs cluster IP" -m tcp --dport 30000 -j KUBE-SVC-OZQJFAV7V22GSG3X
-A KUBE-SERVICES -d 172.31.128.186/32 -p tcp -m comment --comment "kube-e2etests-http-update/kubee2etests: cluster IP" -m tcp --dport 10 -j KUBE-SVC-XHIIIAPK4PYSZYSE
-A KUBE-SERVICES -d 172.31.128.207/32 -p tcp -m comment --comment "jetbrainslicense/prometheus:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-B3QNTUNNAMROEJGW
-A KUBE-SERVICES -d 172.31.129.97/32 -p tcp -m comment --comment "cfcoutbounddocs/gollum:http cluster IP" -m tcp --dport 80 -j KUBE-SVC-KPEG4KXA2VRTBHTG
-A KUBE-SERVICES -d 172.31.128.137/32 -p tcp -m comment --comment "screenful/postgres:postgres cluster IP" -m tcp --dport 5432 -j KUBE-SVC-2H5WWL56CE74R6US
-A KUBE-SERVICES -d 172.31.128.2/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 172.31.129.64/32 -p tcp -m comment --comment "kube-extra/registry-mirror-hub:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-4KXUCK74MO3GQCXU
-A KUBE-SERVICES -d 172.31.131.99/32 -p tcp -m comment --comment "kube-monitoring/kube-state-metrics:http-metrics cluster IP" -m tcp --dport 8080 -j KUBE-SVC-AT4GZAEQMF4HLK53
-A KUBE-SERVICES -d 172.31.128.124/32 -p tcp -m comment --comment "ospcfccloudplatform/pagerduty-proxy:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-RFEWFH5KXN3S5IUZ
-A KUBE-SERVICES -d 172.31.128.150/32 -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-BAGAGJF3VCWDN7J4
-A KUBE-SERVICES -d 172.31.129.121/32 -p tcp -m comment --comment "cfcvisualizer/cfc1-visualizer:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-IB7EXPLWLWRICQP2
-A KUBE-SERVICES -d 172.31.128.176/32 -p tcp -m comment --comment "kube-monitoring/e2etests-prometheus:prometheus cluster IP" -m tcp --dport 80 -j KUBE-SVC-NADYEGAMELJXYB22
-A KUBE-SERVICES -d 172.31.128.51/32 -p tcp -m comment --comment "kufi/kufi:http-metrics cluster IP" -m tcp --dport 80 -j KUBE-SVC-PQQ3MNWQ6ABTL4V3
-A KUBE-SERVICES -d 172.31.128.2/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SERVICES -d 172.31.129.203/32 -p tcp -m comment --comment "kube-monitoring/weave-scope-app:app cluster IP" -m tcp --dport 80 -j KUBE-SVC-NVTFCNULI3Y3YUAZ
-A KUBE-SERVICES -d 172.31.128.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 172.31.129.224/32 -p tcp -m comment --comment "kube-extra/registry-mirror-quay:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-KZIWI6ZSF2FW4XYS
-A KUBE-SERVICES -d 172.31.128.233/32 -p tcp -m comment --comment "kube-extra/registry-mirror-internal:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-F7IWZEEUI5ZAH7RS
-A KUBE-SERVICES -d 172.31.128.181/32 -p tcp -m comment --comment "kube-monitoring/grafana:web cluster IP" -m tcp --dport 3000 -j KUBE-SVC-X5DJKYMEOW7KWC7G
-A KUBE-SERVICES -d 172.31.130.89/32 -p tcp -m comment --comment "gandalf/gandalf: cluster IP" -m tcp --dport 80 -j KUBE-SVC-GJFMYARFU4V4XKG3
-A KUBE-SERVICES -d 172.31.130.115/32 -p tcp -m comment --comment "kube-monitoring/prometheus:web cluster IP" -m tcp --dport 9090 -j KUBE-SVC-U3722Z4PISVTHSIP
-A KUBE-SERVICES -d 172.31.129.122/32 -p tcp -m comment --comment "jetbrainslicense/app:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-FPVLEE3QB77AOS4G
-A KUBE-SERVICES -d 172.31.128.244/32 -p tcp -m comment --comment "ospcfccloudplatform/slack-proxy:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-PXI6OZZMUNBHO43Q
-A KUBE-SERVICES -d 172.31.128.60/32 -p tcp -m comment --comment "kube-e2etests-deployment-scale-service/kubee2etests: cluster IP" -m tcp --dport 10 -j KUBE-SVC-I256EYRIZRR4EWBV
-A KUBE-SERVICES -d 172.31.128.77/32 -p tcp -m comment --comment "cfcvisualizer/cfc2-visualizer:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-6HTUXGVEB5TGYK3K
-A KUBE-SERVICES -d 172.31.128.113/32 -p tcp -m comment --comment "plantuml/plantuml:web cluster IP" -m tcp --dport 8080 -j KUBE-SVC-6G7NRX456DTKJAZT
-A KUBE-SERVICES -d 172.31.129.155/32 -p tcp -m comment --comment "cfcoutbounddocs/wiki-hook:wiki-hook cluster IP" -m tcp --dport 80 -j KUBE-SVC-EABB5ZD3QVIYK33K
-A KUBE-SERVICES -d 172.31.131.39/32 -p tcp -m comment --comment "kube-system/heapster: cluster IP" -m tcp --dport 80 -j KUBE-SVC-BJM46V3U5RZHCFRZ
-A KUBE-SERVICES -d 172.31.130.40/32 -p tcp -m comment --comment "kube-extra/kubernetes-dashboard: cluster IP" -m tcp --dport 80 -j KUBE-SVC-KZ56ELFAUCIGRFV6
-A KUBE-SERVICES -d 172.31.131.105/32 -p tcp -m comment --comment "screenful/screenful:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-YN4PYMRMOF3OES5E
-A KUBE-SERVICES -d 172.31.131.239/32 -p tcp -m comment --comment "ospcfccloudplatform/alertmanager:http-metrics cluster IP" -m tcp --dport 9090 -j KUBE-SVC-KV2GSBGFJ4SJNZ5U
-A KUBE-SERVICES -d 172.31.128.188/32 -p tcp -m comment --comment "kube-extra/registry-mirror-ospcfc:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-ZI22QPPC6BY5KODT
-A KUBE-SERVICES -d 172.31.131.185/32 -p tcp -m comment --comment "kufi/prometheus:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-YEKPHTHQKQIELXVJ
-A KUBE-SERVICES -d 172.31.129.48/32 -p tcp -m comment --comment "kube-e2etests-deployment-service/kubee2etests: cluster IP" -m tcp --dport 10 -j KUBE-SVC-POWXKCKDC3QGLZYQ
-A KUBE-SERVICES -d 172.31.128.150/32 -p tcp -m comment --comment "kube-extra/nginx-ingress-controller:http cluster IP" -m tcp --dport 80 -j KUBE-SVC-6SHOD27LJ4JHFMBU
-A KUBE-SERVICES -d 172.31.128.56/32 -p tcp -m comment --comment "gandalf/smtp: cluster IP" -m tcp --dport 25 -j KUBE-SVC-HTZZPU24M3GCLVIG
-A KUBE-SERVICES -d 172.31.129.96/32 -p tcp -m comment --comment "kube-extra/registry-mirror-ocean:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-4L43JJRJOXWXYSYV
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-SVC-2H5WWL56CE74R6US -m comment --comment "screenful/postgres:postgres" -j KUBE-SEP-N4KWIQINASDGWGIJ
-A KUBE-SVC-3YSE6J6DKVCBUJMI -m comment --comment "kube-extra/default-http-backend:http" -j KUBE-SEP-IFO4224EIEA4JSJJ
-A KUBE-SVC-4KXUCK74MO3GQCXU -m comment --comment "kube-extra/registry-mirror-hub:https" -j KUBE-SEP-HEYMHQVMDGZCYJZM
-A KUBE-SVC-4L43JJRJOXWXYSYV -m comment --comment "kube-extra/registry-mirror-ocean:https" -j KUBE-SEP-2KXSWJ3G64R65RSP
-A KUBE-SVC-6SHOD27LJ4JHFMBU -m comment --comment "kube-extra/nginx-ingress-controller:http" -j KUBE-SEP-U23MNYABAPTDSZ3Z
-A KUBE-SVC-AT4GZAEQMF4HLK53 -m comment --comment "kube-monitoring/kube-state-metrics:http-metrics" -j KUBE-SEP-JMFPT52EVMG2AWSX
-A KUBE-SVC-B3QNTUNNAMROEJGW -m comment --comment "jetbrainslicense/prometheus:web" -j KUBE-SEP-EM4A7OIKMC6H2HNS
-A KUBE-SVC-BAGAGJF3VCWDN7J4 -m comment --comment "kube-extra/nginx-ingress-controller:https" -j KUBE-SEP-SBUZYW6REI5IHNXW
-A KUBE-SVC-BJM46V3U5RZHCFRZ -m comment --comment "kube-system/heapster:" -j KUBE-SEP-ABDII4QYMLMHPYA5
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-SEP-YFMETSEPEPL5MZTR
-A KUBE-SVC-GJFMYARFU4V4XKG3 -m comment --comment "gandalf/gandalf:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-GJD4Q2SPT7EL6UD2
-A KUBE-SVC-GJFMYARFU4V4XKG3 -m comment --comment "gandalf/gandalf:" -j KUBE-SEP-3J6NXYWQBOXIKDXB
-A KUBE-SVC-HTZZPU24M3GCLVIG -m comment --comment "gandalf/smtp:" -j KUBE-SEP-3MOV7KWFT6UOYARV
-A KUBE-SVC-I256EYRIZRR4EWBV -m comment --comment "kube-e2etests-deployment-scale-service/kubee2etests:" -j KUBE-SEP-HNAQAA5N5P44RFQM
-A KUBE-SVC-IB7EXPLWLWRICQP2 -m comment --comment "cfcvisualizer/cfc1-visualizer:web" -j KUBE-SEP-RF5ORNUSDLCPBDFL
-A KUBE-SVC-KV2GSBGFJ4SJNZ5U -m comment --comment "ospcfccloudplatform/alertmanager:http-metrics" -j KUBE-SEP-KX23O7ZWXBQIDYQQ
-A KUBE-SVC-KZ56ELFAUCIGRFV6 -m comment --comment "kube-extra/kubernetes-dashboard:" -j KUBE-SEP-R2RKL262VYLJJJ6A
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -m recent --rcheck --seconds 10800 --reap --name KUBE-SEP-WQCEDX6RSXGUJRIH --mask 255.255.255.255 --rsource -j KUBE-SEP-WQCEDX6RSXGUJRIH
-A KUBE-SVC-NPX46M4PTMTKRN6Y -m comment --comment "default/kubernetes:https" -j KUBE-SEP-WQCEDX6RSXGUJRIH
-A KUBE-SVC-OZQJFAV7V22GSG3X -m comment --comment "kube-extra/nginx-ingress-controller:scanningrig-freezer-wcs" -j KUBE-SEP-5H7MKIBPE32CQD3K
-A KUBE-SVC-PXI6OZZMUNBHO43Q -m comment --comment "ospcfccloudplatform/slack-proxy:web" -j KUBE-SEP-I63WX7TZ7WRRCAO3
-A KUBE-SVC-RFEWFH5KXN3S5IUZ -m comment --comment "ospcfccloudplatform/pagerduty-proxy:web" -j KUBE-SEP-QFD7WIJ65TV7YG4S
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns" -j KUBE-SEP-QY5RVKWGFSN5VJ6B
-A KUBE-SVC-XHIIIAPK4PYSZYSE -m comment --comment "kube-e2etests-http-update/kubee2etests:" -j KUBE-SEP-V7HFTMXQJ4MBLXA6
-A KUBE-SVC-YEKPHTHQKQIELXVJ -m comment --comment "kufi/prometheus:web" -j KUBE-SEP-DV6SJO4IQHTDYRZ5
-A KUBE-SVC-YNN6W32JE36JQY6C -m comment --comment "kube-extra/registry-mirror-gcr:https" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-EH66SQ2S7KAF5AVN
-A KUBE-SVC-YNN6W32JE36JQY6C -m comment --comment "kube-extra/registry-mirror-gcr:https" -j KUBE-SEP-LWPDFQBPELW5WK2N
-A KUBE-SVC-YZZTCQRXVLATHIAK -m comment --comment "kube-extra/registry-mirror-oceanreleases:https" -j KUBE-SEP-XJLLKZMUATQQRDBE
-A KUBE-SVC-ZI22QPPC6BY5KODT -m comment --comment "kube-extra/registry-mirror-ospcfc:https" -j KUBE-SEP-WQ3TYKXP7F26PYTE
-A WEAVE -s 172.31.0.0/17 -d 224.0.0.0/4 -j RETURN
-A WEAVE ! -s 172.31.0.0/17 -d 172.31.0.0/17 -j MASQUERADE
-A WEAVE -s 172.31.0.0/17 ! -d 172.31.0.0/17 -j MASQUERADE
COMMIT
# Completed on Mon Feb 26 10:48:08 2018
# Generated by iptables-save v1.4.21 on Mon Feb 26 10:48:08 2018
*filter
:INPUT ACCEPT [245:420936]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [292:280248]
:DOCKER - [0:0]
:DOCKER-ISOLATION - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-SERVICES - [0:0]
:WEAVE-IPSEC-IN - [0:0]
:WEAVE-NPC - [0:0]
:WEAVE-NPC-DEFAULT - [0:0]
:WEAVE-NPC-INGRESS - [0:0]
-A INPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A INPUT -j KUBE-FIREWALL
-A INPUT -j WEAVE-IPSEC-IN
-A FORWARD -o weave -m comment --comment "NOTE: this must go before \'-j KUBE-FORWARD\'" -j WEAVE-NPC
-A FORWARD -o weave -m state --state NEW -j NFLOG --nflog-group 86
-A FORWARD -o weave -j DROP
-A FORWARD -i weave ! -o weave -j ACCEPT
-A FORWARD -o weave -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -m comment --comment "kubernetes forward rules" -j KUBE-FORWARD
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -j KUBE-FIREWALL
-A OUTPUT ! -p esp -m policy --dir out --pol none -m mark --mark 0x20000/0x20000 -j DROP
-A DOCKER-ISOLATION -j RETURN
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
-A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
-A KUBE-SERVICES -d 172.31.129.114/32 -p tcp -m comment --comment "cfcoutbounddocs/template-hook:template-hook has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.129.209/32 -p tcp -m comment --comment "kube-monitoring/e2etests-status:http has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.130.83/32 -p tcp -m comment --comment "kube-extra/prometheus-operator:http has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.129.52/32 -p tcp -m comment --comment "isup/isup: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.128.26/32 -p tcp -m comment --comment "kube-e2etests-http/kubee2etests: has no endpoints" -m tcp --dport 10 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.131.105/32 -p tcp -m comment --comment "screenful/screenful:api has no endpoints" -m tcp --dport 4000 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.129.97/32 -p tcp -m comment --comment "cfcoutbounddocs/gollum:http has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.128.176/32 -p tcp -m comment --comment "kube-monitoring/e2etests-prometheus:prometheus has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.128.51/32 -p tcp -m comment --comment "kufi/kufi:http-metrics has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.129.203/32 -p tcp -m comment --comment "kube-monitoring/weave-scope-app:app has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -p tcp -m comment --comment "kube-extra/registry-mirror-quay:https has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 30287 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.129.224/32 -p tcp -m comment --comment "kube-extra/registry-mirror-quay:https has no endpoints" -m tcp --dport 443 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -p tcp -m comment --comment "kube-extra/registry-mirror-internal:https has no endpoints" -m addrtype --dst-type LOCAL -m tcp --dport 30560 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.128.233/32 -p tcp -m comment --comment "kube-extra/registry-mirror-internal:https has no endpoints" -m tcp --dport 443 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.128.181/32 -p tcp -m comment --comment "kube-monitoring/grafana:web has no endpoints" -m tcp --dport 3000 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.130.115/32 -p tcp -m comment --comment "kube-monitoring/prometheus:web has no endpoints" -m tcp --dport 9090 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.129.122/32 -p tcp -m comment --comment "jetbrainslicense/app:web has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.128.77/32 -p tcp -m comment --comment "cfcvisualizer/cfc2-visualizer:web has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.128.113/32 -p tcp -m comment --comment "plantuml/plantuml:web has no endpoints" -m tcp --dport 8080 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.129.155/32 -p tcp -m comment --comment "cfcoutbounddocs/wiki-hook:wiki-hook has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.131.105/32 -p tcp -m comment --comment "screenful/screenful:web has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 172.31.129.48/32 -p tcp -m comment --comment "kube-e2etests-deployment-service/kubee2etests: has no endpoints" -m tcp --dport 10 -j REJECT --reject-with icmp-port-unreachable
-A WEAVE-NPC -m state --state RELATED,ESTABLISHED -j ACCEPT
-A WEAVE-NPC -d 224.0.0.0/4 -j ACCEPT
-A WEAVE-NPC -m state --state NEW -j WEAVE-NPC-DEFAULT
-A WEAVE-NPC -m state --state NEW -j WEAVE-NPC-INGRESS
-A WEAVE-NPC -m set ! --match-set weave-local-pods dst -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-#EheO3ZItEJ.^x%PP*MH[/9E+ dst -m comment --comment "DefaultAllow isolation for namespace: gandalf" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-.fv|;78)[sGktF6)=*]jI4Tzo dst -m comment --comment "DefaultAllow isolation for namespace: gitlabrunnerocean" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-GaJTl9EvYyV)Ui{J20t!7~+(H dst -m comment --comment "DefaultAllow isolation for namespace: kube-extra" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-t9WH]%=*W+xUp]c*NGE.lh258 dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-deployment-scale" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-;oEHxPZ|LYQq3KH=Z7.9+WiZz dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-service" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-?b%zl9GIe0AET1(QI^7NWe*fO dst -m comment --comment "DefaultAllow isolation for namespace: kube-system" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-TzHghne5mZ=1atGdjfBnT]l$t dst -m comment --comment "DefaultAllow isolation for namespace: screenful" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-zcX@UIG[?l$U%$E#%D$#coMtR dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-deployment-pvc" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-!PXEDV0])TiN5$Ka{}?|Y=T@v dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-deployment-scale-service" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-+lUL3KHugjNi|hV_4BE)7KB6( dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-deployment-service" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-J|~O_zT{YJG@Wk1K*J4Xa%F~6 dst -m comment --comment "DefaultAllow isolation for namespace: plantuml" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-E.1.0W^NGSp]0_t5WwH/]gX@L dst -m comment --comment "DefaultAllow isolation for namespace: default" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-x3(I@nj+n0UeHz:1s[0Vtq$^E dst -m comment --comment "DefaultAllow isolation for namespace: jira" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-_;}sn~[IsWL91CmlA^OZ.YFNF dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-deployment" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-B#x0ps?1iBmscA+9EEF*O!tL] dst -m comment --comment "DefaultAllow isolation for namespace: kube-monitoring" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-0EHD/vdN#O4]V?o4Tx7kS;APH dst -m comment --comment "DefaultAllow isolation for namespace: kube-public" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-[^!sa.btx]%tLV%@B+ydb3pnv dst -m comment --comment "DefaultAllow isolation for namespace: wibble" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-m;%)4zjL=PkY4X+f:!KDJzo^t dst -m comment --comment "DefaultAllow isolation for namespace: cfcoutbounddocs" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-iwih12$kp.{h5MubUa7Sry;KH dst -m comment --comment "DefaultAllow isolation for namespace: k8sbackup" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-0n[?CWWqs9TdzQdqfi7u!sR~$ dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-deployment-update" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-gzf@2cfciqd8j{O]xX/V=.FuB dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-http-update" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-nsgme7?2w4ng/cvjqQ[Q]9kgb dst -m comment --comment "DefaultAllow isolation for namespace: referee" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-o*_v7aCg6Eoi|c??FXQ=FNfJz dst -m comment --comment "DefaultAllow isolation for namespace: cfcvisualizer" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-x_{Gk/BE.SzODFcYd~mr*{qjr dst -m comment --comment "DefaultAllow isolation for namespace: gitlab-ci-runner" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-!94c*z+5Z4_*UnD=wQYvR7VK5 dst -m comment --comment "DefaultAllow isolation for namespace: gitlabrunneratmosphere" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-Ht2!qT$(=SFvg.@H+.YIa**?7 dst -m comment --comment "DefaultAllow isolation for namespace: jetbrainslicense" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-x@!J22P{IAf[##kV$XcGDgU.P dst -m comment --comment "DefaultAllow isolation for namespace: kufi" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-o*_rB|Y:TBp_x(GZUtHh@oaoU dst -m comment --comment "DefaultAllow isolation for namespace: gitlabrunnerospcfc" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-k}p)VSR73ma+sXQn^Y)jrb2sV dst -m comment --comment "DefaultAllow isolation for namespace: isup" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-x69P=(mxyEVYqgnA8vGKCD%@S dst -m comment --comment "DefaultAllow isolation for namespace: kube-e2etests-http" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-vzOp$M]@$|DaK5T+^CH|V|%?q dst -m comment --comment "DefaultAllow isolation for namespace: ospcfccloudplatform" -j ACCEPT
-A WEAVE-NPC-DEFAULT -m set --match-set weave-j~!0[i?Vp[9oXoRhHYQe.;6uf dst -m comment --comment "DefaultAllow isolation for namespace: someappid-old" -j ACCEPT
-A WEAVE-NPC-INGRESS -p udp -m set --match-set weave-zEoo^_Rj?zrFYdzVmFr+/0pSC src -m set --match-set weave-zEoo^_Rj?zrFYdzVmFr+/0pSC dst -m udp --dport 6783 -m comment --comment "pods: namespace: kube-system, selector: name=weave-net -> pods: namespace: kube-system, selector: name=weave-net" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-zEoo^_Rj?zrFYdzVmFr+/0pSC src -m set --match-set weave-zEoo^_Rj?zrFYdzVmFr+/0pSC dst -m tcp --dport 6784 -m comment --comment "pods: namespace: kube-system, selector: name=weave-net -> pods: namespace: kube-system, selector: name=weave-net" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-EILUbui)olb+;L0.jx^wcdopt src -m set --match-set weave-zEoo^_Rj?zrFYdzVmFr+/0pSC dst -m tcp --dport 6782 -m comment --comment "namespaces: selector: appId=kube,component=monitoring -> pods: namespace: kube-system, selector: name=weave-net" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-zEoo^_Rj?zrFYdzVmFr+/0pSC src -m set --match-set weave-zEoo^_Rj?zrFYdzVmFr+/0pSC dst -m tcp --dport 6783 -m comment --comment "pods: namespace: kube-system, selector: name=weave-net -> pods: namespace: kube-system, selector: name=weave-net" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-9_PlPr(s6{H85!WoFBPVFC#9u dst -m tcp --dport 8080 -m comment --comment "anywhere -> pods: namespace: plantuml, selector: app=plantuml" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-p][tP41OWT#]xFiMbn*qOsqqk dst -m tcp --dport 80 -m comment --comment "anywhere -> pods: namespace: cfcoutbounddocs, selector: app=cfcoutbounddocs" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-p][tP41OWT#]xFiMbn*qOsqqk dst -m tcp --dport 8081 -m comment --comment "anywhere -> pods: namespace: cfcoutbounddocs, selector: app=cfcoutbounddocs" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-p][tP41OWT#]xFiMbn*qOsqqk dst -m tcp --dport 8082 -m comment --comment "anywhere -> pods: namespace: cfcoutbounddocs, selector: app=cfcoutbounddocs" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-psK~Lggl{2*BD]fxz.!{Xu){s dst -m tcp --dport 3000 -m comment --comment "anywhere -> pods: namespace: kube-monitoring, selector: app=grafana" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-f(o|hf6K([;umS[JRWSHgn/3L dst -m tcp --dport 9090 -m comment --comment "anywhere -> pods: namespace: kube-monitoring, selector: app=prometheus,prometheus=k8s" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-!c9pI_##3e58_C5?kpcLE3U:M dst -m tcp --dport 4040 -m comment --comment "anywhere -> pods: namespace: kube-monitoring, selector: app=weave-scope,weave-scope-component=app" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-EILUbui)olb+;L0.jx^wcdopt src -m set --match-set weave-Q[jhI=WQ~)k6~AgT2mcT$dg[c dst -m tcp --dport 10252 -m comment --comment "namespaces: selector: appId=kube,component=monitoring -> pods: namespace: kube-system, selector: k8s-app=kube-controller-manager" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-QYIlr%V?kjp#HJ7cHYQjMOivC dst -m tcp --dport 4000 -m comment --comment "anywhere -> pods: namespace: screenful, selector: app=screenful" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-QYIlr%V?kjp#HJ7cHYQjMOivC dst -m tcp --dport 80 -m comment --comment "anywhere -> pods: namespace: screenful, selector: app=screenful" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-ST*pe=)mU+~l#ghpA0xl5M[J} dst -m tcp --dport 8080 -m comment --comment "anywhere -> pods: namespace: cfcvisualizer, selector: app=cfcvisualizer" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-ST*pe=)mU+~l#ghpA0xl5M[J} dst -m tcp --dport 8081 -m comment --comment "anywhere -> pods: namespace: cfcvisualizer, selector: app=cfcvisualizer" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-$?1kPZe/X%44ZOmOods4]Ib}N dst -m tcp --dport 8081 -m comment --comment "anywhere -> pods: namespace: kube-monitoring, selector: app=e2etests" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-f(o|hf6K([;umS[JRWSHgn/3L src -m set --match-set weave-$?1kPZe/X%44ZOmOods4]Ib}N dst -m tcp --dport 9102 -m comment --comment "pods: namespace: kube-monitoring, selector: app=prometheus,prometheus=k8s -> pods: namespace: kube-monitoring, selector: app=e2etests" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-EILUbui)olb+;L0.jx^wcdopt src -m set --match-set weave-?%D#jARXI/G:A}z8dn=9/Eq.L dst -m tcp --dport 10251 -m comment --comment "namespaces: selector: appId=kube,component=monitoring -> pods: namespace: kube-system, selector: k8s-app=kube-scheduler" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-QYIlr%V?kjp#HJ7cHYQjMOivC src -m set --match-set weave-5*HDryZlO=b4*738kizjErDH) dst -m tcp --dport 5432 -m comment --comment "pods: namespace: screenful, selector: app=screenful -> pods: namespace: screenful, selector: app=postgres" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-vwjtOZ;zXcva[Wp=L)BmY|SI9 dst -m tcp --dport 8080 -m comment --comment "anywhere -> pods: namespace: kube-monitoring, selector: app=kube-state-metrics" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-f(o|hf6K([;umS[JRWSHgn/3L src -m set --match-set weave-!0?~=:Rt]u)YW7GBSl]BUCc^Z dst -m tcp --dport 9100 -m comment --comment "pods: namespace: kube-monitoring, selector: app=prometheus,prometheus=k8s -> pods: namespace: kube-monitoring, selector: app=node-exporter" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-AP^MH|~0Y8)_jx|h6*p%/F[vH dst -m tcp --dport 443 -m comment --comment "anywhere -> pods: namespace: kube-system, selector: k8s-app=kube-apiserver" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-EILUbui)olb+;L0.jx^wcdopt src -m set --match-set weave-g2L[E!OSd/v~?QhC7d{GUA2^[ dst -m tcp --dport 10054 -m comment --comment "namespaces: selector: appId=kube,component=monitoring -> pods: namespace: kube-system, selector: k8s-app=kube-dns" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-EILUbui)olb+;L0.jx^wcdopt src -m set --match-set weave-g2L[E!OSd/v~?QhC7d{GUA2^[ dst -m tcp --dport 10055 -m comment --comment "namespaces: selector: appId=kube,component=monitoring -> pods: namespace: kube-system, selector: k8s-app=kube-dns" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-g2L[E!OSd/v~?QhC7d{GUA2^[ dst -m tcp --dport 53 -m comment --comment "anywhere -> pods: namespace: kube-system, selector: k8s-app=kube-dns" -j ACCEPT
-A WEAVE-NPC-INGRESS -p udp -m set --match-set weave-g2L[E!OSd/v~?QhC7d{GUA2^[ dst -m udp --dport 53 -m comment --comment "anywhere -> pods: namespace: kube-system, selector: k8s-app=kube-dns" -j ACCEPT
-A WEAVE-NPC-INGRESS -p tcp -m set --match-set weave-fMg_;:uQnd6|.+W$[ZPB/K2B] dst -m tcp --dport 8082 -m comment --comment "anywhere -> pods: namespace: kube-system, selector: k8s-app=heapster" -j ACCEPT
COMMIT

bboreham commented 6 years ago

Is this only on one node or on several?

That path should be mapped (by the DaemonSet) from /var/lib/weave - could you run lsof on the host and see if something else has the file open?

dogopupper commented 6 years ago

several nodes:

weave-net-4cr6k                                                   1/2       CrashLoopBackOff    822        3d
weave-net-98wv4                                                   2/2       Running             45         3d
weave-net-9mj8d                                                   1/2       Running             40         3d
weave-net-cdjzb                                                   1/2       Running             18         3d
weave-net-f6tzq                                                   2/2       Running             1          3d
weave-net-lkzn7                                                   1/2       CrashLoopBackOff    823        3d
weave-net-rtnwm                                                   2/2       Running             2          3d
weave-net-tdpfm                                                   1/2       CrashLoopBackOff    57         3d
weave-net-v9fcx                                                   2/2       Running             3          3d
weave-net-x9cjn                                                   2/2       Running             4          3d
weave-net-xrjgb                                                   1/2       CrashLoopBackOff    821        3d

Only the ones showing 2/2 are working, about 40% of them.

It appears the weavedb volume is bound to /var/lib/weave-kube, and the weave container is named weave-kube as well:

  Volumes:
   weavedb:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/weave-kube

lsof on a node with healthy weave shows multiple entries of:

weaver    19727 19892             root  mem-W     REG                8,9              1058458 /weavedb/weave-netdata.db (stat: No such file or directory)
weaver    19727 19892             root   11uW     REG                8,9     65536    1058458 /weavedb/weave-netdata.db

lsof on a node with unhealthy weave shows the same as above, plus additional entries of:

weaveutil 23488                   root    3u      REG                8,9     65536     534136 /weavedb/weave-netdata.db
weaveutil 23488   976             root    3u      REG                8,9     65536     534136 /weavedb/weave-netdata.db
weaveutil 23488  2558             root    3u      REG                8,9     65536     534136 /weavedb/weave-netdata.db
weaveutil 23488  3413             root    3u      REG                8,9     65536     534136 /weavedb/weave-netdata.db
weaveutil 23488 23489             root    3u      REG                8,9     65536     534136 /weavedb/weave-netdata.db

What is weaveutil? It seems like the culprit here.

bboreham commented 6 years ago

weaveutil is where we put various bits of logic we moved from shell-script to Go. In every case it should run briefly and then exit.

It does open weave-netdata.db to fetch an ID, but again that should be very brief.
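
For context, the "timeout" in that log line is what boltdb reports when it cannot take its exclusive file lock within the configured timeout, i.e. when another process already has the database open. A minimal sketch of that failure mode, assuming a standard boltdb open (the path and the 10s timeout here are illustrative, not the actual weave code):

```go
package main

import (
	"log"
	"time"

	"github.com/boltdb/bolt"
)

func main() {
	// Bolt takes an exclusive flock on the database file. If another process
	// already holds that lock, Open blocks for Options.Timeout and then fails
	// with "timeout" -- the same wording seen in the weave container logs.
	db, err := bolt.Open("/weavedb/weave-netdata.db", 0640, &bolt.Options{Timeout: 10 * time.Second})
	if err != nil {
		log.Fatalf("[boltDB] Unable to open /weavedb/weave-netdata.db: %v", err)
	}
	defer db.Close()
}
```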

Also I would not expect more than one of them at a time; could be an artefact of Kubernetes crash-loop restart?

Since it's written in Go, sending it a SIGQUIT should make it dump a stack trace to stderr then exit. That stderr should then be in the weave container logs. Please try this and see what you get. Probably the oldest weaveutil on a host is the most interesting one.
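
To make the SIGQUIT suggestion concrete: the Go runtime's default reaction to an uncaught SIGQUIT is to print every goroutine's stack to stderr and exit. A hypothetical sketch (not weave code) of producing the same dump by hand, using SIGUSR1 so the process keeps running afterwards:

```go
package main

import (
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

// The Go runtime already prints all goroutine stacks and exits on an uncaught
// SIGQUIT; this helper does the equivalent dump on SIGUSR1 without exiting.
func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			buf := make([]byte, 1<<20)
			n := runtime.Stack(buf, true) // true = include every goroutine
			os.Stderr.Write(buf[:n])
		}
	}()
	select {} // block forever; `kill -USR1 <pid>` triggers a dump
}
```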

dogopupper commented 6 years ago

kubectl logs -c weave and docker logs both show only the 'cannot open weavedb' line. How do I find the stack trace?

bboreham commented 6 years ago

It would be in the log of the container where weaveutil was running. Maybe something else is going on - could you be running another instance of weaver (the daemon itself) on that machine?

dogopupper commented 6 years ago

there doesn't seem to be any weaver daemon running in the nodes...

dogopupper commented 6 years ago

bboreham commented 6 years ago

You're grepping the binary - the control plane for the Weave Network. The binary contains all possible error messages.

there doesn't seem to be any weaver daemon running in the nodes...

You showed weaver in the output of lsof above. It would be visible with a ps -eaf or similar run on the host.

dogopupper commented 6 years ago

Only one weaver is running:

$ ps -eaf | grep weaver
root 8532 1 6 Feb22 ? 05:39:56 /home/weave/weaver --port=6783 --datapath=datapath --name=46:d9:39:96:93:ef --host-root=/host --http-addr=127.0.0.1:6784 --status-addr=0.0.0.0:6782 --docker-api= --no-dns --db-prefix=/weavedb/weave-net --ipalloc-range=172.31.0.0/17 --nickname=devtoolskubernetes-kubernetes-cr1-4-1519310384 --ipalloc-init consensus=12 --conn-limit=100 --expect-npc 10.118.10.92 10.118.10.93 10.118.10.94 10.118.10.88 10.118.10.108 10.118.10.109 10.118.10.110 10.118.10.111 10.118.10.104 10.118.10.105 10.118.10.107 10.118.10.106

Non-working weaver:

$ sudo cat /proc/8532/stack
[] __refrigerator+0x73/0x160
[] get_signal+0x5c6/0x5d0
[] do_signal+0x36/0x610
[] exit_to_usermode_loop+0x71/0xb0
[] do_syscall_64+0xe9/0x1c0
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0xffffffffffffffff

Working weaver:

[] skb_wait_for_more_packets+0x103/0x160
[] skb_recv_datagram+0x6a/0xc0
[] skb_recv_datagram+0x3f/0x60
[] netlink_recvmsg+0x57/0x3e0
[] SYSC_recvfrom+0xc3/0x130
[] do_syscall_64+0x59/0x1c0
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0xffffffffffffffff

bboreham commented 6 years ago

The stack trace of interest is in weaveutil - it's holding the file open when it ought to exit very quickly. weaver has printed an error message so we know what it thinks. (Although I would also expect weaver to exit after that error).

Next, the trouble with a raw stacktrace from the process is that Go multiplexes different parts of the program ("goroutines") onto the same OS thread (LWP). I guess it's possible if you printed out stacktraces from all the LWPs in the process one of them might show something interesting.

dogopupper commented 6 years ago

There doesn't seem to be a weaveutil running now on either the working or non-working nodes:

core@devtoolskubernetes-kubernetes-cr2-4-1519307879 ~ $ ps -eaf | grep weave
root 5858 5841 0 Feb22 ? 00:00:00 /home/weave/runsvinit
root 9443 9425 0 Feb22 ? 00:09:24 /usr/bin/weave-npc
root 10954 10935 0 Feb22 ? 00:00:00 /bin/sh /home/weave/launch.sh
root 11015 10954 4 Feb22 ? 04:44:55 /home/weave/weaver --port=6783 --datapath=datapath --name=a6:06:dd:ed:09:9c --host-root=/host --http-addr=127.0.0.1:6784 --status-addr=0.0.0.0:6782 --docker-api= --no-dns --db-prefix=/weavedb/weave-net --ipalloc-range=172.31.0.0/17 --nickname=devtoolskubernetes-kubernetes-cr2-4-1519307879 --ipalloc-init consensus=10 --conn-limit=100 --expect-npc 10.118.10.92 10.118.10.93 10.118.10.94 10.118.10.88 10.118.10.89 10.118.10.90 10.118.10.91 10.118.10.104 10.118.10.105 10.118.10.106
core 13654 11847 0 13:25 pts/0 00:00:00 grep --colour=auto weave

I'm now getting the following entries with lsof:

core@devtoolskubernetes-kubernetes-cr1-3-1519310384 ~ $ sudo lsof | grep weavedb
lsof: no pwd entry for UID 100

I guess that's printed in place of the weaveutil entries?

Also, how do I print stack traces from all LWPs?

bboreham commented 6 years ago

lsof: no pwd entry for UID 100

That's a new one on me.

how do I print stack traces from all LWPs?

LWPs are pretty much processes that share the same address space. You can list them with ps -eLf for instance, and then cat /proc/pid/stack as before.
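
If it helps, the same thing can be scripted; a throwaway sketch (not part of weave) that walks /proc/<pid>/task/*/stack and prints the kernel stack of every LWP, which needs root just like the sudo cat above:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Walks /proc/<pid>/task/*/stack and prints the kernel stack of every LWP in
// the process. Needs root, just like `sudo cat /proc/<pid>/stack`.
func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: lwpstacks <pid>")
		os.Exit(1)
	}
	paths, err := filepath.Glob(filepath.Join("/proc", os.Args[1], "task", "*", "stack"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", p, err)
			continue
		}
		fmt.Printf("=== %s ===\n%s\n", p, data)
	}
}
```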

Could you try rebooting one of the non-working nodes? If it's a transient thing that ought to shake it loose, and if it fails again that would also be interesting.

dogopupper commented 6 years ago

I rebooted the non-working nodes; weave is now running on all of them and the cluster is back to a healthy state.

If it shows up again I'll paste the stack traces here. Thanks for the help!

dogopupper commented 6 years ago

It's happening again on a few nodes:

$ ps -elF | grep weave
4 S root       324 32700  0  80   0 - 110460 -      9468   6 12:49 ?        00:00:03 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok

$ sudo cat /proc/324/stack
[<ffffffff890f6d27>] futex_wait_queue_me+0xc7/0x120
[<ffffffff890f7546>] futex_wait+0xf6/0x250
[<ffffffff890f978f>] do_futex+0x10f/0xb10
[<ffffffff890fa211>] SyS_futex+0x81/0x190
[<ffffffff89003949>] do_syscall_64+0x59/0x1c0
[<ffffffff89800115>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

Looks like a deadlock?

Most other weave-net processes have the same stack trace; a few have this variation:

 $ sudo cat /proc/2896/stack
[<ffffffff89262f13>] ep_poll+0x2f3/0x3b0
[<ffffffff892647f9>] SyS_epoll_wait+0xb9/0xd0
[<ffffffff89003949>] do_syscall_64+0x59/0x1c0
[<ffffffff89800115>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

The rest of the ps -elF output:

4 S root      2896  6225  0  80   0 - 64229 -      10312   6 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
0 S core     15347 15015  0  80   0 -  1684 -        908   7 13:22 pts/0    00:00:00 grep --colour=auto weave
4 S root     16580  6225  0  80   0 - 30028 -       8360   0 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16586  6225  0  80   0 - 29348 -       8360   2 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16591  6225  0  80   0 - 29412 -       8240   0 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16597  6225  0  80   0 - 45796 -       8244   5 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16601  6225  0  80   0 - 27299 -       8196   5 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16603  6225  0  80   0 - 46412 -       8228   1 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16608  6225  0  80   0 - 64845 -      10400   3 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16617  6225  0  80   0 - 29348 -       8300   4 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16621  6225  0  80   0 - 64229 -       8256   2 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16742  6225  0  80   0 - 64229 -      10340   0 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16749  6225  0  80   0 - 47781 -      10336   5 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16759  6225  0  80   0 - 64229 -      10284   2 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     16953  6225  0  80   0 - 47845 -       8196   7 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     17014  6225  0  80   0 - 64581 -       8244   7 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     17125 17103  0  80   0 - 208862 -     59820   1 Feb26 ?        00:01:47 /usr/bin/weave-npc
4 S root     19040  6225  0  80   0 - 64229 -       8296   0 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     19047  6225  0  80   0 - 27299 -       8236   4 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     19311  6225  0  80   0 - 64493 -       8240   0 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     19659  6225  0  80   0 - 47845 -       8192   7 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     19712  6225  0  80   0 - 48133 -       8228   7 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 D root     22097     1  0  80   0 - 239040 -     44460   7 Feb26 ?        00:00:18 /home/weave/weaver --port=6783 --datapath=datapath --name=9a:60:a6:d9:ce:a8 --host-root=/host --http-addr=127.0.0.1:6784 --status-addr=0.0.0.0:6782 --docker-api= --no-dns --db-prefix=/weavedb/weave-net --ipalloc-range=172.31.0.0/17 --nickname=devtoolskubernetes-kubernetes-cr0-1-1519311725 --ipalloc-init consensus=11 --conn-limit=100 --expect-npc 10.118.10.112 10.118.10.113 10.118.10.114 10.118.10.108 10.118.10.109 10.118.10.110 10.118.10.111 10.118.10.104 10.118.10.105 10.118.10.107 10.118.10.106
4 S root     24234  6225  0  80   0 - 64229 -      10344   7 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     30085  6225  0  80   0 - 47845 -       8288   1 Feb26 ?        00:00:00 /opt/weave-net/bin/weave-net
4 S root     32700 32681  0  80   0 -   386 -        892   7 12:49 ?        00:00:00 /bin/sh /home/weave/launch.sh
bboreham commented 6 years ago

/usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok

That's useful - I had thought it was a different execution of weaveutil.

I'm not clear what that "the rest" list shows.

To make progress I really need to know how we get two processes on the same machine that are both trying to access weave-netdata.db at the same time. What are those processes and where did they come from (are they both in the same container? If not, who started the different containers?)

Looks like a deadlock?

No, that's just a Go program waiting for something to happen. Stack traces from the OS's perspective do not show what a Go program is actually doing, because the runtime switches different "green threads" (goroutines) around onto the OS threads.
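
To see what the program is actually doing you need the goroutine stacks, which the Go runtime prints when it receives SIGQUIT; the same view is also available programmatically via runtime.Stack. A minimal standalone sketch (not weave code) that dumps all goroutine stacks:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Park a goroutine so there is something to show besides main.
	go func() { time.Sleep(time.Hour) }()
	time.Sleep(100 * time.Millisecond)

	// Dump the stacks of all goroutines - the same view a SIGQUIT produces.
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true)
	fmt.Printf("%s", buf[:n])
}
```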

dogopupper commented 6 years ago

That's the rest of the ps -elF output.

Weave runs as a DaemonSet; kubectl get pods -o wide | grep weave shows distinct nodes. What else could be trying to access weavedb at initialization within the container?

From what I understand the process tree is weaver -> weave-npc -> weaveutil? weave-npc seems to operate normally in all pods.

weaveutil also spawns threads; not sure if this helps:

 $ ps -eLf | grep weavedb
root      6196  6096  6196  0   11 13:17 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096  6197  0   11 13:17 ?        00:00:01 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096  6198  0   11 13:17 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096  6199  0   11 13:17 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096  6200  0   11 13:17 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096 17913  0   11 13:42 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096 17914  0   11 13:42 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096 17915  0   11 13:42 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096 17916  0   11 13:42 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096 18918  0   11 13:44 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
root      6196  6096 18919  0   11 13:44 ?        00:00:00 /usr/bin/weaveutil set-db-flag /weavedb/weave-net peer-reclaim ok
brb commented 6 years ago

Could you give us Go stack traces of the stuck weaveutil process? You can get the traces by:

  1. Run lslocks to get the PID of the process holding the flock(2) lock on weave-netdata.db (faster than with lsof).
  2. In a shell, run strace -e write -s100000 -fp $PID 2> /tmp/weaveutil-strace.
  3. While ^^ is running, run kill -SIGQUIT $PID.
  4. Quit the strace process and upload /tmp/weaveutil-strace.
dogopupper commented 6 years ago

Thanks for the steps - I couldn't figure out how to produce that myself.

I found that there was a hacky init container in place that reset the weavedb before starting weave; it was a workaround for a bug that was recently fixed. I've now rebooted the nodes and removed the init container, so I won't have a stuck weaveutil until it's hit again - maybe that init container was the problem.

I'll update with the stack traces if it's hit again.

dogopupper commented 6 years ago

Hit the bug again, got the strace this time ^^

let me know if it's sufficient.

core@devtoolskubernetes-kubernetes-cr0-1-1519311725 ~ $ sudo strace -e write -s100000 -fp 522 2> weaveutil-strace
-- ran kill SIGQUIT from another pty --
core@devtoolskubernetes-kubernetes-cr0-1-1519311725 ~ $ cat weaveutil-strace
Process 522 attached with 5 threads
[pid   522] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=4257, si_uid=0} ---
[pid   522] write(2, "SIGQUIT: quit", 13) = 13
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "PC=", 3)          = 3
[pid   522] write(2, "0x457233", 8)     = 8
[pid   522] write(2, " m=", 3)          = 3
[pid   522] write(2, "0", 1)            = 1
[pid   522] write(2, " sigcode=", 9)    = 9
[pid   522] write(2, "0", 1)            = 1
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "goroutine ", 10)  = 10
[pid   522] write(2, "6", 1)            = 1
[pid   522] write(2, " [", 2)           = 2
[pid   522] write(2, "syscall", 7)      = 7
[pid   522] write(2, "]:\n", 3)         = 3
[pid   522] write(2, "runtime.notetsleepg", 19) = 19
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, "0xc74958", 8)     = 8
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x2faeea8", 9)    = 9
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x16", 4)         = 4
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/runtime/lock_futex.go", 39) = 39
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "205", 3)          = 3
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x42", 4)         = 4
[pid   522] write(2, " fp=", 4)         = 4
[pid   522] write(2, "0xc42002c760", 12) = 12
[pid   522] write(2, " sp=", 4)         = 4
[pid   522] write(2, "0xc42002c730", 12) = 12
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "runtime.timerproc", 17) = 17
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/runtime/time.go", 33) = 33
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "209", 3)          = 3
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x327", 5)        = 5
[pid   522] write(2, " fp=", 4)         = 4
[pid   522] write(2, "0xc42002c7e0", 12) = 12
[pid   522] write(2, " sp=", 4)         = 4
[pid   522] write(2, "0xc42002c760", 12) = 12
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "runtime.goexit", 14) = 14
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/runtime/asm_amd64.s", 37) = 37
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "2197", 4)         = 4
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x1", 3)          = 3
[pid   522] write(2, " fp=", 4)         = 4
[pid   522] write(2, "0xc42002c7e8", 12) = 12
[pid   522] write(2, " sp=", 4)         = 4
[pid   522] write(2, "0xc42002c7e0", 12) = 12
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "created by ", 11) = 11
[pid   522] write(2, "runtime.addtimerLocked", 22) = 22
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/runtime/time.go", 33) = 33
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "116", 3)          = 3
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0xed", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "goroutine ", 10)  = 10
[pid   522] write(2, "1", 1)            = 1
[pid   522] write(2, " [", 2)           = 2
[pid   522] write(2, "sleep", 5)        = 5
[pid   522] write(2, "]:\n", 3)         = 3
[pid   522] write(2, "time.Sleep", 10)  = 10
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, "0x2faf080", 9)    = 9
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/runtime/time.go", 33) = 33
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "59", 2)           = 2
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0xf9", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "github.com/weaveworks/weave/vendor/github.com/boltdb/bolt.flock", 63) = 63
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, "0xc4200f8a80", 12) = 12
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x1000001b0", 11) = 11
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x1b0", 5)        = 5
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0xc42000e470", 12) = 12
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/go/src/github.com/weaveworks/weave/vendor/github.com/boltdb/bolt/bolt_unix.go", 78) = 78
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "38", 2)           = 2
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0xdd", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "github.com/weaveworks/weave/vendor/github.com/boltdb/bolt.Open", 62) = 62
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, "0xc42028a3e0", 12) = 12
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x19", 4)         = 4
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x1b0", 5)        = 5
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0xc97b20", 8)     = 8
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x7", 3)          = 3
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0xc42028a3e0", 12) = 12
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x19", 4)         = 4
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/go/src/github.com/weaveworks/weave/vendor/github.com/boltdb/bolt/db.go", 71) = 71
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "181", 3)          = 3
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x173", 5)        = 5
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "github.com/weaveworks/weave/db.NewBoltDB", 40) = 40
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, "0x7fff58a88aef", 14) = 14
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x12", 4)         = 4
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x2", 3)          = 3
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x2", 3)          = 3
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0xc4201a9e80", 12) = 12
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/go/src/github.com/weaveworks/weave/db/boltdb.go", 48) = 48
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "38", 2)           = 2
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x9e", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "main.setDBFlag", 14) = 14
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, "0xc420010110", 12) = 12
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x3", 3)          = 3
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x3", 3)          = 3
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, ", ", 2)           = 2
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/go/src/github.com/weaveworks/weave/prog/weaveutil/db_flag.go", 61) = 61
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "44", 2)           = 2
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0xb7", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "main.main", 9)    = 9
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/go/src/github.com/weaveworks/weave/prog/weaveutil/main.go", 58) = 58
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "86", 2)           = 2
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x27b", 5)        = 5
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "goroutine ", 10)  = 10
[pid   522] write(2, "17", 2)           = 2
[pid   522] write(2, " [", 2)           = 2
[pid   522] write(2, "syscall", 7)      = 7
[pid   522] write(2, ", locked to thread", 18) = 18
[pid   522] write(2, "]:\n", 3)         = 3
[pid   522] write(2, "runtime.goexit", 14) = 14
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/runtime/asm_amd64.s", 37) = 37
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "2197", 4)         = 4
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x1", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "goroutine ", 10)  = 10
[pid   522] write(2, "5", 1)            = 1
[pid   522] write(2, " [", 2)           = 2
[pid   522] write(2, "syscall", 7)      = 7
[pid   522] write(2, "]:\n", 3)         = 3
[pid   522] write(2, "os/signal.signal_recv", 21) = 21
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/runtime/sigqueue.go", 37) = 37
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "116", 3)          = 3
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x104", 5)        = 5
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "os/signal.loop", 14) = 14
[pid   522] write(2, "(", 1)            = 1
[pid   522] write(2, ")\n", 2)          = 2
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/os/signal/signal_unix.go", 42) = 42
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "22", 2)           = 2
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x22", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "created by ", 11) = 11
[pid   522] write(2, "os/signal.init.1", 16) = 16
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "\t", 1)           = 1
[pid   522] write(2, "/usr/local/go/src/os/signal/signal_unix.go", 42) = 42
[pid   522] write(2, ":", 1)            = 1
[pid   522] write(2, "28", 2)           = 2
[pid   522] write(2, " +", 2)           = 2
[pid   522] write(2, "0x41", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rax    ", 7)      = 7
[pid   522] write(2, "0xfffffffffffffffc", 18) = 18
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rbx    ", 7)      = 7
[pid   522] write(2, "0x2faeea8", 9)    = 9
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rcx    ", 7)      = 7
[pid   522] write(2, "0x457233", 8)     = 8
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rdx    ", 7)      = 7
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rdi    ", 7)      = 7
[pid   522] write(2, "0xc74958", 8)     = 8
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rsi    ", 7)      = 7
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rbp    ", 7)      = 7
[pid   522] write(2, "0xc42002c6e8", 12) = 12
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rsp    ", 7)      = 7
[pid   522] write(2, "0xc42002c6a0", 12) = 12
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "r8     ", 7)      = 7
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "r9     ", 7)      = 7
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "r10    ", 7)      = 7
[pid   522] write(2, "0xc42002c6d8", 12) = 12
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "r11    ", 7)      = 7
[pid   522] write(2, "0x202", 5)        = 5
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "r12    ", 7)      = 7
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "r13    ", 7)      = 7
[pid   522] write(2, "0x8", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "r14    ", 7)      = 7
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "r15    ", 7)      = 7
[pid   522] write(2, "0xf3", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rip    ", 7)      = 7
[pid   522] write(2, "0x457233", 8)     = 8
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "rflags ", 7)      = 7
[pid   522] write(2, "0x202", 5)        = 5
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "cs     ", 7)      = 7
[pid   522] write(2, "0x33", 4)         = 4
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "fs     ", 7)      = 7
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   522] write(2, "gs     ", 7)      = 7
[pid   522] write(2, "0x0", 3)          = 3
[pid   522] write(2, "\n", 1)           = 1
[pid   524] <... futex resumed> )       = ? <unavailable>
[pid   526] <... futex resumed> )       = ? <unavailable>
[pid   524] +++ exited with 2 +++
[pid   526] +++ exited with 2 +++
[pid   523] +++ exited with 2 +++
[pid   525] +++ exited with 2 +++
+++ exited with 2 +++

lslocks didn't have any reference to weaveutil or weavedb:

core@devtoolskubernetes-kubernetes-cr0-1-1519311725 ~ $ lslocks
COMMAND           PID   TYPE SIZE MODE  M START END PATH
weaver          14741  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/b8205c3e-6170-4ba5-b472-867bf20f13e4/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/hosts...
etcd             6053  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/b8205c3e-6170-4ba5-b472-867bf20f13e4/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/hosts...
(unknown)          -1 OFDLCK      READ  0     0   0
kubelet         19446  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/b8205c3e-6170-4ba5-b472-867bf20f13e4/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/hosts...
(unknown)          -1 OFDLCK      READ  0     0   0 /var/lib/rkt/pods/run/b8205c3e-6170-4ba5-b472-867bf20f13e4/stage1/rootfs/opt/stage2/hyperkube/rootfs/dev...
locksmithd       4924  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/b8205c3e-6170-4ba5-b472-867bf20f13e4/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/resolv.conf...
kubelet         19446  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/b8205c3e-6170-4ba5-b472-867bf20f13e4/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/resolv.conf...
dockerd         21960  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/b8205c3e-6170-4ba5-b472-867bf20f13e4/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/hosts...
dogopupper commented 6 years ago

or: https://pastebin.com/kpXBa25J

brb commented 6 years ago

Thanks for the strace. I see that the weaveutil process is sleeping after an unsuccessful attempt to acquire the lock: https://github.com/boltdb/bolt/blob/c6ba97b89e0454fec9aa92e1d33a4e2c5fc1f631/bolt_unix.go#L38
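
For context, the linked flock helper is roughly the loop below - a simplified paraphrase for illustration, not the verbatim vendored code. The time.Sleep(0x2faf080) (50ms) visible in your strace is its retry sleep, and the timeout branch is where the "Unable to open ...: timeout" error ultimately comes from:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
	"time"
)

// waitForFlock paraphrases bolt's flock helper: retry a non-blocking
// exclusive flock until it succeeds or the timeout expires.
func waitForFlock(f *os.File, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
		if err == nil {
			return nil
		} else if err != syscall.EWOULDBLOCK {
			return err
		}
		// Timed out waiting for whoever holds the lock to release it.
		if timeout != 0 && time.Now().After(deadline) {
			return errors.New("timeout")
		}
		// Sleep 50ms and retry - this is the sleep seen in the strace.
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	// Demo against a throwaway path, not the real weave DB.
	f, err := os.OpenFile("/tmp/flock-demo.db", os.O_CREATE|os.O_RDWR, 0600)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fmt.Println(waitForFlock(f, 5*time.Second))
}
```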

From the lslocks output it looks like you run everything in rkt containers. Is that correct?

dogopupper commented 6 years ago

Yes, everything is in rkt, including the k8s control plane itself - it's self-hosted / masterless.

k8s pods run with docker.

Is there any chance docker/containerd might be the culprit here?

brb commented 6 years ago

My suspicion is that flock(2) is not working on your setup, as neither weaver nor weaveutil is able to grab the lock.
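
If it helps, here's a small standalone sanity check for flock(2) on the filesystem behind /weavedb - a sketch only; the test path is hypothetical, any file on the same mount will do. On a healthy filesystem the second, independent open must fail to take the lock with EWOULDBLOCK:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Hypothetical test path - use any file on the same mount as /weavedb.
	path := "/weavedb/flock-test"

	f1, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0600)
	if err != nil {
		panic(err)
	}
	defer f1.Close()
	// The first exclusive lock should always succeed.
	if err := syscall.Flock(int(f1.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		panic(err)
	}

	// A second, independent open of the same file must NOT get the lock.
	f2, err := os.OpenFile(path, os.O_RDWR, 0600)
	if err != nil {
		panic(err)
	}
	defer f2.Close()
	err = syscall.Flock(int(f2.Fd()), syscall.LOCK_EX|syscall.LOCK_NB)
	fmt.Println("second flock attempt:", err) // expect "resource temporarily unavailable"
}
```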

Yes, everything is in rkt

Does it mean that Docker is running on top of rkt?

Also, is weaver running inside a Docker container or a rkt container?

dogopupper commented 6 years ago

Hmm, kubelet runs within rkt, and all pods run in Docker - weaver is running in Docker. The hyperkube images that run the rest of the k8s control plane (scheduler, etc.) also run in Docker, managed by k8s.

On the CoreOS VM:

 $ rkt list
UUID            APP             IMAGE NAME                                              STATE   CREATED         STARTED         NETWORKS
49125d7a        hyperkube       quay.io/coreos/hyperkube:v1.8.7_coreos.0                running 4 days ago      4 days ago
5d584892        etcd            quay.docker.tech.lastmile.com/coreos/etcd:v2.3.8        running 4 days ago      4 days ago

-- etcd here is a proxy that speaks to the actual etcd nodes; those are separate --

 $ docker ps | grep weave-net
d16e9dab2e03        hub.docker.tech.lastmile.com/weaveworks/weave-kube@sha256:2e37d2ea4fea7fa624bf46c42facf4afe4303a27ab1a1a84838e08a66898e610        

$ docker exec -it d16e9dab2e03 "/bin/sh"
/home/weave # ls
kube-peers        launch.sh         restart.sentinel  weave             weaver
/home/weave # lslocks
COMMAND           PID   TYPE SIZE MODE  M START END PATH
kubelet          6564  FLOCK   0B WRITE 0     0   0 /var/lib/rkt/pods/run/49125d7a-4595-4598-aa3f-e4fe4224427b/stage1/rootfs/opt/stage2/hyperkube/rootfs/run/lock/kubelet.lock
nrsysmond       16589  POSIX   0B WRITE 0     0   0 /sys/fs/cgroup/cpuset
(unknown)          -1 OFDLCK   0B READ  0     0   0
kubelet          6564  FLOCK   4K WRITE 0     0   0 /var/lib/rkt/pods/run/49125d7a-4595-4598-aa3f-e4fe4224427b
dockerd           911  FLOCK  32K WRITE 0     0   0 /var/lib/docker/volumes/metadata.db
weaver          11718  FLOCK  64K WRITE 0     0   0 /weavedb/weave-netdata.db
runsv            8774  FLOCK   0B WRITE 0     0   0 /sys/fs/cgroup/cpuset
locksmithd       5778  FLOCK  45B WRITE 0     0   0 /run/update-engine/coordinator.conf
etcd             6381  FLOCK   4K WRITE 0     0   0 /var/lib/rkt/pods/run/5d584892-6b2d-4ba1-a497-453e8083b32c
(unknown)          -1 OFDLCK   0B READ  0     0   0
(unknown)          -1 OFDLCK   0B READ  0     0   0
runsv            8773  FLOCK   0B WRITE 0     0   0 /sys/fs/cgroup/cpuset

That's the lslocks output from within the weave Docker container - it shows locks into rkt kubelet land, like /var/lib/rkt/pods/run/49125d7a-4595-4598-aa3f-e4fe4224427b/stage1/rootfs/opt/stage2/hyperkube/rootfs/run/lock/kubelet.lock

Perhaps something's going wrong between all of these; it doesn't look like the simplest setup overall. I didn't create this, so I don't know much about the whys and hows.

rajatjindal commented 6 years ago

We are running into the same issue as well after updating our cluster a few days back. In our case only one node is the problem; all other nodes work just fine.

Our setup is:

Kubernetes: 1.8.5 on AWS (deployed using kops 1.8.1)
Docker version: 1.13.1
Ubuntu: 16.04.3 LTS
Weave: 2.2.0

output from strace command:

strace -e write -s100000 -fp 8393
strace: Process 8393 attached with 17 threads
[pid  8608] write(2, "WARN: 2018/03/02 04:30:33.558934 [allocator]: Delete: no addresses for e3cc08d8afd4c957e7c26ade74f8204955633df4de256360f8555e6004d66936\n", 136

bboreham commented 6 years ago

@rajatjindal please open a new issue and post the full logs.

brb commented 6 years ago

@dogopupper

  1. What overlay fs is used for rkt and Docker containers?
  2. When the issue hits again, could you run on the affected node (not from the weave container): ps -eLf > /tmp/foo && lslocks >> /tmp/foo && lsof >> /tmp/foo and upload /tmp/foo?
dogopupper commented 6 years ago

Docker uses overlay (v1); rkt is vanilla/unconfigured as-is with CoreOS - the CoreOS site says it defaults to 'overlayfs'.

overlay2 should be better, I suppose?

I'll post the outputs when they're available.

Godley commented 6 years ago

The issue came back; here's the output from the command you requested on one unhealthy node. There are 6/11 nodes crashing at the moment. output.txt

brb commented 6 years ago

Thanks! Could you run lsof as root?

Godley commented 6 years ago

I'm going to leave this to @dogopupper; he's more up to speed on this than I am.

dogopupper commented 6 years ago

Got all outputs this time.

from lslocks:

COMMAND           PID   TYPE SIZE MODE  M START END PATH
kubelet          3741  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/9ddf2c83-0331-4cb3-a7da-a37385903322/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/resolv.conf...
dockerd         14880  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/9ddf2c83-0331-4cb3-a7da-a37385903322/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/docker/overlay...
(unknown)          -1 OFDLCK      READ  0     0   0
kubelet          3741  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/9ddf2c83-0331-4cb3-a7da-a37385903322/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/docker/overlay...
(unknown)          -1 OFDLCK      READ  0     0   0 /var/lib/rkt/pods/run/9ddf2c83-0331-4cb3-a7da-a37385903322/stage1/rootfs/opt/stage2/hyperkube/rootfs/dev...
locksmithd       5528  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/9ddf2c83-0331-4cb3-a7da-a37385903322/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/resolv.conf...
etcd             6160  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/9ddf2c83-0331-4cb3-a7da-a37385903322/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/docker/overlay...
flock           15886  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/9ddf2c83-0331-4cb3-a7da-a37385903322/stage1/rootfs/opt/stage2/hyperkube/rootfs/etc/resolv.conf...
weaver          28520  FLOCK      WRITE 0     0   0 /var/lib/rkt/pods/run/9ddf2c83-0331-4cb3-a7da-a37385903322/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/docker/overlay...
(unknown)          -1 OFDLCK      READ  0     0   0

The locks on docker/overlay and etc/resolv.conf within the rkt hyperkube (kubelet) container are interesting, and they appear on all the malfunctioning nodes where weave gets into a crash loop (error syncing pod, probes failing, etc.) - the logs of the weave container show the same symptom as the title.

That whole rkt-land path and the docker overlay within rkt look interesting. The rkt container itself does not have a docker CLI; however, by manually grepping the contents of /var/lib/docker/containers from within the rkt kubelet container, the contents were indeed the filesystems of the containers that run in pods in Kubernetes itself. That is, weaver lives in /var/lib/rkt/pods/run/*hyperkube-kubelet-container*/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/docker/overlay/*podnumber*/home/weave/weaver.

The docker service itself runs on the node.

Observing the behaviour today, the weaveutil lock comes and goes but the crash loop remains - it doesn't seem to be stuck on weaveutil. There was previously massive I/O wait on the etcd nodes, which might have contributed in general, I think. I patched etcd by increasing the heartbeat intervals and upgrading to the latest 3.3.1; however, there is indeed slow I/O on this cluster.

Here's the full output I got a few hours ago (lsof run as root, weaveutil present): outputsfull.txt

rajatjindal commented 6 years ago

In our case it turned out Prometheus was consuming all the resources available on that node (potentially because we had configured a remote write on Prometheus to an endpoint which was down, and thus Prometheus was queuing up).

Once I disabled remote write, everything has been working since.

Edited:

Found this issue in the Prometheus issue list: https://github.com/prometheus/prometheus/issues/3809

bboreham commented 6 years ago

@rajatjindal that's a great insight - we sometimes see Prometheus chewing up all memory on other clusters, but we don't really understand the mechanism since it's supposed to drop data when the queue gets over a certain size.

Did you log an issue with Prometheus?

dogopupper commented 6 years ago

Resource utilization is very low in this specific cluster - about 2-20% CPU and 5-30% memory. There's an occasional 2-15% I/O wait on the k8s nodes every 30 seconds or so.

dogopupper commented 6 years ago

I'll post some systemd units from the node configuration that might be related:

    - name: kubelet.service
      enabled: true
      contents: |
        [Service]
        EnvironmentFile=-/run/metadata/coreos
        EnvironmentFile=/etc/docker-environment
        Environment="RKT_RUN_ARGS= \
          --uuid-file-save=/var/cache/kubelet-pod.uuid \
          --volume=etc-cni,kind=host,source=/etc/cni,readOnly=true \
          --mount volume=etc-cni,target=/etc/cni \
          --volume=opt-cni,kind=host,source=/opt/cni,readOnly=true \
          --mount volume=opt-cni,target=/opt/weave-net \
          --volume=resolv,kind=host,source=/etc/resolv.conf,readOnly=true \
          --mount volume=resolv,target=/etc/resolv.conf \
          --volume=var-log,kind=host,source=/var/log,readOnly=false \
          --mount volume=var-log,target=/var/log \
        "
        Environment=IPTABLES_LOCK_FILE=/run/xtables.lock
        Environment=KUBELET_ACI=quay.io/coreos/hyperkube
        Environment=KUBELET_VERSION={{kubernetes_version}}
        # iptables lock file must be created prior to containers trying to mount it
        ExecStartPre=/usr/bin/touch ${IPTABLES_LOCK_FILE}
        ExecStartPre=/usr/bin/chmod 0600 ${IPTABLES_LOCK_FILE}
        ExecStartPre=/usr/bin/chown root:root ${IPTABLES_LOCK_FILE}
        ExecStartPre=/bin/mkdir -p /etc/cni
        ExecStartPre=/bin/mkdir -p /opt/cni
        ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
        ExecStartPre=/bin/mkdir -p /srv/kubernetes/manifests
        ExecStartPre=/bin/mkdir -p /etc/kubernetes/checkpoint-secrets
        ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/cache/kubelet-pod.uuid
        # Download a copy of the aci from our local mirror, due to really slow http_proxy
        {%- if cloud_provider == 'openstack' %}
        ExecStartPre=/bin/rkt fetch --insecure-options=image P_S3_ENDPOINT_URL/kubernetes-bootstrap/acis/hyperkube/{{kubernetes_version}}.aci
        {% endif %}
        ExecStart=/bin/bash -c '/usr/lib/coreos/kubelet-wrapper \
          --kubeconfig=/etc/kubernetes/kubeconfig \
          --require-kubeconfig \
          --lock-file=/var/run/lock/kubelet.lock \
          --exit-on-lock-contention \
          --pod-manifest-path=/etc/kubernetes/manifests \
          --allow-privileged \
          {%- if cloud_provider == 'openstack' %}
          --cloud-provider=openstack \
          {%- else %}
          --cloud-provider=gce \
          {%- endif %}
          --cloud-config=/etc/kubernetes/cloud_config \
          --node-labels=master=true \
          --minimum-container-ttl-duration=6m0s \
          --cluster_dns=172.31.128.2 \
          --cluster_domain=cluster.local \
          --network-plugin=cni \
          --pod-infra-container-image=${GCR_DOCKER_MIRROR}/google_containers/pause-amd64:3.0 \
          --enforce-node-allocatable=pods \
          --kube-reserved=cpu=250m,memory=1Gi,ephemeral-storage=20Gi \
          --system-reserved=cpu=100m,memory=600Mi,ephemeral-storage=2Gi \
          --eviction-hard=memory.available\\<500Mi,nodefs.available\\<10% \
          '
        ExecStop=-/usr/bin/rkt stop --uuid-file=/var/cache/kubelet-pod.uuid
        Restart=always
        RestartSec=10
        Slice=podruntime.slice

    - name: podruntime.slice
      contents: |
        [Unit]
        Description=Pod Runtime Slice
        Documentation=https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md#recommended-cgroups-setup

    - name: containerd.service
      dropins:
      - name: 50-containerd-opts.conf
        contents: |
          [Service]
          Slice=podruntime.slice

    - name: docker.service
      dropins:
      - name: 50-docker-opts.conf
        contents: |
          [Unit]
          After=update-ca-certificates.service
          Requires=update-ca-certificates.service
          [Service]
          {%- if cloud_provider == 'openstack' %}
          Environment=http_proxy=http://proxy.ocado.com:8080
          Environment=https_proxy=http://proxy.ocado.com:8080
          Environment=no_proxy=.ocado.com,.lastmile.com
          {%- endif %}
          # disabling selinux for now due to moby/moby#32892
          Environment="DOCKER_OPTS=--log-opt max-size={{log_size}} --log-opt max-file=2 --bip=172.31.255.1/24 --storage-driver=overlay2 --selinux-enabled=false"
          Slice=podruntime.slice

  - filesystem: "root"
    path: /var/lib/docker-healthcheck
    mode: 0755
    contents:
      inline: |
        #!/bin/bash
        # Taken from https://github.com/kubernetes/kops/blob/2b723b2a649402b00f45565158011384674449f4/vendor/k8s.io/kubernetes/cluster/saltbase/salt/docker/docker-healthcheck
        # This script is intended to be run periodically, to check the health
        # of docker.  If it detects a failure, it will restart docker using systemctl.
        set -x
        return_code=1
        if timeout 10 docker ps > /dev/null; then
            echo "docker healthcheck script started - docker ok."
            return_code=0
        else
            echo "docker failed"
            echo "Giving docker 30 seconds grace before restarting"
            sleep 30

            if timeout 10 docker ps > /dev/null; then
                echo "docker recovered"
                return_code=0
            else
                echo "docker still down; triggering docker restart"
                timeout 60 systemctl stop docker
                timeout 20 systemctl stop -f docker
                timeout 60 systemctl stop containerd
                timeout 20 systemctl stop -f containerd
                timeout 60 systemctl start containerd
                timeout 60 systemctl start docker
                echo "Waiting 60 seconds to give docker time to start"
                sleep 60
                if timeout 10 docker ps > /dev/null; then
                    echo "docker recovered"
                    return_code=0
                else
                    echo "docker still failing"
                fi
            fi
        fi
        echo "docker healthcheck complete (return code: $return_code)"
        exit $return_code
rajatjindal commented 6 years ago

@dogopupper I think another issue here might be that, if at some point the system was under load and that impacted weave, weave might have failed to recover from that state?

mikebryant commented 6 years ago

devtoolskubernetes-kubernetes-cr0-1-1519311725.txt

There are some interesting features here (excerpts from the above log):

A weaver process that's been reparented to init:

root     16424     1 16424  0   26 Mar07 ?        00:00:00 /home/weave/weaver --port=6783 --datapath=datapath --name=9a:60:a6:d9:ce:a8 --host-root=/host --http-addr=127.0.0.1:6784 --status-addr=0.0.0.0:67
82 --docker-api= --no-dns --db-prefix=/weavedb/weave-net --ipalloc-range=172.31.0.0/17 --nickname=devtoolskubernetes-kubernetes-cr0-1-1519311725 --ipalloc-init consensus=11 --conn-limit=100 --expect-npc 1
0.118.10.112 10.118.10.113 10.118.10.114 10.118.10.108 10.118.10.109 10.118.10.110 10.118.10.111 10.118.10.104 10.118.10.105 10.118.10.107 10.118.10.106

A lock, held by the above weaver:

weaver          16424  FLOCK  64K WRITE 0     0   0 /weavedb/weave-netdata.db

I wonder how that weaver has managed to survive the termination of its associated docker container.

I tried obtaining the current strace of the weaver process (using the SIGQUIT approach above), but strace hung (it wouldn't respond to Ctrl+C).

The weaver process actually refuses to die:

devtoolskubernetes-kubernetes-cr0-1-1519311725 core # ps -ef | grep -i weaver
root     16270  7109  0 17:58 pts/0    00:00:00 grep --colour=auto -i weaver
root     16424     1  2 Mar07 ?        00:35:22 /home/weave/weaver --port=6783 --datapath=datapath --name=9a:60:a6:d9:ce:a8 --host-root=/host --http-addr=127.0.0.1:6784 --status-addr=0.0.0.0:6782 --docker-api= --no-dns --db-prefix=/weavedb/weave-net --ipalloc-range=172.31.0.0/17 --nickname=devtoolskubernetes-kubernetes-cr0-1-1519311725 --ipalloc-init consensus=11 --conn-limit=100 --expect-npc 10.118.10.112 10.118.10.113 10.118.10.114 10.118.10.108 10.118.10.109 10.118.10.110 10.118.10.111 10.118.10.104 10.118.10.105 10.118.10.107 10.118.10.106
devtoolskubernetes-kubernetes-cr0-1-1519311725 core # kill -9 16424
devtoolskubernetes-kubernetes-cr0-1-1519311725 core # ps -ef | grep -i weaver
root     16296  7109  0 17:58 pts/0    00:00:00 grep --colour=auto -i weaver
root     16424     1  2 Mar07 ?        00:35:22 /home/weave/weaver --port=6783 --datapath=datapath --name=9a:60:a6:d9:ce:a8 --host-root=/host --http-addr=127.0.0.1:6784 --status-addr=0.0.0.0:6782 --docker-api= --no-dns --db-prefix=/weavedb/weave-net --ipalloc-range=172.31.0.0/17 --nickname=devtoolskubernetes-kubernetes-cr0-1-1519311725 --ipalloc-init consensus=11 --conn-limit=100 --expect-npc 10.118.10.112 10.118.10.113 10.118.10.114 10.118.10.108 10.118.10.109 10.118.10.110 10.118.10.111 10.118.10.104 10.118.10.105 10.118.10.107 10.118.10.106

Indeed, it's in uninterruptible sleep:

16424 ?        Dl    35:22 /home/weave/weaver --port=6783 --datapath=datapath --name=9a:60:a6:d9:ce:a8 --host-root=/host --http-addr=127.0.0.1:6784 --status-addr=0.0.0.0:6782 --docker-api= --no-dns --db-prefix=/weavedb/weave-net --ipalloc-range=172.31.0.0/17 --nickname=devtoolskubernetes-kubernetes-cr0-1-1519311725 --ipalloc-init consensus=11 --conn-limit=100 --expect-npc 10.118.10.112 10.118.10.113 10.118.10.114 10.118.10.108 10.118.10.109 10.118.10.110 10.118.10.111 10.118.10.104 10.118.10.105 10.118.10.107 10.118.10.106

Kernel stack:

devtoolskubernetes-kubernetes-cr0-1-1519311725 fd # cat /proc/16424/stack
[<ffffffff920e1923>] __refrigerator+0x73/0x160
[<ffffffff92084c76>] get_signal+0x5c6/0x5d0
[<ffffffff9202aa36>] do_signal+0x36/0x610
[<ffffffff92003011>] exit_to_usermode_loop+0x71/0xb0
[<ffffffff920039d9>] do_syscall_64+0xe9/0x1c0
[<ffffffff92800115>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

cgroup frozen status:

devtoolskubernetes-kubernetes-cr0-1-1519311725 16424 # cat /sys/fs/cgroup/freezer/kubepods/burstable/podfe44ab7d-1bcb-11e8-8c21-fa163e8e4409/35f7326fc1bee03c4f669569eb91d9241d7659060be582e5740bdaab9911544d/freezer.state 
FROZEN
devtoolskubernetes-kubernetes-cr0-1-1519311725 16424 # echo THAWED > /sys/fs/cgroup/freezer/kubepods/burstable/podfe44ab7d-1bcb-11e8-8c21-fa163e8e4409/35f7326fc1bee03c4f669569eb91d9241d7659060be582e5740bdaab9911544d/freezer.state 
# And it finally dies!
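
For reference, the same check can be scripted. A small diagnostic sketch, assuming cgroup v1 with the freezer hierarchy mounted at /sys/fs/cgroup/freezer (as on this CoreOS node):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// freezerState looks up a PID's freezer cgroup via /proc/<pid>/cgroup and
// returns the contents of its freezer.state (THAWED or FROZEN).
func freezerState(pid string) (string, error) {
	f, err := os.Open(filepath.Join("/proc", pid, "cgroup"))
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines look like "7:freezer:/kubepods/burstable/pod.../<container-id>".
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) == 3 && strings.Contains(parts[1], "freezer") {
			b, err := os.ReadFile(filepath.Join("/sys/fs/cgroup/freezer", parts[2], "freezer.state"))
			if err != nil {
				return "", err
			}
			return strings.TrimSpace(string(b)), nil
		}
	}
	return "", fmt.Errorf("no freezer cgroup found for pid %s", pid)
}

func main() {
	state, err := freezerState(os.Args[1]) // e.g. 16424
	if err != nil {
		panic(err)
	}
	// A FROZEN task ignores every signal, including SIGKILL, until thawed.
	fmt.Println("freezer.state:", state)
}
```
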
dogopupper commented 6 years ago

@rajatjindal that's unlikely - there are (aggressive) resource limits everywhere and the cluster doesn't run anything resource-intensive; e.g. the most resource-hungry thing running is Weave Scope.

brb commented 6 years ago

@mikebryant Great findings! Any idea how the weaver process got frozen (maybe there are some entries in journald or dmesg logs)?

brb commented 6 years ago

Is this still an issue?

stuart-warren commented 5 years ago

We (@mikebryant and I) have experienced this again today; currently we're just restarting nodes to get past it. All the nodes were in one OpenStack AZ, so it could be some disk slowness issue triggering it?

brb commented 5 years ago

@stuart-warren Do you have any logs from the event?

stuart-warren commented 5 years ago

@brb I wasn't able to capture anything new that isn't already mentioned here.

jaygorrell commented 5 years ago

I found this thread when searching for [boltDB] Unable to open /weavedb/weave-netdata.db: timeout, but in my case @rajatjindal seemed to be on the right track.

After load-testing our environment (kops-based AWS, 12 nodes, not much else of note), we'll often have 1 or 2 nodes in this state where weave is in a crash loop with the above error. We just kill the node when it happens, but it'd be nice to have something more automated.