weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

Marathon/Mesos unable to schedule containers: "/w/w": stat /w/w: no such file or directory #3572

Open jamesbibby opened 5 years ago

jamesbibby commented 5 years ago

What you expected to happen?

Marathon/Mesos is able to schedule containers on servers running weave 2.5.0.

What happened?

After about 24 hours of operation, containers fail to start on any hosts running weave 2.5.0.

How to reproduce it?

We are running a services cluster on AWS with some older versions of mesos/marathon/weave/docker:

Ubuntu: Trusty (kernel 4.4.0)
Marathon: 0.11.1
Mesos: 0.25.0
Docker: 1.9.1-0~trusty
Weave: 1.4.2
Registrator: snapshotted from Jan 2016

I am in the process of upgrading the cluster to new(er) versions of Docker, Weave and Registrator:

Docker: 1.11.2-0~trusty
Weave: 2.5.0 (using the proxy)
Registrator: v7

I tested this in a dev cluster and the in-place upgrade worked with the following process: disable the mesos slave process; upgrade and restart docker; upgrade, set up and restart weave; upgrade registrator. When I started canarying the upgrade in production, it failed after about 24 hours with the following error messages:

time="2018-12-21T12:57:02.164498422Z" level=error msg="attach: stdout: write unix /var/run/docker.sock->@: write: broken pipe" 
time="2018-12-21T12:57:14.229753068Z" level=error msg="Handler for GET /v1.23/containers/redacted/millennium-falcon:64de140/json returned error: No such container: redacted/millennium-falcon:64de140" 
time="2018-12-21T12:57:14.375146770Z" level=error msg="Handler for GET /v1.23/containers/mesos-22af75c3-16fd-4540-a7d2-61be362b6e09-S3.947314ff-e723-4f7e-b721-0592323cf54c/json returned error: No such container: mesos-22af75c3-16fd-4540-a
7d2-61be362b6e09-S3.947314ff-e723-4f7e-b721-0592323cf54c" 
time="2018-12-21T12:57:14.376004753Z" level=error msg="Handler for GET /v1.23/images/mesos-22af75c3-16fd-4540-a7d2-61be362b6e09-S3.947314ff-e723-4f7e-b721-0592323cf54c/json returned error: No such image: mesos-22af75c3-16fd-4540-a7d2-61be
362b6e09-S3.947314ff-e723-4f7e-b721-0592323cf54c" 
ERRO[67581] containerd: start container                   error=oci runtime error: exec: "/w/w": stat /w/w: no such file or directory id=0f29cbe214ee6c358ad1b6be4728835c37852b6ecbc37f508be67278e5b3c369

From my reading, this means the weavewait volume (which provides the /w/w entrypoint) is no longer available inside the container, but I would like confirmation of this.
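A quick way to test this theory on an affected host might be the following sketch. The weavevolumes name pattern is an assumption based on weave's usual container naming, and the ID prefix is taken from the containerd error above:

# is the volume container that provides /w/w still present?
$ docker ps -a --filter name=weavevolumes

# where does the failing container expect /w/w to come from?
$ docker inspect --format '{{json .Mounts}}' 0f29cbe214ee
$ docker inspect --format '{{json .Config.Entrypoint}}' 0f29cbe214ee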

I created a dev cluster with one mesos slave upgraded (and the rest of the cluster running the original versions) and used it for about 24 hours, suspending and restarting applications periodically. I have been able to get the cluster into the broken state and would like help in debugging/resolving this issue.
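For reference, the per-host upgrade procedure described above amounts to roughly the sketch below. The service and package names are assumptions for this Ubuntu Trusty/upstart setup, the peer list is elided, and the --rewrite-inspect flag matches the options visible in the weave log further down:

# take the host out of scheduling
$ sudo service mesos-slave stop

# upgrade and restart docker
$ sudo apt-get install docker-engine=1.11.2-0~trusty
$ sudo service docker restart

# upgrade, set up and relaunch weave (with the proxy)
$ sudo curl -fsSL git.io/weave -o /usr/local/bin/weave
$ sudo chmod +x /usr/local/bin/weave
$ weave launch --rewrite-inspect <peer-ips...>
$ eval $(weave env)    # route docker commands through the weave proxy

# upgrade registrator, then put the host back into scheduling
$ sudo service mesos-slave start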

Anything else we need to know?

Versions:

$ weave version
weave script 2.5.0
weave 2.5.0

$ docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:47:50 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:47:50 2016
 OS/Arch:      linux/amd64

$ uname -a
Linux ip-10-0-5-201 4.4.0-1014-aws #14-Ubuntu SMP Thu Feb 15 21:53:17 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ mesos-slave --version
mesos 0.25.0

Logs:

From what I can see in the logs, weave shut down at some point and the errors started after this occurred:

$ docker logs weave
INFO: 2018/12/20 18:10:39.826421 ->[10.0.4.115:26736|72:b0:4f:fe:84:77(ip-10-0-4-115)]: connection added
INFO: 2018/12/20 18:10:39.834883 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2018/12/20 18:10:39.834950 overlay_switch ->[72:b0:4f:fe:84:77(ip-10-0-4-115)] using sleeve
INFO: 2018/12/20 18:10:39.834973 ->[10.0.4.115:26736|72:b0:4f:fe:84:77(ip-10-0-4-115)]: connection fully established
INFO: 2018/12/20 18:10:39.843661 sleeve ->[10.0.4.115:6783|72:b0:4f:fe:84:77(ip-10-0-4-115)]: Effective MTU verified at 8939
INFO: 2018/12/20 18:10:40.315800 overlay_switch ->[ba:ff:d0:c2:f7:d3(ip-10-0-5-93)] using fastdp
INFO: 2018/12/20 18:10:40.329458 overlay_switch ->[72:b0:4f:fe:84:77(ip-10-0-4-115)] using fastdp
INFO: 2018/12/20 18:10:41.953343 ->[127.0.0.1:38300] connection accepted
INFO: 2018/12/20 18:10:41.953534 ->[127.0.0.1:38300] connection shutting down due to error during handshake: failed to receive remote protocol header: EOF
ERRO: 2018/12/20 18:10:42.562938 Error forwarding chunked response body: write unix /var/run/weave/weave.sock->@: write: broken pipe
INFO: 2018/12/20 18:10:42.661976 [net] DeleteNetwork 09b0a78611b61b26ae767f151b03800ae76e95b9bb1aa19bf6b9b4c7041fcf06
INFO: 2018/12/20 18:10:42.662416 [ipam] ReleaseAddress weave-10.32.0.0/12-10.32.0.0/12 10.32.0.0
INFO: 2018/12/20 18:10:42.663469 [ipam] ReleasePool weave-10.32.0.0/12-10.32.0.0/12
INFO: 2018/12/20 18:10:42.685808 === received SIGINT/SIGTERM ===
*** exiting
Weave is not running (ignore on Kubernetes).
dd194781ec22df64b2369c64a3d616524f61c6ae93efaf07d9274394e7630f4f
10.34.64.0
INFO: 2018/12/20 18:10:55.562925 Command line options: map[proxy:true weave-bridge:weave plugin:true port:6783 no-dns:true resolv-conf:/var/run/weave/etc/resolv.conf ipalloc-range:10.32.0.0/12 nickname:ip-10-0-5-201 host-root:/host http-addr:127.0.0.1:6784 status-addr:127.0.0.1:6782 without-dns:true datapath:datapath docker-bridge:docker0 rewrite-inspect:true dns-listen-address:172.17.0.1:53 H:[unix:///var/run/weave/weave.sock]]
INFO: 2018/12/20 18:10:55.562994 weave  2.5.0
INFO: 2018/12/20 18:10:55.564301 Docker API on unix:///var/run/docker.sock: &[BuildTime=2016-06-01T21:47:50.269346868+00:00 Version=1.11.2 ApiVersion=1.23 GitCommit=b9f10c9 GoVersion=go1.5.4 Os=linux Arch=amd64 KernelVersion=4.4.0-1014-aws]
INFO: 2018/12/20 18:10:55.568567 proxy listening on unix:///var/run/weave/weave.sock
INFO: 2018/12/20 18:10:55.652296 Re-exposing 10.34.64.0/12 on bridge "weave"
INFO: 2018/12/20 18:10:55.704461 Bridge type is bridged_fastdp
INFO: 2018/12/20 18:10:55.704638 Communication between peers is unencrypted.
INFO: 2018/12/20 18:10:55.725828 Our name is e2:ad:cf:10:b7:19(ip-10-0-5-201)
INFO: 2018/12/20 18:10:55.725865 Launch detected - using supplied peer list: [10.0.6.226 10.0.4.187 10.0.4.39 10.0.5.93 10.0.5.150 10.0.4.198 10.0.6.142 10.0.5.198 10.0.4.162 10.0.5.207 10.0.4.115 10.0.6.199 10.0.2.121]
INFO: 2018/12/20 18:10:55.726868 Docker API on unix:///var/run/docker.sock: &[KernelVersion=4.4.0-1014-aws BuildTime=2016-06-01T21:47:50.269346868+00:00 Version=1.11.2 ApiVersion=1.23 GitCommit=b9f10c9 GoVersion=go1.5.4 Os=linux Arch=amd64]
INFO: 2018/12/20 18:10:55.727368 Checking for pre-existing addresses on weave bridge
INFO: 2018/12/20 18:10:55.727613 weave bridge has address 10.34.64.0/12
INFO: 2018/12/20 18:10:55.770649 [allocator e2:ad:cf:10:b7:19] Initialising with persisted data
INFO: 2018/12/20 18:10:55.774930 Sniffing traffic on datapath (via ODP)
INFO: 2018/12/20 18:10:55.784529 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2018/12/20 18:10:55.785433 Listening for status+metrics requests on 127.0.0.1:6782
INFO: 2018/12/20 18:10:55.803306 ->[10.0.6.226:6783] attempting connection
INFO: 2018/12/20 18:10:55.804065 ->[10.0.2.121:6783] attempting connection
INFO: 2018/12/20 18:10:55.804224 ->[10.0.4.198:6783] attempting connection
INFO: 2018/12/20 18:10:55.804374 ->[10.0.5.93:6783] attempting connection
INFO: 2018/12/20 18:10:55.804547 ->[10.0.4.162:6783] attempting connection
INFO: 2018/12/20 18:10:55.804709 ->[10.0.5.207:6783] attempting connection
INFO: 2018/12/20 18:10:55.804902 ->[10.0.5.198:6783] attempting connection
INFO: 2018/12/20 18:10:55.805076 ->[10.0.4.187:6783] attempting connection
INFO: 2018/12/20 18:10:55.805315 ->[10.0.4.39:6783] attempting connection
INFO: 2018/12/20 18:10:55.805500 ->[10.0.4.115:6783] attempting connection
INFO: 2018/12/20 18:10:55.805682 ->[10.0.6.199:6783] attempting connection
INFO: 2018/12/20 18:10:55.805893 ->[10.0.6.142:6783] attempting connection
INFO: 2018/12/20 18:10:55.806230 Waiting for Weave API Server...
INFO: 2018/12/20 18:10:55.806642 ->[10.0.5.150:6783] attempting connection
INFO: 2018/12/20 18:10:55.814337 ->[10.0.6.226:6783|46:76:e3:b8:59:a8(ip-10-0-6-226)]: connection ready; using protocol version 2
INFO: 2018/12/20 18:10:55.814539 ->[10.0.4.115:6783|72:b0:4f:fe:84:77(ip-10-0-4-115)]: connection ready; using protocol version 2
INFO: 2018/12/20 18:10:55.814618 overlay_switch ->[72:b0:4f:fe:84:77(ip-10-0-4-115)] using fastdp
INFO: 2018/12/20 18:10:55.814681 ->[10.0.4.115:6783|72:b0:4f:fe:84:77(ip-10-0-4-115)]: connection added (new peer)
INFO: 2018/12/20 18:10:55.814981 ->[10.0.2.121:6783|5e:48:76:36:85:65(ip-10-0-2-121)]: connection ready; using protocol version 2
INFO: 2018/12/20 18:10:55.815086 ->[10.0.6.199:6783|a6:7e:6f:68:90:3d(ip-10-0-6-199)]: connection ready; using protocol version 2
INFO: 2018/12/20 18:10:55.815173 overlay_switch ->[5e:48:76:36:85:65(ip-10-0-2-121)] using fastdp
INFO: 2018/12/20 18:10:55.815280 ->[10.0.2.121:6783|5e:48:76:36:85:65(ip-10-0-2-121)]: connection added (new peer)
INFO: 2018/12/20 18:10:55.815444 ->[10.0.6.142:6783|76:9a:1b:e6:85:8d(ip-10-0-6-142)]: connection ready; using protocol version 2
INFO: 2018/12/20 18:10:55.815539 overlay_switch ->[76:9a:1b:e6:85:8d(ip-10-0-6-142)] using fastdp
INFO: 2018/12/20 18:10:55.815582 ->[10.0.6.142:6783|76:9a:1b:e6:85:8d(ip-10-0-6-142)]: connection added (new peer)
INFO: 2018/12/20 18:10:55.815901 ->[10.0.4.162:6783|46:e0:91:b0:c1:0a(ip-10-0-4-162)]: connection ready; using protocol version 2
INFO: 2018/12/20 18:10:55.816035 overlay_switch ->[46:e0:91:b0:c1:0a(ip-10-0-4-162)] using fastdp
INFO: 2018/12/20 18:10:55.816074 ->[10.0.5.150:6783|8a:41:16:d8:6a:0f(ip-10-0-5-150)]: connection ready; using protocol version 2

I can provide full logs, including from before and after the upgrade, if this is helpful.

Network:

$ ip route
10.0.5.0/24 dev eth0  proto kernel  scope link  src 10.0.5.201 
10.32.0.0/12 dev weave  proto kernel  scope link  src 10.34.64.0 
172.17.0.0/16 dev docker0  proto kernel  scope link  src 172.17.0.1
$ ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 10.0.5.201/24 brd 10.0.5.255 scope global eth0\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft forever preferred_lft forever
8: weave    inet 10.34.64.0/12 scope global weave\       valid_lft forever preferred_lft forever
$ sudo iptables-save
# Generated by iptables-save v1.4.21 on Fri Dec 21 13:31:21 2018
*nat
:PREROUTING ACCEPT [5:300]
:INPUT ACCEPT [5:300]
:OUTPUT ACCEPT [19:1156]
:POSTROUTING ACCEPT [19:1156]
:DOCKER - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 5511 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 31905 -j DNAT --to-destination 172.17.0.2:5511
-A WEAVE -s 10.32.0.0/12 -d 224.0.0.0/4 -j RETURN
-A WEAVE ! -s 10.32.0.0/12 -d 10.32.0.0/12 -j MASQUERADE
-A WEAVE -s 10.32.0.0/12 ! -d 10.32.0.0/12 -j MASQUERADE
COMMIT
# Completed on Fri Dec 21 13:31:21 2018
# Generated by iptables-save v1.4.21 on Fri Dec 21 13:31:21 2018
*filter
:INPUT ACCEPT [240:56738]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [243:47836]
:DOCKER - [0:0]
:DOCKER-ISOLATION - [0:0]
:WEAVE-EXPOSE - [0:0]
-A INPUT -d 172.17.0.1/32 -i docker0 -p tcp -m tcp --dport 6783 -j DROP
-A INPUT -d 172.17.0.1/32 -i docker0 -p udp -m udp --dport 6783 -j DROP
-A INPUT -d 172.17.0.1/32 -i docker0 -p udp -m udp --dport 6784 -j DROP
-A INPUT -i docker0 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -i docker0 -p tcp -m tcp --dport 53 -j ACCEPT
-A FORWARD -j DOCKER-ISOLATION
-A FORWARD -i docker0 -o weave -j DROP
-A FORWARD -i weave -o weave -j ACCEPT
-A FORWARD -o weave -j WEAVE-EXPOSE
-A FORWARD -i weave ! -o weave -j ACCEPT
-A FORWARD -o weave -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 5511 -j ACCEPT
-A DOCKER-ISOLATION -j RETURN
-A WEAVE-EXPOSE -d 10.32.0.0/12 -j ACCEPT
COMMIT
# Completed on Fri Dec 21 13:31:21 2018
jamesbibby commented 5 years ago

One additional note: restarting docker resolves the issue (temporarily, I think, but I will confirm).

After running sudo service docker restart, containers can be scheduled again.
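For anyone hitting the same thing, the workaround amounts to something like the sketch below (upstart-style service command; the relaunch assumes the weave containers do not restart on their own, and the proxy socket path is the one from the log above):

$ sudo service docker restart
$ weave launch                                    # only if the weave containers did not come back
$ weave status
$ docker -H unix:///var/run/weave/weave.sock ps   # the proxy should answer again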

MikeMichel commented 5 years ago

Do you run a cleanup job which might remove important weave volumes? AFAIK, DC/OS introduced a cleaner in one of its versions.
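One way to confirm or rule that out would be to watch for removals around the time the breakage starts (standard docker CLI event filters; whatever might be removing the volume is unknown here):

# log every container and volume removal so a cleanup job would show up
$ docker events --filter event=destroy
$ docker events --filter type=volume --filter event=destroy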

jtagcat commented 4 years ago

After experimenting with overnode, I got similar output. It seems that SELinux might be your problem as well. Related: #1574 and #2864

The output of journalctl -t setroubleshoot and journalctl -u docker.service should give some info.
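On systemd-based hosts with SELinux tooling installed, a minimal version of that check (the 24-hour window is an arbitrary choice):

$ journalctl -t setroubleshoot --since "-24h"
$ journalctl -u docker.service --since "-24h" | grep -iE 'denied|selinux'
$ getenforce    # Enforcing vs. Permissive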