alex-sherwin opened this issue:
During some debugging it became apparent that weave may be having issues with MTU discovery. Between the client and server exists a firewall whose MTU is set to 1460, but weave is somehow choosing to use an MTU of 490.
When the same test is run from a client/server on the same subnet, there is no issue.
I did a raw tcpdump on the weave interface on both the server and the client in this manner:
This hangs on step 9 as described above.
the raw tcpdump captures for the client/server can be accessed here: https://drive.google.com/folderview?id=0Bxv3aLTVo2eKfkhNSUI3Y2JWb1pnM1VTdVBxdzZoQzBUTTZqTzkwbEVPdldIcTFMNVk4dm8&usp=sharing
Server docker logs weave output:
weave 2015/03/18 16:30:58.466718 Command line options: map[iface:ethwe name:7a:79:3b:63:81:cf wait:20]
weave 2015/03/18 16:30:58.466773 Command line peers: []
weave 2015/03/18 16:30:59.467884 Communication between peers is unencrypted.
weave 2015/03/18 16:30:59.468369 Our name is 7a:79:3b:63:81:cf
weave 2015/03/18 16:30:59.495009 Sniffing traffic on &{380 65535 ethwe 52:4b:83:e4:a3:5a up|broadcast|multicast}
weave 2015/03/18 16:30:59.495038 Discovered our MAC 52:4b:83:e4:a3:5a
weave 2015/03/18 16:30:59.674486 Discovered local MAC 46:0c:6c:8e:73:d7
weave 2015/03/18 16:30:59.730473 Discovered local MAC 7a:79:3b:63:81:cf
weave 2015/03/18 16:31:35.070505 Discovered local MAC 5a:67:c3:4e:d9:9d
weave 2015/03/18 16:31:55.316913 ->[192.168.70.25:57528] connection accepted
weave 2015/03/18 16:31:55.318143 ->[192.168.70.25:57528] completed handshake with 7a:6b:e0:1c:2f:77
weave 2015/03/18 16:31:55.318467 ->[7a:6b:e0:1c:2f:77]: connection added
weave 2015/03/18 16:31:55.322428 ->[7a:6b:e0:1c:2f:77]: connection fully established
weave 2015/03/18 16:31:55.322465 EMSGSIZE on send, expecting PMTU update (IP packet was 60064 bytes, payload was 60056 bytes)
weave 2015/03/18 16:31:55.322481 ->[7a:6b:e0:1c:2f:77]: Effective PMTU set to 1438
weave 2015/03/18 16:31:55.363283 ->[7a:6b:e0:1c:2f:77]: Effective PMTU verified at 1438
weave 2015/03/18 16:31:55.813341 Discovered remote MAC 7a:6b:e0:1c:2f:77 at 7a:6b:e0:1c:2f:77
weave 2015/03/18 16:32:19.511586 Discovered remote MAC 36:45:01:da:ff:89 at 7a:6b:e0:1c:2f:77
Client docker logs weave output:
weave 2015/03/18 16:31:15.005083 Command line options: map[iface:ethwe name:7a:6b:e0:1c:2f:77 wait:20]
weave 2015/03/18 16:31:15.005169 Command line peers: []
weave 2015/03/18 16:31:16.006236 Communication between peers is unencrypted.
weave 2015/03/18 16:31:16.006657 Our name is 7a:6b:e0:1c:2f:77
weave 2015/03/18 16:31:16.061072 Sniffing traffic on &{121 65535 ethwe f6:a6:9a:dc:6e:79 up|broadcast|multicast}
weave 2015/03/18 16:31:16.061114 Discovered our MAC f6:a6:9a:dc:6e:79
weave 2015/03/18 16:31:16.373416 Discovered local MAC 7a:6b:e0:1c:2f:77
weave 2015/03/18 16:31:16.608464 Discovered local MAC 36:45:01:da:ff:89
weave 2015/03/18 16:31:55.520610 ->[192.168.130.95:6783] attempting connection
weave 2015/03/18 16:31:55.523243 ->[192.168.130.95:6783] completed handshake with 7a:79:3b:63:81:cf
weave 2015/03/18 16:31:55.523862 ->[7a:79:3b:63:81:cf]: connection added
weave 2015/03/18 16:31:55.525327 ->[7a:79:3b:63:81:cf]: connection fully established
weave 2015/03/18 16:31:55.525429 EMSGSIZE on send, expecting PMTU update (IP packet was 60064 bytes, payload was 60056 bytes)
weave 2015/03/18 16:31:55.525472 ->[7a:79:3b:63:81:cf]: Effective PMTU set to 1438
weave 2015/03/18 16:31:55.527641 ->[7a:79:3b:63:81:cf]: Effective PMTU verified at 1438
weave 2015/03/18 16:31:59.800015 Discovered remote MAC 7a:79:3b:63:81:cf at 7a:79:3b:63:81:cf
weave 2015/03/18 16:32:03.018063 Discovered remote MAC 5a:67:c3:4e:d9:9d at 7a:79:3b:63:81:cf
weave 2015/03/18 16:32:11.897951 EMSGSIZE on send, expecting PMTU update (IP packet was 927 bytes, payload was 919 bytes)
weave 2015/03/18 16:32:11.897977 ->[7a:79:3b:63:81:cf]: Effective PMTU set to 490
weave 2015/03/18 16:32:11.899324 ->[7a:79:3b:63:81:cf]: Effective PMTU verified at 490
weave 2015/03/18 16:32:12.099969 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:12.100170 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:12.304684 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:12.712316 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:13.528495 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:15.164134 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:18.436008 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:24.981010 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:38.085151 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:38.492605 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:39.308104 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:40.944347 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:44.212285 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:32:50.756138 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:33:03.844305 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
weave 2015/03/18 16:33:30.052198 Sending ICMP 3,4 (10.0.9.19 -> 10.0.9.102): PMTU= 490
Thanks for the pcap files.
There's something conspicuously odd in the client-side capture right from the start. Although the client is sending packets to the server at 10.0.9.19, the packets the server sends back always arrive with the destination IP address set to 172.17.0.32. This is probably an address on the docker network bridge. I don't know for sure that this is connected to the MTU issue, but I'm surprised the client and server are able to communicate at all.
That strange address is probably due to a poor interaction between the NAT that docker does in order to allow containers on the docker bridge to talk to the outside world, and how 'weave expose' is implemented.
It's hard to say whether this is really a bug or not, because the way you are trying to use weave seems strange. As I understand it, you want to run a mongo server in a container, and a mongo client in a container, talking over the weave network. Why not do this with 'weave run' on both ends? You already do that on the server side (and there the 'weave expose' and the port mapping option after 'weave run' appear to be redundant). But on the client you are starting a container with 'docker run' and relying on 'weave expose' to allow it to talk to the server container. This is unusual: The intended use of 'weave expose' is to allow things that are not in containers to talk to a weave network.
Does this make sense? Is there a particular reason why you are trying to use 'weave expose' in that way? If you use 'weave run' for both the client and server containers, do the large requests work?
I'm using weave expose on the hosts just so that I can easily confirm, using ping, that the weave network is working.
Also, regardless of whether the mongo client is run from within a weave-enabled container or not, the end result is the same.
However, your notes on that 172.17.0.32 address made me think twice: this is the VMware VM NAT...
I just re-tested everything with these VM networking setups:
So you're right: it's a problem isolated to my VM network configuration (albeit one where the default VMware networking setup of NAT doesn't work).
But that seems logical; how else would the server host know?
The odd thing here is that quite a lot appeared to work, all the way to doing small inserts in mongo. Ideally, weave should at least have detected that something wasn't well. Even better if it could have coped and got things working.
Perhaps 'weave should have detected' could apply equally to #416/#417, which had a similar property: a few things would work even though the network configuration was unhelpful.
@alex-sherwin, thanks for your testing. I'm still very puzzled by this!
Some thoughts:
@alex-sherwin, if you are able to do some more testing for us, could you record pcap files as before, but in the case with bridged networking, where the large queries across the weave network succeed? That will indicate whether the destination address rewriting is really related to the problem or not.
@dpw, I've re-run the exact same test when my VM is running with only a bridged interface. As mentioned before, this all works with no issues. Here are the pcap files for this scenario:
Thanks!
On the client side, the destination IP address is still being replaced, so that is not directly the cause of the problem. The difference in this capture is that there are no ICMP "frag needed" packets generated by weave, suggesting that the effective PMTU on the underlying network was not restricted to 490 as before.
So I have a theory about what is happening when it hangs: the modification of the destination address is not sufficient to prevent the TCP connection from operating (surprisingly), but it is sufficient to prevent the kernel from registering the ICMP "frag needed". When the client sends a packet that is too large (due to the large write), weave drops it and sends a "frag needed" back to the client side, which ignores it; after a while TCP resends the packet, which gets dropped with another "frag needed", and so on, hence the hang. This is visible in the client capture where it hung.
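For reference, here is a rough Go sketch (hypothetical, not weave's actual code) of what such a "frag needed" message contains: ICMP type 3, code 4 carrying the next-hop MTU the sender is being asked to drop to, followed by the IP header and first 8 bytes of the dropped datagram. This is the message behind the "Sending ICMP 3,4 ... PMTU= 490" log lines above.

// Hypothetical sketch of an ICMP "fragmentation needed" message (RFC 1191).
package main

import (
	"encoding/binary"
	"fmt"
)

// icmpChecksum computes the standard ones'-complement checksum over the message.
func icmpChecksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(b[i:]))
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8
	}
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}

// fragNeeded builds the ICMP payload: type 3, code 4, checksum, 2 unused bytes,
// 2-byte next-hop MTU, then the original IP header plus first 8 data bytes.
func fragNeeded(nextHopMTU uint16, origIPHeaderAnd8 []byte) []byte {
	msg := make([]byte, 8+len(origIPHeaderAnd8))
	msg[0] = 3                                      // type: destination unreachable
	msg[1] = 4                                      // code: fragmentation needed and DF set
	binary.BigEndian.PutUint16(msg[6:], nextHopMTU) // next-hop MTU field
	copy(msg[8:], origIPHeaderAnd8)
	binary.BigEndian.PutUint16(msg[2:], icmpChecksum(msg)) // checksum over the whole message
	return msg
}

func main() {
	// 490 is the PMTU weave advertises in the log lines above.
	fmt.Printf("%x\n", fragNeeded(490, make([]byte, 28)))
}

If the host that should act on this message never registers it, PMTU discovery stalls in exactly the way described above.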
As I mentioned before, the modification of the destination address suggests an iptables rule. So @alex-sherwin, could you show the output of sudo iptables-save in the client VM? It doesn't matter what kind of networking the client has for this. Also, are you using any firewall software within the client VM?
The PMTU of 490 when VMware NAT is used seems strangely low, but is allowed.
@dpw, here's the output of iptables-save in the VM (currently the VM is back to a single NAT interface):
sudo iptables-save
# Generated by iptables-save v1.4.21 on Fri Mar 20 13:23:21 2015
*nat
:PREROUTING ACCEPT [6:468]
:INPUT ACCEPT [6:468]
:OUTPUT ACCEPT [1022:77063]
:POSTROUTING ACCEPT [922:71063]
:DOCKER - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A WEAVE ! -s 10.0.1.0/24 -o weave -j MASQUERADE
-A WEAVE -s 10.0.1.0/24 ! -o weave -j MASQUERADE
COMMIT
# Completed on Fri Mar 20 13:23:21 2015
# Generated by iptables-save v1.4.21 on Fri Mar 20 13:23:21 2015
*filter
:INPUT ACCEPT [75982:28474773]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [81854:30034942]
:DOCKER - [0:0]
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j DROP
-A FORWARD -i weave -o weave -j ACCEPT
COMMIT
# Completed on Fri Mar 20 13:23:21 2015
There is no other firewall of any kind in use in the VM or the VM host
Thanks. I don't see anything obviously wrong there. One question though:
...
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
...
That suggests docker is using the 172.17.x.x address range on its internal bridge. Earlier, you said that the same address range is being used by the VMware NAT interface. Could you double-check that?
I must have looked too quickly: they are close but different subnets. VMware NAT is 172.16.x.x and the docker bridge is 172.17.x.x.
So the IPs in the pcap are from the docker bridge, as you originally suspected.
docker0 Link encap:Ethernet HWaddr 56:84:7a:fe:97:99
inet addr:172.17.42.1 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::5484:7aff:fefe:9799/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:38 errors:0 dropped:0 overruns:0 frame:0
TX packets:118 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2568 (2.5 KB) TX bytes:11810 (11.8 KB)
eth0 Link encap:Ethernet HWaddr 00:0c:29:39:54:c1
inet addr:172.16.158.101 Bcast:172.16.158.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fe39:54c1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:381947 errors:0 dropped:0 overruns:0 frame:0
TX packets:221838 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:508118046 (508.1 MB) TX bytes:19471935 (19.4 MB)
So, I had been trying to reproduce the 172.17.x.x destination addresses from your capture files, without success: tcpdump on the client-side weave bridge just showed 10.0.x.x addresses. But I was using a Fedora 21 test VM I had lying around to act as the client side.
So today I tried with a fresh Ubuntu 14.04 VM. And it does show the 172.17.x.x destination addresses on the client side, in exactly the same test scenarios where Fedora didn't. The iptables rules look the same, so my guess is that a kernel change or patch in one of the distros is responsible, but I haven't looked into that yet (I have docker 1.5 in both cases).
I haven't been able to reproduce the hangs yet. The PMTU between my weave routers is 1500, so I have to make the "large write" case even larger to trigger a "frag needed". But the write still seems to complete successfully. I'll work more on this, constraining the MTU to 490 in case that matters.
One thing I didn't appreciate until late yesterday is that the 'weave expose' on the client side is significant. Without the 'weave expose', if you use 'weave run' for the client container, what happens is that the mongo client tries the request before 'weave run' has had time to set up the ethwe interface, and so the mongo client's attempt to connect to the server simply fails. With 'weave expose', the ethwe interface still isn't set up in time, so the connection attempt goes to the container's eth0, over the docker bridge onto the host (being masqueraded in the process), and then onto the weave bridge and across the weave network. I don't think this convoluted path is really anticipated to work, and while it doesn't explain the destination address rewriting, it certainly seems like ample opportunity for odd things like that to happen. And so, another approach to fixing the hangs might be to run the client under 'weave run' but with a delay so that it does use the ethwe interface. But I'm going to try to reproduce the hangs before I resort to that.
another approach to fixing the hangs might be to run the client under 'weave run' but with a delay so that it does use the ethwe interface
Instead of a simple delay, the recommended approach is to loop (with a delay) until the interface comes up.
Instead of a simple delay, the recommended approach is to loop (with a delay) until the interface comes up.
Yes, but I imagine that could be hard to achieve in the context of a stripped-down mongodb container, whereas a simple delay looks expressible in the mongo shell.
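For illustration, a minimal Go sketch of the poll-until-up idea (hypothetical; in practice this could equally be a shell loop in the container entrypoint or a delay in the mongo shell):

// Hypothetical sketch: wait for the ethwe interface before starting the client workload.
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForInterface polls until the named interface exists and is up,
// or gives up when the timeout expires.
func waitForInterface(name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if iface, err := net.InterfaceByName(name); err == nil && iface.Flags&net.FlagUp != 0 {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("interface %s did not come up within %s", name, timeout)
}

func main() {
	if err := waitForInterface("ethwe", 20*time.Second); err != nil {
		panic(err)
	}
	// ... start the mongo client here ...
}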
I have found the cause of the hang: although the minimum IPv4 MTU is 68, Linux has a minimum limit on the PMTU of 552: http://cschramm.blogspot.co.uk/2012/12/why-does-linux-enforce-minimum-mtu-of.html
You can change this limit. So @alex-sherwin, if you do
echo 200 >/proc/sys/net/ipv4/route/min_pmtu
in your client VM, your hang should go away.
I'm still trying to get my head around why Linux sets this limit.
It also goes some way to explaining where the "Effective PMTU set to 490" comes from: effectiveOverhead() = 28 + 6 + (6 + 6 + 2) + 14 = 62, and 552 - 62 = 490.
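As a sanity check on that arithmetic, a tiny Go sketch; the constant names are assumptions based on the breakdown above, not weave's exact identifiers:

// Hypothetical sketch of the overhead arithmetic.
package main

import "fmt"

const (
	udpOverhead      = 28        // 20-byte IPv4 header + 8-byte UDP header
	nameOverhead     = 6         // assumed: 6-byte sender name prefixed to each UDP packet
	frameOverhead    = 6 + 6 + 2 // src MAC + dst MAC + frame length
	ethernetOverhead = 14        // Ethernet header of the captured frame
	linuxMinPMTU     = 552       // default /proc/sys/net/ipv4/route/min_pmtu
)

func main() {
	overhead := udpOverhead + nameOverhead + frameOverhead + ethernetOverhead
	fmt.Println(overhead, linuxMinPMTU-overhead) // prints: 62 490
}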
The whole destination address modification thing seems to be a red herring. If I tcpdump on the weave router container's ethwe interface, I don't see that rewriting. And if I tcpdump on the docker0 bridge, everything is fully NATted. tcpdumping on the weave bridge seems to show things in a strange halfway state.
Re-tested with a lower /proc/sys/net/ipv4/route/min_pmtu setting and it works this way.
Re-tested with a lower /proc/sys/net/ipv4/route/min_pmtu setting and it works this way
Well, that at least confirms that I'm reproducing your problem.
I'm still not sure why the PMTU on the underlying network would drop to the default Linux minimum of 552. I wondered if it might be because ICMP "frag needed" packets were being blocked, but I spent some time trying to reproduce that without success.
Also, I'm still trying to think of anything useful weave can do to work around this problem, but I don't have any bright ideas so far. The best might simply be to show the problem in the weave status output.
@dpw, I'd be fine with that... the worst part is always not knowing. I (and, I hope, anyone else) wouldn't be using a VMware NAT in this manner for anything other than development purposes.
@dpw
The best might simply be to show the problem in the weave status output.
How hard is that?
How hard is that?
Detect when PMTU for a connection falls such that the overlay MTU is below 552, stash the fact in the connection, and gossip it around. Sounds doable, but I don't claim to know the gossipy side of things well.
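Something like the following, perhaps (a hypothetical Go sketch with invented type and field names, showing only the detection and stashing; the gossip and status-reporting parts are omitted):

// Hypothetical sketch of detecting a collapsed PMTU on a connection.
package main

import "fmt"

const (
	linuxMinPMTU = 552 // default /proc/sys/net/ipv4/route/min_pmtu
	overhead     = 62  // effectiveOverhead() from the earlier comment
)

// connection stands in for weave's per-peer connection state (invented).
type connection struct {
	peer          string
	effectivePMTU int
	pmtuTooSmall  bool // stashed flag that could be gossiped / shown in status
}

func (c *connection) setEffectivePMTU(pmtu int) {
	c.effectivePMTU = pmtu
	// If the underlying PMTU has fallen to (or below) the Linux minimum of 552,
	// the overlay MTU is at most 552 - 62 = 490, the situation seen in the logs.
	c.pmtuTooSmall = pmtu+overhead <= linuxMinPMTU
}

func main() {
	c := &connection{peer: "7a:79:3b:63:81:cf"}
	c.setEffectivePMTU(490)
	fmt.Println(c.peer, "pmtu-too-small:", c.pmtuTooSmall)
}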
However, I doubt that is the right solution. When the problem occurs, weave will still be broken, and in circumstances that are already fairly clear from docker logs weave (now we know what to look for). So we don't gain much at all.
One important consideration that is still not clear is: what conditions cause the PMTU to fall to 552, and how likely do we think they are to occur? If it seems rare, maybe we don't need to do anything. If it seems potentially common, we should think further about whether weave can reasonably work around the problem.
Last week I talked to a friend who has lots of experience with MTUs and what happens to them when ICMP packets are being blocked, and it sounds like there is a good chance that is the cause. More experimentation is needed though.
I've been working on other things, but I can drop them to return to this if it's a priority.
MongoDB host machine: Dell R620 server (40 cores, 128 GB RAM), Ubuntu 14.04, Docker 1.4.1. uname -a:
Client machine: Ubuntu 14.04 VMware Fusion 7.1.1 VM (hardware version 11, provisioned with 12 GB RAM + 8 cores), Docker 1.5.0. uname -a:
Both are using the latest weave release:
The client is able to insert small documents into MongoDB over the weave network, but is unable to insert large documents; output from
sudo tcpdump -i weave
on the client machine shows "need to frag" messages.
Sample server setup (server physical IP is 192.168.130.95):
Client setup:
Small write over weave network from client:
Large write over the weave network from client (hangs):
Same large write over the physical network (works fine):
Output from client sudo tcpdump -i weave:
Server docker logs weave output:
Client docker logs weave output: