weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

Service Discovery on Docker Swarm not working #3382

Open cm6051 opened 6 years ago

cm6051 commented 6 years ago

Hi there,

I'm trying to get service discovery working with weave-net.

I'm using a docker stack file like this:

version: "3"

services:
  nginx3:
    image: cm6051/nginxcurlping
    ports:
      - 8003:80
    deploy:
      mode: replicated
      replicas: 2

  nginx4:
    image: cm6051/nginxcurlping
    ports:
      - 8004:80
    deploy:
      mode: replicated
      replicas: 2

networks:
  default:
    driver: store/weaveworks/net-plugin:2.4.0

I would expect to be able to ping the service names "nginx3" and "nginx4" from containers in this stack, but it doesn't work:

root@1e04d376eb0e:/# ping -c 2 nginx3
PING nginx3 (10.0.6.2) 56(84) bytes of data.
From 1e04d376eb0e (10.0.6.7) icmp_seq=1 Destination Host Unreachable
From 1e04d376eb0e (10.0.6.7) icmp_seq=2 Destination Host Unreachable

--- nginx3 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1009ms
root@1e04d376eb0e:/# ping -c 2 nginx4
PING nginx4 (10.0.6.5) 56(84) bytes of data.
From 1e04d376eb0e (10.0.6.7) icmp_seq=1 Destination Host Unreachable
From 1e04d376eb0e (10.0.6.7) icmp_seq=2 Destination Host Unreachable

--- nginx4 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1001ms

The error "Unable to find load balancing endpoint for network mh6gnsbfiatinqiluv6aterbb" appears in the logs - I guess this is a symptom of the problem...

A similar stack file that does not use weave-net looks like this:

version: "3"

services:
  nginx1:
    image: cm6051/nginxcurlping
    ports:
      - 8001:80
    deploy:
      mode: replicated
      replicas: 2

  nginx2:
    image: cm6051/nginxcurlping
    ports:
      - 8002:80
    deploy:
      mode: replicated
      replicas: 2

With this one it works OK:

root@d0aa2a463e1c:/# ping -c 2 nginx1
PING nginx1 (10.0.4.4) 56(84) bytes of data.
64 bytes from 10.0.4.4 (10.0.4.4): icmp_seq=1 ttl=64 time=0.085 ms
64 bytes from 10.0.4.4 (10.0.4.4): icmp_seq=2 ttl=64 time=0.075 ms

--- nginx1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.075/0.080/0.085/0.005 ms
root@d0aa2a463e1c:/# ping -c 2 nginx2
PING nginx2 (10.0.4.7) 56(84) bytes of data.
64 bytes from 10.0.4.7 (10.0.4.7): icmp_seq=1 ttl=64 time=0.070 ms
64 bytes from 10.0.4.7 (10.0.4.7): icmp_seq=2 ttl=64 time=0.067 ms

--- nginx2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.067/0.068/0.070/0.008 ms

Versions:

$ weave version
root@host1:~# docker plugin ls
ID                  NAME                                DESCRIPTION                   ENABLED
17c5a2fb4ac5        store/weaveworks/net-plugin:2.4.0   Weave Net plugin for Docker   true

root@host1:~# docker version
Client:
 Version:           18.06.0-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        0ffa825
 Built:             Wed Jul 18 19:09:54 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.0-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       0ffa825
  Built:            Wed Jul 18 19:07:56 2018
  OS/Arch:          linux/amd64
  Experimental:     false

$ uname -a
root@host1:~# uname -a
Linux host1 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:58:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ kubectl version
N/A (Docker Swarm)

Logs:

$ journalctl -u docker.service --no-pager
Aug 20 13:20:41 host1 dockerd[7356]: time="2018-08-20T13:20:41Z" level=error msg="INFO: 2018/08/20 13:20:41.202614 [net] NetworkAllocate mh6gnsbfiatinqiluv6aterbb" plugin=17c5a2fb4ac5d2b7f6096383dd2f8d4a73c9cc974ed3733a8f20a8239aa2c700
Aug 20 13:20:45 host1 dockerd[7356]: time="2018-08-20T13:20:45Z" level=error msg="INFO: 2018/08/20 13:20:45.327510 [net] CreateNetwork mh6gnsbfiatinqiluv6aterbb" plugin=17c5a2fb4ac5d2b7f6096383dd2f8d4a73c9cc974ed3733a8f20a8239aa2c700
Aug 20 13:20:45 host1 dockerd[7356]: time="2018-08-20T13:20:45.482865744Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Aug 20 13:20:45 host1 dockerd[7356]: time="2018-08-20T13:20:45.483019413Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 20 13:20:45 host1 dockerd[7356]: time="2018-08-20T13:20:45Z" level=error msg="INFO: 2018/08/20 13:20:45.564019 [net] CreateEndpoint c29ea1de8935ab24151a6b85cdc3083c29ae05f652ca2b85396721a9c2f2ae00" plugin=17c5a2fb4ac5d2b7f6096383dd2f8d4a73c9cc974ed3733a8f20a8239aa2c700
Aug 20 13:20:45 host1 dockerd[7356]: time="2018-08-20T13:20:45Z" level=error msg="INFO: 2018/08/20 13:20:45.585080 [net] JoinEndpoint mh6gnsbfiatinqiluv6aterbb:c29ea1de8935ab24151a6b85cdc3083c29ae05f652ca2b85396721a9c2f2ae00 to /var/run/docker/netns/f70781a1f9a9" plugin=17c5a2fb4ac5d2b7f6096383dd2f8d4a73c9cc974ed3733a8f20a8239aa2c700
Aug 20 13:20:45 host1 dockerd[7356]: time="2018-08-20T13:20:45Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/fa428d276a9353e04c5641a46f1af77409bc286dd41714bfb5ea8f5280449754/shim.sock" debug=false pid=16359
Aug 20 13:20:47 host1 dockerd[7356]: time="2018-08-20T13:20:47.001048409Z" level=error msg="addLBBackend mh6gnsbfiatinqiluv6aterbb/nginxweave_default: Unable to find load balancing endpoint for network mh6gnsbfiatinqiluv6aterbb"
Aug 20 13:20:47 host1 dockerd[7356]: time="2018-08-20T13:20:47.001321749Z" level=error msg="addLBBackend mh6gnsbfiatinqiluv6aterbb/nginxweave_default: Unable to find load balancing endpoint for network mh6gnsbfiatinqiluv6aterbb"
Aug 20 13:20:48 host1 dockerd[7356]: time="2018-08-20T13:20:48.386548234Z" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers: [nameserver 8.8.8.8 nameserver 8.8.4.4]"
Aug 20 13:20:48 host1 dockerd[7356]: time="2018-08-20T13:20:48.387729192Z" level=info msg="IPv6 enabled; Adding default IPv6 external servers: [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
Aug 20 13:20:48 host1 dockerd[7356]: time="2018-08-20T13:20:48Z" level=error msg="INFO: 2018/08/20 13:20:48.496566 [net] CreateEndpoint 4c175ec7baaeaf0d8aabc50c25ce0ab92f1594b736cd3951d981d7583be402d3" plugin=17c5a2fb4ac5d2b7f6096383dd2f8d4a73c9cc974ed3733a8f20a8239aa2c700
Aug 20 13:20:48 host1 dockerd[7356]: time="2018-08-20T13:20:48Z" level=error msg="INFO: 2018/08/20 13:20:48.511611 [net] JoinEndpoint mh6gnsbfiatinqiluv6aterbb:4c175ec7baaeaf0d8aabc50c25ce0ab92f1594b736cd3951d981d7583be402d3 to /var/run/docker/netns/736ab85021e5" plugin=17c5a2fb4ac5d2b7f6096383dd2f8d4a73c9cc974ed3733a8f20a8239aa2c700
Aug 20 13:20:48 host1 dockerd[7356]: time="2018-08-20T13:20:48Z" level=info msg="shim docker-containerd-shim started" address="/containerd-shim/moby/1e04d376eb0ea6822c4d875b8b7b66c4cc30822e58382cc5671605232110af2d/shim.sock" debug=false pid=16559
Aug 20 13:20:49 host1 dockerd[7356]: time="2018-08-20T13:20:49.722577530Z" level=error msg="addLBBackend mh6gnsbfiatinqiluv6aterbb/nginxweave_default: Unable to find load balancing endpoint for network mh6gnsbfiatinqiluv6aterbb"
Aug 20 13:20:49 host1 dockerd[7356]: time="2018-08-20T13:20:49.775719997Z" level=error msg="addLBBackend mh6gnsbfiatinqiluv6aterbb/nginxweave_default: Unable to find load balancing endpoint for network mh6gnsbfiatinqiluv6aterbb"

Network:

root@host1:~# ip route
default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.18.0.0/16 dev docker_gwbridge proto kernel scope link src 172.18.0.1
192.168.43.0/24 dev enp0s8 proto kernel scope link src 192.168.43.11
224.0.0.0/4 dev enp0s8 scope link
root@host1:~# ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: enp0s3    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic enp0s3\       valid_lft 71996sec preferred_lft 71996sec
3: enp0s8    inet 192.168.43.11/24 brd 192.168.43.255 scope global enp0s8\       valid_lft forever preferred_lft forever
4: docker0    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\       valid_lft forever preferred_lft forever
9: docker_gwbridge    inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge\       valid_lft forever preferred_lft forever
root@host1:~# iptables-save
# Generated by iptables-save v1.6.1 on Mon Aug 20 13:21:42 2018
*mangle
:PREROUTING ACCEPT [106886:232629203]
:INPUT ACCEPT [67747:129705231]
:FORWARD ACCEPT [39139:102923972]
:OUTPUT ACCEPT [55895:10215454]
:POSTROUTING ACCEPT [95034:113139426]
COMMIT
# Completed on Mon Aug 20 13:21:42 2018
# Generated by iptables-save v1.6.1 on Mon Aug 20 13:21:42 2018
*nat
:PREROUTING ACCEPT [5:300]
:INPUT ACCEPT [5:300]
:OUTPUT ACCEPT [5:300]
:POSTROUTING ACCEPT [5:300]
:DOCKER - [0:0]
:DOCKER-INGRESS - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER-INGRESS
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT -m addrtype --dst-type LOCAL -j DOCKER-INGRESS
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -o docker_gwbridge -m addrtype --src-type LOCAL -j MASQUERADE
-A POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A DOCKER -i docker_gwbridge -j RETURN
-A DOCKER -i docker0 -j RETURN
-A DOCKER-INGRESS -p tcp -m tcp --dport 8004 -j DNAT --to-destination 172.18.0.2:8004
-A DOCKER-INGRESS -p tcp -m tcp --dport 8003 -j DNAT --to-destination 172.18.0.2:8003
-A DOCKER-INGRESS -j RETURN
COMMIT
# Completed on Mon Aug 20 13:21:42 2018
# Generated by iptables-save v1.6.1 on Mon Aug 20 13:21:42 2018
*filter
:INPUT ACCEPT [409:63110]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [331:81591]
:DOCKER - [0:0]
:DOCKER-INGRESS - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
:WEAVE-EXPOSE - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-INGRESS
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -i weave -o weave -j ACCEPT
-A FORWARD -o weave -j WEAVE-EXPOSE
-A FORWARD -i weave ! -o weave -j ACCEPT
-A FORWARD -o weave -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker_gwbridge -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker_gwbridge -j DOCKER
-A FORWARD -i docker_gwbridge ! -o docker_gwbridge -j ACCEPT
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A FORWARD -i docker_gwbridge -o docker_gwbridge -j DROP
-A DOCKER-INGRESS -p tcp -m tcp --dport 8004 -j ACCEPT
-A DOCKER-INGRESS -p tcp -m state --state RELATED,ESTABLISHED -m tcp --sport 8004 -j ACCEPT
-A DOCKER-INGRESS -p tcp -m tcp --dport 8003 -j ACCEPT
-A DOCKER-INGRESS -p tcp -m state --state RELATED,ESTABLISHED -m tcp --sport 8003 -j ACCEPT
-A DOCKER-INGRESS -j RETURN
-A DOCKER-ISOLATION-STAGE-1 -i docker_gwbridge ! -o docker_gwbridge -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker_gwbridge -j DROP
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Mon Aug 20 13:21:42 2018
totaki commented 6 years ago

I have the same problem. For me it works if I use endpoint_mode: dnsrr, but I can't understand how I can expose services to some nodes. For example, I try weave dns-lookup some-container, but weave status shows no DNS service, and if I run ps aux | grep weave I see '--no-dns'. Maybe you have a more realistic example of using the weave plugin with swarm services to expose some services.
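
For anyone wanting the CLI equivalent of that workaround, a minimal sketch, assuming an attachable network named mynet already exists (the network name is illustrative; the image is the one from the report above). The --endpoint-mode dnsrr flag corresponds to deploy.endpoint_mode: dnsrr in a v3 stack file:

# Create a service that uses DNS round-robin instead of a Swarm virtual IP
docker service create \
  --name nginx3 \
  --network mynet \
  --endpoint-mode dnsrr \
  cm6051/nginxcurlping

Note that a dnsrr service cannot publish ports through the ingress routing mesh; any published ports would have to use host mode.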

bboreham commented 6 years ago

@cm6051 can you run curl 127.0.0.1:6783/status on the node with Weave Net running and post the result here please?

I will comment that the service discovery piece is working, since ping tries to talk to a specific address. But then nothing comes back so I wonder if the network itself is set up ok.

bboreham commented 6 years ago

Also, the entire log would be helpful - it looks like you only posted 8 seconds' worth.

mvtorres commented 6 years ago

We had the same problem; it seems to happen on Docker version 18.06.

It works on 18.03.1-ce.

mvtorres commented 6 years ago

Our test setup

version: '3.4'

services:

  foo:
    image: alpine
    entrypoint: sleep 999

  bar:
    image: alpine
    entrypoint: sleep 999

networks:
  default:
    driver: weaveworks/net-plugin:2.1.3
    ipam:
      driver: default
      config:
      - subnet: 10.32.1.0/24

To test

docker exec -it $(docker ps |grep foo| awk '{print $1}') ping bar
docker exec -it $(docker ps |grep bar| awk '{print $1}') ip address

The symptom was that the IP resolved for bar and bar's actual IP address are different.

It failed even with a one-node swarm. It also failed with weaveworks/net-plugin:2.4.0.
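
A quick way to see the mismatch from the host, as a sketch built on the test commands above (the name filter assumes a single matching container):

docker exec -it $(docker ps -qf name=foo) nslookup bar   # the address Swarm's DNS hands out for "bar"
docker exec -it $(docker ps -qf name=bar) ip address     # bar's real addresses, for comparison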

deminngi commented 6 years ago

Can also confirm that I get the same error under weaveworks/net-plugin:2.4.0 using docker-ce-18.06.0-ce.

No way to find the LB endpoint.

This is my setup for a custom weave network:

docker network create --driver=weaveworks/net-plugin:2.4.0 --subnet 10.32.0.0/24 --gateway 10.32.0.1 --attachable weave

Could it be that launch.sh is not configured properly, and that this is related to "providing access to the docker API (in containers)"?

#!/bin/sh

set -e

# Default if not supplied - same as weave net default
IPALLOC_RANGE=${IPALLOC_RANGE:-10.32.0.0/12}
HTTP_ADDR=${WEAVE_HTTP_ADDR:-127.0.0.1:6784}
STATUS_ADDR=${WEAVE_STATUS_ADDR:-0.0.0.0:6782}
HOST_ROOT=${HOST_ROOT:-/host}
LOG_LEVEL=${LOG_LEVEL:-info}
WEAVE_DIR="/host/var/lib/weave"

mkdir $WEAVE_DIR || true

echo "Starting launch.sh"

# Check if the IP range overlaps anything existing on the host
/usr/bin/weaveutil netcheck $IPALLOC_RANGE weave

# We need to get a list of Swarm nodes which might run the net-plugin:
# - In the case of missing restart.sentinel, we assume that net-plugin has started
#   for the first time via the docker-plugin cmd. So it's safe to request docker.sock.
# - If restart.sentinel present, let weaver restore from it as docker.sock is not
#   available to any plugin in general (https://github.com/moby/moby/issues/32815).

PEERS=
if [ ! -f "/restart.sentinel" ]; then
    PEERS=$(/usr/bin/weaveutil swarm-manager-peers)
fi

router_bridge_opts() {
    echo --datapath=datapath
    [ -z "$WEAVE_MTU" ] || echo --mtu "$WEAVE_MTU"
    [ -z "$WEAVE_NO_FASTDP" ] || echo --no-fastdp
}

multicast_opt() {
    [ -z "$WEAVE_MULTICAST" ] || echo "--plugin-v2-multicast"
}

exec /home/weave/weaver $EXTRA_ARGS --port=6783 $(router_bridge_opts) \
    --host-root=/host \
    --proc-path=/host/proc \
    --http-addr=$HTTP_ADDR --status-addr=$STATUS_ADDR \
    --no-dns \
    --ipalloc-range=$IPALLOC_RANGE \
    --nickname "$(hostname)" \
    --log-level=$LOG_LEVEL \
    --db-prefix="$WEAVE_DIR/weave" \
    --plugin-v2 \
    $(multicast_opt) \
    --plugin-mesh-socket='' \
    --docker-api='' \
    $PEERS
johny-mnemonic commented 6 years ago

This doesn't seem to be related to weave. I just tried the setup described by @mvtorres here, and it behaves the same with Docker 18.06-ce even without the weave plugin installed on my system. When used with

deploy:
  endpoint_mode: dnsrr

The IP address of bar is visible to foo, which is strange. It looks as if Docker started serving internal services using ingress :-O

jmkgreen commented 6 years ago

I think we are in the same boat (although the title of this issue should be re-worded).

Effectively, weave is not working in swarm mode at all, yet overlay works fine in its place.

We have Docker 18.06.1-ce and launch two stacks, where one container in each shares the very same network. The only particular networking characteristic we have applied is an alias when the container is attached to the shared network. We do not specify replicas. When I exec into the containers they can resolve each other, but ping reports the destination unreachable:

activemq@graves:/opt/apache-activemq-5.13.4$ ping billing-activemq
PING billing-activemq (10.101.0.6) 56(84) bytes of data.
From graves (10.101.0.9) icmp_seq=1 Destination Host Unreachable
From graves (10.101.0.9) icmp_seq=2 Destination Host Unreachable
From graves (10.101.0.9) icmp_seq=3 Destination Host Unreachable
From graves (10.101.0.9) icmp_seq=4 Destination Host Unreachable
From graves (10.101.0.9) icmp_seq=5 Destination Host Unreachable
From graves (10.101.0.9) icmp_seq=6 Destination Host Unreachable

Here's the routing table if it helps at all:

root@graves:/opt/apache-activemq-5.13.4# routel
         target            gateway          source    proto    scope    dev tbl
        default         172.18.0.1                                     eth1 
       10.0.9.0 24                       10.0.9.49   kernel     link   eth0 
     10.101.0.0 28                      10.101.0.9   kernel     link ethwe0 
     172.18.0.0 16                     172.18.0.13   kernel     link   eth1 
       10.0.9.0          broadcast       10.0.9.49   kernel     link   eth0 local
      10.0.9.49              local       10.0.9.49   kernel     host   eth0 local
     10.0.9.255          broadcast       10.0.9.49   kernel     link   eth0 local
     10.101.0.0          broadcast      10.101.0.9   kernel     link ethwe0 local
     10.101.0.9              local      10.101.0.9   kernel     host ethwe0 local
    10.101.0.15          broadcast      10.101.0.9   kernel     link ethwe0 local
      127.0.0.0          broadcast       127.0.0.1   kernel     link     lo local
      127.0.0.0 8            local       127.0.0.1   kernel     host     lo local
      127.0.0.1              local       127.0.0.1   kernel     host     lo local
127.255.255.255          broadcast       127.0.0.1   kernel     link     lo local
     172.18.0.0          broadcast     172.18.0.13   kernel     link   eth1 local
    172.18.0.13              local     172.18.0.13   kernel     host   eth1 local
 172.18.255.255          broadcast     172.18.0.13   kernel     link   eth1 local
        default        unreachable                   kernel              lo 
        default        unreachable                   kernel              lo 

Now, if we take down our stack and remove the shared network, then re-create the shared network using the overlay driver, the problem disappears.
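
Roughly what that looks like, as a sketch (the stack and network names here are placeholders, not our real ones):

docker stack rm stack-a
docker stack rm stack-b
docker network rm shared-net
docker network create --driver overlay --attachable shared-net
docker stack deploy -c stack-a.yml stack-a
docker stack deploy -c stack-b.yml stack-b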

One other thing (that I cannot imagine is related): the documentation for installing the swarm plugin asks us to install weaveworks/net-plugin:latest_release, which is not found. If we refer to store/weaveworks/net-plugin:latest_release instead, it does work (install command sketched below). Here it is under docker plugin ls:

ID                  NAME                                         DESCRIPTION                          ENABLED
c40c08f82ac5        store/weaveworks/net-plugin:latest_release   Weave Net plugin for Docker          true
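
For reference, a sketch of the install using the Store-prefixed name (the --grant-all-permissions flag is optional; it just skips the interactive privilege prompt):

docker plugin install store/weaveworks/net-plugin:latest_release --grant-all-permissions
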
htplbc commented 5 years ago

@cm6051 can you run curl 127.0.0.1:6783/status on the node with Weave Net running and post the result here please?

I will comment that the service discovery piece is working, since ping tries to talk to a specific address. But then nothing comes back so I wonder if the network itself is set up ok.

I didn't see a response to this, so I went ahead and ran it since I'm having the same issue.

curl 127.0.0.1:6784/status
        Version: 2.4.1 (up to date; next check at 2018/10/23 11:11:38)

        Service: router
       Protocol: weave 1..2
           Name: 5e:53:e8:e9:d3:71(master0)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 3
    Connections: 19 (18 established, 1 failed)
          Peers: 19 (with 342 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: idle
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

        Service: plugin (v2)
mploschiavo commented 5 years ago

My peers are only 1. I tried with Docker 18.03 and 18.09. Why do my connections and peers get stuck at 0 and 1? Should I be able to curl on 6782, 6783, and 6784?
ubuntu@mattjienv-mgr-000000:~/helpers$ curl 127.0.0.1:6782/status
        Version: 2.5.0 (up to date; next check at 2018/11/10 03:46:49)

        Service: router
       Protocol: weave 1..2
           Name: 32:64:ac:93:4e:ef(mattjienv-mgr-000000)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 0
    Connections: 0
          Peers: 1
 TrustedSubnets: none

        Service: ipam
         Status: idle
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

ubuntu@mattjienv-mgr-000000:~/helpers$ docker plugin ls
ID                  NAME                                         DESCRIPTION                       ENABLED
20ca0ab5a7c3        cloudstor:azure                              cloud storage plugin for Docker   true
3a902c69ea06        store/weaveworks/net-plugin:latest_release   Weave Net plugin for Docker       false
0c35a1de7b79        weaveworks/net-plugin:latest_release         Weave Net plugin for Docker       true

ubuntu@mattjienv-mgr-000000:~/helpers$ docker node ls
ID                            HOSTNAME                        STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
rbsgcnjiz0x4c0zfvy2t7a1ab     mattjienv-es-hdfs-data-000000   Ready    Active                          18.09.0
wzoy1cbnkhbq9omo4vzgfbfg5     mattjienv-es-hdfs-nn1-000000    Ready    Active                          18.09.0
l8s0yv8ynhj817ygu51lpw5f5     mattjienv-es-log-000000         Ready    Active                          18.09.0
a0b6lt4fgktk3hsmyrs884fyk     mattjienv-hdfs-data-000000      Ready    Active                          18.09.0
l74d44jcj37b83o4y5fgiz3jk *   mattjienv-mgr-000000            Ready    Active         Leader           18.09.0
uy0j6m0tf936iqvzzo07u5nw5     mattjienv-storm-000000          Ready    Active                          18.09.0
ch1itm1nmoveddfao8ov5518i     mattjienv-storm-000001          Ready    Active                          18.09.0
9kc3l2mnj0grbigpmz1h58slm     mattjienv-storm-000002          Ready    Active                          18.09.0
ju1g9w82n2ru0pncw94fa99gn     mattjienv-util-000000           Ready    Active                          18.09.0
ty0cudabz05ukhrs02f49difu     mattjienv-zk-kafka-000000       Ready    Active                          18.09.0

ubuntu@mattjienv-mgr-000000:~/helpers$ docker version
Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov 7 00:48:57 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov 7 00:16:44 2018
  OS/Arch:          linux/amd64
  Experimental:     false

bboreham commented 5 years ago

When you define a service publishing a port like ports: - 8003:80, Docker Swarm returns its "ingress" routing mesh address for that port. See https://docs.docker.com/engine/swarm/ingress/.

These are virtual IP addresses, only defined for that specific port, which is why ping doesn't work. Ping doesn't have port numbers.

As noted in an earlier comment, "it works if I use endpoint_mode: dnsrr" - this mode turns off the virtual IPs and returns the real underlying IPs of the containers. See https://docs.docker.com/engine/swarm/ingress/#without-the-routing-mesh.
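
If the real container addresses are needed while keeping the default VIP mode, these may help (a sketch; "nginx3" and the "nginxweave" stack prefix come from the original report, and the container placeholder is illustrative):

docker exec -it <any container on that network> nslookup tasks.nginx3              # tasks.<service> resolves to the individual task IPs, not the VIP
docker service inspect --format '{{json .Endpoint.VirtualIPs}}' nginxweave_nginx3  # shows the VIPs Swarm assigned to the service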

I don't think there are any other unanswered points in the comments relating to the title of this issue.

@mploschiavo I cannot see the relevance of your comment to this issue. Suggest you open a new issue with the specifics.

davidsmith1307 commented 5 years ago

Whilst ping may not actually work and ICMP may not make it through, it is an easy way of resolving an address when most images don't have dig or nslookup. We are seeing the same issue - the address resolved on the 'client' container is always one lower than the 'server' container's IP address in the last octet. The IP addresses returned are also on the weave subnet, not the ingress subnet. Docker 18.09.
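
As an aside on the resolution point: getent is present in most glibc-based images (and many musl/busybox ones) and resolves a name without sending any ICMP; a sketch, with <service> standing in for the name being checked:

getent hosts <service>          # what the container's resolver returns (the VIP in the default mode)
getent hosts tasks.<service>    # with Swarm, the real task IPs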

bboreham commented 5 years ago

I’m talking about the addresses created by Docker for the purpose of routing requests inside the cluster. The “ingress network” is for routing requests that arrive at the host.

davidsmith1307 commented 5 years ago

OK.

So what is the 'correct' way of setting up our scenario? It seems from GitHub that there are a number of people in the same boat, and I believe I have followed the published instructions.


bboreham commented 5 years ago

It’s all correct, in the sense that Docker Swarm is doing what it is designed to do.

If you use the DNS round-robin mode, or if you don't have any ports, then it behaves differently.

davidsmith1307 commented 5 years ago

It would be a big help if that was mentioned in the docs.

So I need to use dnsrr if I have any ports exposed on the database server?


davidsmith1307 commented 5 years ago

I changed the service definition for db to remove the exposed ports and use dnsrr:

db:
  image: mariadb
  hostname: db.weave.local
  deploy:
    endpoint_mode: dnsrr
    restart_policy:
      condition: on-failure
      delay: 5s
      max_attempts: 3
      window: 120s
  environment:
    - MYSQL_ROOT_PASSWORD=example
    - MYSQL_DATABASE=keycloak
    - MYSQL_USER=keycloak
    - MYSQL_PASSWORD=password
  networks:
    - ensyte
  deploy:
    placement:
      constraints:
        - node.labels.type == primary
        - node.role == worker
  volumes:
    - mariadata:/var/lib/mysql

I get exactly the same symptoms

drsmith@swarm01:~$ docker exec -it fbb1db037e32 bash
root@db:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
14220: ethwe0@if14221: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP group default
    link/ether 52:a6:02:95:4b:c7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.13.20/16 brd 192.168.255.255 scope global ethwe0
       valid_lft forever preferred_lft forever
14222: eth0@if14223: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.18.0.3/16 brd 172.18.255.255 scope global eth0
       valid_lft forever preferred_lft forever

[jboss@kc1 ~]$ ping db
PING db (192.168.13.19) 56(84) bytes of data.


johny-mnemonic commented 5 years ago

When you define a service publishing a port like ports: - 8003:80, Docker Swarm returns its "ingress" routing mesh address for that port. See https://docs.docker.com/engine/swarm/ingress/.

These are virtual IP addresses, only defined for that specific port, which is why ping doesn't work. Ping doesn't have port numbers.

As was mentioned above, this issue is not related to exposing ports outside the Docker cluster using ingress. Returning the virtual mesh IP inside the Docker cluster is the actual issue we are dealing with here. It doesn't make sense to do such a thing when the nodes are inside one virtual network (i.e. inside the network defined for your stack). I am not sure what that virtual IP is even good for. Maybe it makes sense for communication between different defined networks inside the swarm cluster? I don't know.

Anyway, inside one defined network it is wrong to resolve the virtual IP of a container for anything, because then the container itself thinks it is reachable on one IP, while others on the same network see it running on a completely different IP, which is obviously an issue for some services.

bboreham commented 5 years ago

Returning the virtual mesh IP inside the Docker cluster is the actual issue we are dealing with here. It doesn't make sense to do such a thing

This is a Docker behaviour, not something we have control of at the network layer.

That said, I would expect the virtual IP to re-route to the container IP somehow. I don't know enough about Swarm mesh routing to say where it's going wrong.

atyutyunnik commented 5 years ago

As a workaround, one should use legacy mode (v1) and create an overlay network for the swarm. Weave Net comes in handy when I need to establish a VPN (between my VM's private network and a public VM), but to make swarming work over it, an overlay network should be used; the Weave plugin (v2) simply won't do. With overlay + Weave's legacy mode, remote hosts find services across the entire swarm and services are properly routed; even if I drain the swarm's manager node, curl requests pass over to the workers and back just fine.

I wasted a few days figuring this out. I hope legacy mode stays around for a while, because it's awesome. I can't say the same about v2 :-/

atyutyunnik commented 5 years ago

Here are the few important commands. On the manager node (I am using 192.77... because the default range conflicts with my network):

Initiate weave on manager node

$ weave launch --ipalloc-range 192.77.1.0/24

Initiate the swarm, retain the output for use on worker nodes

$ docker swarm init --advertise-addr $(weave expose)
docker swarm join --token SWMTKN-1-3ljuqvtqgbli21dx1i1z4oar5kny13ymic6pv3dm8cv7ya7oh3-au1j2ud72xl3q3d0gp0j8bi39 192.77.1.1:2377

The above will look different in your case; anyway, as I said, retain the join-token statement for further use on the worker nodes below.

Create an overlay network for the services/stacks to run over on the swarm

$ docker network create -d overlay --ip-range 192.77.1.0/24 --subnet 192.77.1.0/24 weave2

On worker nodes:

$ weave launch --ipalloc-range 192.77.1.0/24 <specify the manager node's IP by which it's accessible from the workers>
$ weave expose

$ #Here use the output from the manager's swarm init above in order to join the swarm

Now you can create services and stacks on the manager node and enjoy proper routing. The Weave Net documentation is misleading in saying that one can only use the v2 plugin for swarm mode. Not true; in fact, the only way I could make it work was with the legacy mode plus the overlay driver networking.

DanielUranga commented 5 years ago

I have two identical Swarms, except one is running Docker 18.03.1-ce and the other 18.06.1-ce. With version 18.06.1-ce I get exactly this same issue, but on 18.03.1-ce it works as expected. As a workaround I'm considering downgrading to 18.03.1-ce.

DanielUranga commented 5 years ago

Downgrading to Docker 18.03.1-ce solved the issue.