weaveworks / weave

Simple, resilient multi-host container networking and more.
https://www.weave.works
Apache License 2.0

zeroconf weave connectivity #224

Open · rade opened this issue 9 years ago

rade commented 9 years ago

At present, in order to establish a weave network, new peers need to be told about at least one other peer, or vice versa.

We could introduce a mode of operation where weave peers discover each other using, say, mDNS, underlying networks permitting.

The main question is how broadly the various rendezvous technologies are supported. Will it work on AWS? GCE? Across availability zones? In typical corporate data centres?

Suggested by @inercia.

inercia commented 9 years ago

I'm afraid this will not work on AWS or GCE. But it seems that some other providers (like RackSpace) do support multicast. I think you can expect corporate data centres to support multicast...

I think this feature would be relatively cheap to implement as Weave already has an mDNS client...

rade commented 9 years ago

Hmm. That makes it significantly less attractive.

I suppose this could work in a hybrid way, i.e. weave would connect to any address specified on the command line, as currently, but also attempt to find peers via zeroconf. That way when, say, I have a weave network that spans multiple data centres and AWS, zeroconf is used for discovery inside the data centres, but the existing mechanism is used inside AWS and across data centres.

inercia commented 9 years ago

Yes, we could use a hybrid solution. Weave could be started with something like weave launch --use-mdns 10.0.1.1. Weave could then start an mDNS client and look for something like "weave.local" (or some other name; I don't know if this would conflict with WeaveDNS...). But once a peer has been found, the Weave node needs a way of knowing whether the newly discovered peer is on the same virtual switch... Maybe all the peers could share a UUID or something...
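
As a rough illustration of this idea (not weave code): derive the queried name from a shared network identifier, so that peers of unrelated networks on the same LAN do not find each other, and multicast an A query for it. The "<uuid>.weave.local" naming is just one possible convention, and the sketch assumes the miekg/dns library already used under /nameserver.

// Minimal sketch: multicast one mDNS "A" query for a group-scoped name.
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/miekg/dns"
)

// mdnsGroup is the well-known mDNS multicast address and port.
var mdnsGroup = &net.UDPAddr{IP: net.IPv4(224, 0, 0, 251), Port: 5353}

// sendPeerQuery multicasts one A query for <networkUUID>.weave.local.
// The group-scoped name is how peers could tell they belong to the same
// virtual switch (a hypothetical convention, not implemented by weave).
func sendPeerQuery(networkUUID string) error {
	name := dns.Fqdn(fmt.Sprintf("%s.weave.local", networkUUID))

	m := new(dns.Msg)
	m.SetQuestion(name, dns.TypeA)
	buf, err := m.Pack()
	if err != nil {
		return err
	}

	conn, err := net.DialUDP("udp4", nil, mdnsGroup)
	if err != nil {
		return err
	}
	defer conn.Close()

	_, err = conn.Write(buf)
	return err
}

func main() {
	// The UUID would come from launch configuration, e.g. a hypothetical
	// `weave launch --use-mdns --net-id b1c3f00d ...`.
	if err := sendPeerQuery("b1c3f00d"); err != nil {
		log.Fatal(err)
	}
}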

bnitin commented 9 years ago

I think this is related. A requirement I have is to incrementally add nodes to a weave network. So initially the network might consist of a few nodes in AWS region-1 and gradually we would add nodes in AWS region-2 and expect all the nodes to be able to talk to each other.

I'm ok with initially issuing commands whenever each node is added so that existing nodes are aware of a newly added node.

But there needs to be a way to dynamically add nodes to an existing network (I couldn't figure out from the documentation whether this was already supported).

rade commented 9 years ago

there needs to be a way to dynamically add nodes to an existing network

There is. It's fundamental to how weave works.

I couldn't figure out from the documentation whether this was already supported

http://zettio.github.io/weave/features.html#dynamic-topologies

inercia commented 9 years ago

I'm looking at the mDNS code in /nameserver, and I have found that the client only waits for the first response, making it difficult to register several IPs for the same name (i.e. "somegroup.weave.local."), especially since the first response is usually the local node... Any idea how this could be solved? Maybe the resolver could exhaust the timeout waiting for responses...

rade commented 9 years ago

See #225 and #226.

inercia commented 9 years ago

I agree that that kind of DNS resolution would be a good thing in the general case, but I don't think it is necessary for this kind of local peer discovery, because latency is not a problem. We could have a background goroutine that periodically queries for peers with a given DNS name. For example, if we try to get all the A records for "weave.local" every 5 seconds, we can do a timed, blocking resolution and wait up to 5 seconds for answers...
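
A sketch of that timed, blocking resolution loop (illustrative only, again assuming miekg/dns): every 5 seconds it multicasts the query and then collects every A record that arrives before the window closes, instead of returning on the first response.

// Sketch: periodic mDNS queries, gathering all answers within each window.
package main

import (
	"log"
	"net"
	"time"

	"github.com/miekg/dns"
)

var mdnsGroup = &net.UDPAddr{IP: net.IPv4(224, 0, 0, 251), Port: 5353}

// collectPeers runs one query window: multicast the question, then read
// multicast responses until the deadline and return every address seen.
func collectPeers(name string, window time.Duration) ([]net.IP, error) {
	conn, err := net.ListenMulticastUDP("udp4", nil, mdnsGroup)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeA)
	query, err := m.Pack()
	if err != nil {
		return nil, err
	}
	out, err := net.DialUDP("udp4", nil, mdnsGroup)
	if err != nil {
		return nil, err
	}
	defer out.Close()
	if _, err := out.Write(query); err != nil {
		return nil, err
	}

	var peers []net.IP
	conn.SetReadDeadline(time.Now().Add(window))
	buf := make([]byte, 65536)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			break // deadline reached: this window is over
		}
		resp := new(dns.Msg)
		if err := resp.Unpack(buf[:n]); err != nil || !resp.Response {
			continue // ignore malformed packets and other peers' queries
		}
		for _, rr := range resp.Answer {
			if a, ok := rr.(*dns.A); ok && a.Hdr.Name == dns.Fqdn(name) {
				peers = append(peers, a.A)
			}
		}
	}
	return peers, nil
}

func main() {
	for range time.Tick(5 * time.Second) {
		peers, err := collectPeers("testgroup.weave.local", 5*time.Second)
		if err != nil {
			log.Println("mDNS query failed:", err)
			continue
		}
		log.Println("discovered peers:", peers)
	}
}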

rade commented 9 years ago

Yes, for local peer discovery we can't accept only the first answer, since then we'd risk constructing multiple disconnected networks.

I don't see why you'd do a timed, blocking resolution though. Surely we'd be happy to accept any answer, whenever it came in, and add it to our list of known endpoints. We do need to work out how frequently to ask though, in order to cope with the question (or answers) getting lost.

inercia commented 9 years ago

I mean that this mDNS peer discoverer should send multicast queries and block (not in the strict sense) waiting for responses for some time X, then send another query and wait again, and so on...

Maybe the mdns_client could be modified for this kind of query, so that the in-flight entry is not cleaned up and the responses channel is not closed after new mDNS responses arrive... A new flag to SendQuery could be added (or maybe a new function, something like PersistentQuery), and users would explicitly call PersistentQueryCancel when they are done...
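
A sketch of what that API shape could look like (PersistentQuery and PersistentQueryCancel are the names proposed above; the signatures and types are only illustrative, not the actual /nameserver interface):

package mdns

import "net"

// PeerResponse is one A record received for a persistent query.
type PeerResponse struct {
	Name string
	Addr net.IP
}

// Client adds a persistent variant alongside the one-shot query. With
// PersistentQuery the in-flight entry is not cleaned up and the response
// channel stays open, so every answer is delivered until the caller
// explicitly cancels the query.
type Client interface {
	// One-shot query: today the first response wins.
	SendQuery(name string, qtype uint16) (net.IP, error)

	// Persistent query: stream every response until cancelled.
	PersistentQuery(name string, qtype uint16) (<-chan PeerResponse, error)
	PersistentQueryCancel(name string, qtype uint16) error
}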

inercia commented 9 years ago

I'm stuck with the iptables setup for multicast traffic between containers on different hosts, maybe related to this issue. The forward path works fine (i.e. sending mDNS queries), but responses never come back (i.e. the DNS A records). I will continue investigating this...

bboreham commented 9 years ago

In my experience this is often caused by Reverse Path Filtering, i.e. if Linux doesn't think packets from that source should be reaching you, it will throw them away.

Examples include: the send and receive addresses being the same, or packets coming from a subnet which doesn't route to your subnet.

You can turn off reverse path filtering to see if this is the case, which may lead to a deeper understanding of the real problem. (Generally we want Weaveworks software to work with Linux defaults.)
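
For diagnosis only, a small sketch that inspects and loosens the rp_filter sysctl through procfs (standard Linux paths, nothing weave-specific; run as root, and restore the original values once done):

// Sketch: read rp_filter (0 = off, 1 = strict, 2 = loose) and disable it.
package main

import (
	"fmt"
	"log"
	"os"
)

func rpFilterPath(iface string) string {
	return fmt.Sprintf("/proc/sys/net/ipv4/conf/%s/rp_filter", iface)
}

func main() {
	// "all" is combined with each interface's own setting (the stricter
	// value wins), so check both.
	for _, iface := range []string{"all", "bridge0", "eth0"} {
		cur, err := os.ReadFile(rpFilterPath(iface))
		if err != nil {
			log.Printf("%s: %v", iface, err)
			continue
		}
		fmt.Printf("rp_filter(%s) = %s", iface, cur)

		// Temporarily disable reverse-path filtering to see whether the
		// missing mDNS replies start arriving.
		if err := os.WriteFile(rpFilterPath(iface), []byte("0\n"), 0644); err != nil {
			log.Printf("could not disable rp_filter on %s: %v", iface, err)
		}
	}
}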

inercia commented 9 years ago

I have tried disabling RP filtering as you suggested, @bboreham, but it does not seem to help. I have tried to set up the simplest scenario with the help of Avahi's mDNS reflector (basically, an mDNS proxy) but, even with this, it does not work. My test setup is this:

weave1 container (172.17.51.2)
    |
[docker bridge0] (172.17.51.1)
    |
[host1 eth0] (192.168.121.235)
-----------------------------------------
    |
-----------------------------------------
[host2 eth0] (192.168.121.161)
    |
[docker bridge0]
    |
weave2 container

Avahi's reflector should forward mDNS queries/responses between bridge0 and eth0.

Traffic at [host1 bridge0] shows the local queries and the responses from the local Weave daemon. No queries or responses from other Weaves are shown:

$ sudo tcpdump -i bridge0  -n "multicast and port mdns"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bridge0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:55:48.638939 IP 172.17.51.1.36227 > 224.0.0.251.5353: 672 A (QM)? testgroup.weave.local. (39)
16:55:48.639157 IP 172.17.51.2.49429 > 224.0.0.251.5353: 672 2/0/0 A 192.168.121.235, A 192.168.40.11 (113)
16:55:51.272786 IP 172.17.51.2.49877 > 224.0.0.251.5353: 48766 A (QM)? testgroup.weave.local. (39)
16:55:51.273046 IP 172.17.51.2.49429 > 224.0.0.251.5353: 48766 2/0/0 A 192.168.121.235, A 192.168.40.11 (113)
16:55:53.639096 IP 172.17.51.1.36227 > 224.0.0.251.5353: 674 A (QM)? testgroup.weave.local. (39)
16:55:53.639263 IP 172.17.51.2.49429 > 224.0.0.251.5353: 674 2/0/0 A 192.168.121.235, A 192.168.40.11 (113)

There is something strange here: queries sometimes come from .51.1 and sometimes from .51.2...

Capturing at [host1 eth0] we can see the queries from both Weaves:

$ sudo tcpdump -i eth0  -n "multicast and port mdns"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
17:13:41.311927 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1101 A (QM)? testgroup.weave.local. (39)
17:13:43.679386 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1092 A (QM)? testgroup.weave.local. (39)
17:13:46.312138 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1103 A (QM)? testgroup.weave.local. (39)
17:13:48.679592 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1094 A (QM)? testgroup.weave.local. (39)
17:13:51.312333 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1105 A (QM)? testgroup.weave.local. (39)
17:13:53.679789 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1096 A (QM)? testgroup.weave.local. (39)
17:13:56.312535 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1107 A (QM)? testgroup.weave.local. (39)
17:13:58.679991 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1098 A (QM)? testgroup.weave.local. (39)
17:14:01.312740 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1109 A (QM)? testgroup.weave.local. (39)
17:14:03.680210 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1100 A (QM)? testgroup.weave.local. (39)
17:14:06.312941 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1111 A (QM)? testgroup.weave.local. (39)
17:14:08.680429 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1102 A (QM)? testgroup.weave.local. (39)

I expected to see the queries from weave1, and seeing the queries from weave2 is a good sign too, but it is odd not to see replies, at least from weave1. Why are queries being forwarded to eth0 while replies are not? I've been playing with iptables with no luck (yet)...

And capturing at the link between host1 and host2 we can see the same: queries from both Weaves.

$ sudo tcpdump -i vnet0 "multicast and port mdns"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 262144 bytes
18:16:43.688113 IP 192.168.121.161.44304 > 224.0.0.251.mdns: 1164 A (QM)? testgroup.weave.local. (39)
18:16:46.321314 IP 192.168.121.235.36227 > 224.0.0.251.mdns: 1175 A (QM)? testgroup.weave.local. (39)
18:16:48.688234 IP 192.168.121.161.44304 > 224.0.0.251.mdns: 1166 A (QM)? testgroup.weave.local. (39)
18:16:51.321518 IP 192.168.121.235.36227 > 224.0.0.251.mdns: 1177 A (QM)? testgroup.weave.local. (39)
18:16:53.688438 IP 192.168.121.161.44304 > 224.0.0.251.mdns: 1168 A (QM)? testgroup.weave.local. (39)

So I'm puzzled by this scenario where some multicast packets (the queries) are forwarded while some others (the replies) are not... I will continue with this investigation.

bboreham commented 9 years ago

Can you post the output of iptables-save or equivalent (multiple iptables -L) on host1 and host2 please? It may give some extra hints.

inercia commented 9 years ago

The iptables dump at [host1] is:

# Generated by iptables-save v1.4.21 on Mon Dec 22 17:09:55 2014
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.51.0/24 ! -o bridge0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A DOCKER ! -i bridge0 -p tcp -m tcp --dport 6783 -j DNAT --to-destination 172.17.51.2:6783
-A DOCKER ! -i bridge0 -p udp -m udp --dport 6783 -j DNAT --to-destination 172.17.51.2:6783
COMMIT
# Completed on Mon Dec 22 17:09:55 2014
# Generated by iptables-save v1.4.21 on Mon Dec 22 17:09:55 2014
*filter
:INPUT ACCEPT [98:6342]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [66:6118]
-A FORWARD -d 172.17.51.2/32 ! -i bridge0 -o bridge0 -p udp -m udp --dport 6783 -j ACCEPT
-A FORWARD -d 172.17.51.2/32 ! -i bridge0 -o bridge0 -p tcp -m tcp --dport 6783 -j ACCEPT
-A FORWARD -o bridge0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i bridge0 ! -o bridge0 -j ACCEPT
-A FORWARD -i bridge0 -o bridge0 -j ACCEPT
-A FORWARD -i weave -o weave -j ACCEPT
COMMIT
# Completed on Mon Dec 22 17:09:55 2014

The dump at [host2] is very similar:

# Generated by iptables-save v1.4.21 on Mon Dec 22 17:13:59 2014
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.52.0/24 ! -o bridge0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A DOCKER ! -i bridge0 -p tcp -m tcp --dport 6783 -j DNAT --to-destination 172.17.52.2:6783
-A DOCKER ! -i bridge0 -p udp -m udp --dport 6783 -j DNAT --to-destination 172.17.52.2:6783
COMMIT
# Completed on Mon Dec 22 17:13:59 2014
# Generated by iptables-save v1.4.21 on Mon Dec 22 17:13:59 2014
*filter
:INPUT ACCEPT [393:33463]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [163:12543]
-A FORWARD -d 172.17.52.2/32 ! -i bridge0 -o bridge0 -p udp -m udp --dport 6783 -j ACCEPT
-A FORWARD -d 172.17.52.2/32 ! -i bridge0 -o bridge0 -p tcp -m tcp --dport 6783 -j ACCEPT
-A FORWARD -o bridge0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i bridge0 ! -o bridge0 -j ACCEPT
-A FORWARD -i bridge0 -o bridge0 -j ACCEPT
-A FORWARD -i weave -o weave -j ACCEPT
COMMIT
# Completed on Mon Dec 22 17:13:59 2014

inercia commented 9 years ago

By the way, is there any particular reason for running the weaver inside a container? Maybe it would be more useful to run it directly on the host: packaging is very simple anyway (with no dependencies), and I see little benefit in keeping it isolated from the environment; in fact, running on the host would probably make some things easier (like this)...

rade commented 9 years ago

The router runs in a container for three reasons:

  1. ease of managing dependencies; while the router currently does indeed have zero dependencies, and we like to keep it that way, we cannot guarantee it
  2. ease of installation, lifecycle management and upgrade; even with zero dependencies there would still be more for users to install, keep running, keep up to date, etc. And the canonical way for doing so outside docker varies by distribution.
  3. ability to run in "pure docker" environments, where docker containers are the only thing that can be installed/executed; we aren't quite there, due to the need for the weave script, but we aren't far off

We could conceivably, and as a last resort, run the router with --net=host. Would that help?

bboreham commented 9 years ago

Thanks for the iptables dumps, @inercia, but sadly I can't see anything in them that would cause your symptoms.

Our 'smoke tests' include a simple 2-host DNS check: https://github.com/zettio/weave/blob/master/test/200_dns_test.sh; maybe if you got that to run you could work backwards to why your test isn't working?

inercia commented 9 years ago

Hi guys,

I've tried several things and the mDNS feature runs fine as long as I launch the weaver with --net=host, but I don't really like this solution: this kind of exception just for the mDNS case does not seem very elegant...

The other alternative would be to add the appropriate iptables rules, but this would probably require duplicating multicast packets: they cannot simply be routed, because mDNS packets have TTL=1, and they cannot simply be redirected (they also need to reach the host machine). So doing iptables engineering, by adding more and more rules to the weave script, does not seem an elegant solution either: too much complexity, difficult to maintain, etc...

And this was the original reason for asking about moving the weaver from a container to the host!

In my opinion, the weave script contains too much logic, and it will probably grow in the near future, so it will have to deal with multiple OSes, toolsets, commands and so on, and doing networking from a bash script is not a lot of fun...

I also think this model imposes a somewhat rigid flow in the system, as some setup is done by the script and, once the weaver takes control, I don't think the system can modify some things (I must admit I'm not an expert here). I can imagine changes in the routing table, devices that go up or down, etc., and I'm not sure Weave could react to this kind of thing in its current form. In conclusion, I don't know if it is really good for something like Weave to be so isolated from the host, and running a container with full host access does not make sense either...

I fully understand that deploying Weave as a container makes a lot of sense in a container world, but I also think that host packaging could solve some of Weave's current problems in a more robust way. Packaging is not so difficult these days with tools like fpm (even for multiple distributions), and I think it would provide much better version and dependency control.

rade commented 9 years ago

the mDNS feature runs fine as long as I launch the weaver with --net=host, but I don't really like this solution: this kind of exception just for the mDNS case does not seem very elegant...

How about running the mDNS discovery as a separate container, with --net=host? That container could then tell the weave router what to connect to, via the HTTP call that is underneath weave connect.
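
A minimal sketch of that split: a discovery process running with --net=host that, whenever mDNS turns up a new address, asks the local router to connect to it over its HTTP interface on port 6784. The /connect endpoint and the "peer" form parameter below follow what the weave script does for weave connect, but treat them as assumptions and check the script for the exact call.

// Sketch: tell the local weave router about a newly discovered peer.
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
)

// tellRouterToConnect posts a peer address to the router's local HTTP API,
// i.e. the call that sits underneath "weave connect <address>" (endpoint
// and parameter names assumed, as noted above).
func tellRouterToConnect(peer string) error {
	resp, err := http.PostForm("http://127.0.0.1:6784/connect",
		url.Values{"peer": {peer}})
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("router returned %s", resp.Status)
	}
	return nil
}

func main() {
	// In a real discovery container this would be driven by the mDNS
	// query loop sketched earlier in this thread.
	if err := tellRouterToConnect("192.168.121.161"); err != nil {
		log.Fatal(err)
	}
}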

In my opinion, the weave script is keeping too much logic

Thanks for reminding me to file an issue for that.

inercia commented 9 years ago

Adding a new container could solve the problem, but I'm wondering about some problems that could arise, like:

  - who is going to be responsible for that container? who will keep it up and running?
  - if some higher level software is responsible for controlling containers, could it be a problem to have so many weaver-initiated containers?
  - how do we keep compatibility consistency between this new container and the weaver container?

rade commented 9 years ago

who is going to be responsible for that container? who will keep it up and running?

Same answer as for the weave and weavedns containers :)

if some higher level software is responsible for controlling containers, could it be a problem to have so many weaver-initiated containers?

Don't think so. Nobody has reported problems with weave+weavedns in this regard so far. More generally, weave is definitely going to grow, and having multiple containers with distinct responsibilities is preferable to having a single uber container. So whatever problems might be encountered in such a configuration will just have to be fixed.

how do we keep compatibility consistency between this new container and the weaver container?

As of #306 the script ensures that the images have the same version as itself.

rade commented 9 years ago

I've tried several things and the mDNS feature runs fine as long as I launch the weaver with --net=host, but I don't really like this solution: this kind of exception just for the mDNS case does not seem very elegant...

FDP (#1438) has to start weave with --net=host, so the above is no longer an issue.

I wonder whether port scanning would work on AWS/GCE :) Though apparently that is against the T&Cs.

One other thing to consider: --init-peer-count... would be lovely if a user didn't have to specify that, but let's not inflate this issue.

inercia commented 9 years ago

FDP (#1438) has to start weave with --net=host, so the above is no longer an issue.

Running things with --net=host would make things much easier.

I wonder whether port scanning would work on AWS/GCE :) Though apparently that is against the T&Cs.

I'm not sure this is the right path to follow now that we have Discovery. I think that peer discovery with the help of an external entity (i.e. the Swarm token thing) provides a more general solution. It is a pity Docker has not taken their token idea a step further and made it a NAT helper...

One other thing to consider: --init-peer-count... would be lovely if a user didn't have to specify that, but let's not inflate this issue.

If it could be changed dynamically...