rade opened this issue 9 years ago
I'm afraid this will not work on AWS or GCE. But it seems that some other providers (like RackSpace) do support multicast. I think you can expect corporate data centres to support multicast...
I think this feature would be relatively cheap to implement as Weave already has an mDNS client...
Hmm. That makes it significantly less attractive.
I suppose this could work in a hybrid way, i.e. weave would connect to any address specified on the command line, as currently, but also attempt to find peers via zeroconf. That way when, say, I have a weave network that spans multiple data centres and AWS, zeroconf is used for discovery inside the data centres, but the existing mechanism is used inside AWS and across data centres.
Yes, we could use a hybrid solution. Weave could be started with something like weave launch --use-mdns 10.0.1.1. Weave could then start an mDNS client and look for something like "weave.local" (or some other name; I don't know if this would conflict with WeaveDNS...). But once a peer has been found, the Weave node should have a way of knowing whether the newly discovered peer is working on the same virtual switch... Maybe all the peers could share a UUID or something...
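The lookup being proposed can be sketched roughly like this. This is a minimal illustration in Python, not Weave's actual client; the "weave.local" rendezvous name is the hypothetical one from the discussion. Only the packet construction is exercised here (the send function is defined but not invoked, to avoid network side effects):

```python
# Sketch: build an mDNS A-record query for a hypothetical rendezvous name.
import socket
import struct

MDNS_GROUP = "224.0.0.251"  # link-local multicast group used by mDNS
MDNS_PORT = 5353

def build_mdns_query(name: str) -> bytes:
    """Encode a single-question DNS query (QTYPE=A, QCLASS=IN, id=0 for mDNS)."""
    # Header: id=0, flags=0, 1 question, 0 answer/authority/additional records.
    header = struct.pack("!HHHHHH", 0, 0, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def send_query(name: str) -> None:
    """Multicast the query with TTL=1 so it stays on the local link."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(build_mdns_query(name), (MDNS_GROUP, MDNS_PORT))
    sock.close()

packet = build_mdns_query("weave.local")
```

A discovery loop would then listen on the same group for A-record responses and treat each answered address as a candidate peer.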
I think this is related. A requirement I have is to incrementally add nodes to a weave network. So initially the network might consist of a few nodes in AWS region-1 and gradually we would add nodes in AWS region-2 and expect all the nodes to be able to talk to each other.
I'm ok with initially issuing commands whenever each node is added so that existing nodes are aware of a newly added node.
But there needs to be a way to dynamically add nodes to an existing network (I couldn't figure out from the documentation whether this was already supported).
there needs to be a way to dynamically add nodes to an existing network
There is. It's fundamental to how weave works.
I couldn't figure out from the documentation whether this was already supported
http://zettio.github.io/weave/features.html#dynamic-topologies
I'm looking at the mDNS code in /nameserver, and I have found that the client only waits for the first response, making it difficult to register several IPs for the same name (i.e., "somegroup.weave.local."), especially since the first response is usually from the local node... Any idea on how this could be solved? Maybe the resolver could exhaust the timeout waiting for responses...
See #225 and #226.
I agree that that kind of DNS resolution would be a good thing as a general case, but I don't think it would be necessary for this kind of local peer discovery, because latency is not a problem. We could have a background goroutine that periodically queries for peers with a given DNS name. For example, if we try to get all the A-records for "weave.local" every 5 seconds, we can do a timed, blocking resolution and wait up to 5 seconds for answers...
Yes, for local peer discovery we can't only accept the first answer, since then we'd risk constructing multiple disconnected networks.
I don't see why you'd do a timed, blocking resolution though. Surely we'd be happy to accept any answer, whenever it came in, and add it to the list of endpoints we know about. We do need to work out how frequently to ask, though, in order to cope with the question (or answers) getting lost.
I mean that this mDNS peer discoverer should send multicast queries and block (not in the strict sense) waiting for responses for some time X, then send another query and wait again, and so on...
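The collect-until-deadline behaviour being described could look like this. A Python sketch for illustration only, with a Queue standing in for the mDNS socket so the timing logic is self-contained; nothing here is Weave code:

```python
# Sketch: accept every answer that arrives within a window, not just the first.
import queue
import time

def collect_answers(responses, window):
    """Drain answers until the deadline expires; duplicates collapse in the set."""
    deadline = time.monotonic() + window
    found = set()
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            found.add(responses.get(timeout=remaining))
        except queue.Empty:
            break
    return found

# Simulated run: two answers are already "in flight" when collection starts.
answers = queue.Queue()
for addr in ("10.0.1.1", "10.0.1.2"):
    answers.put(addr)
peers = collect_answers(answers, window=0.2)  # both answers are accepted
```

A background thread (or goroutine, in Weave's case) would call this every few seconds, sending a fresh query before each collection window and handing any new addresses to the connection logic.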
Maybe the mdns_client could be modified for this kind of query, so that the in-flight entry is not cleaned up and the responses channel is not closed after receiving new mDNS responses... A new flag to SendQuery could be added (or maybe a new function, something like PersistentQuery), and users would explicitly call PersistentQueryCancel when they are done...
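A rough model of that proposed lifecycle, in Python purely for illustration. PersistentQuery and PersistentQueryCancel are the names suggested above, not existing functions; the point is that the response channel stays open, delivering every answer, until the caller cancels:

```python
# Sketch: a query whose response channel stays open until explicitly cancelled.
import queue

class PersistentQuery:
    def __init__(self, name):
        self.name = name
        self.responses = queue.Queue()
        self.active = True

    def deliver(self, answer):
        # Called by the mDNS listener for each matching response;
        # the in-flight entry is NOT cleaned up after the first one.
        if self.active:
            self.responses.put(answer)

    def cancel(self):
        # The PersistentQueryCancel role: stop accepting answers
        # and push a sentinel so consumers know the channel is done.
        self.active = False
        self.responses.put(None)

pq = PersistentQuery("testgroup.weave.local.")
pq.deliver("192.168.121.235")
pq.deliver("192.168.121.161")
pq.cancel()
pq.deliver("10.0.0.9")  # ignored: arrives after cancellation
```

In Go this would naturally map to a channel that the client closes on cancel, rather than a sentinel value.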
I'm stuck with the iptables stuff for multicast traffic between containers on different hosts, maybe related to this issue. The forward path works fine (i.e., sending mDNS queries), but responses never come back (i.e., DNS A records). I will continue investigating this...
I often find this is caused by Reverse Path Filtering, i.e. if Linux doesn't think packets from that source should be reaching you, then it will throw them away.
Examples include: the send and receive addresses being the same, or packets coming from a subnet which doesn't route to your subnet.
You can turn off reverse path filtering to see if this is the case, which may lead to a deeper understanding of the real problem. (Generally we want Weaveworks software to work with Linux defaults.)
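For a quick test, reverse path filtering can be relaxed via sysctl settings along these lines (the interface name here is just illustrative, and this is for diagnosis only, not a recommended running configuration):

```
# Temporary, diagnosis only: disable reverse path filtering
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.eth0.rp_filter = 0
```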
I have tried disabling RP filtering as you suggested, @bboreham, but it does not seem to help. I have also tried to set up the simplest scenario with the help of Avahi's mDNS reflector (basically, an mDNS proxy) but, even with this, it does not work. My testing setup is this:
weave1 container (172.17.51.2)
|
[docker bridge0] (172.17.51.1)
|
[host1 eth0] (192.168.121.235)
-----------------------------------------
|
-----------------------------------------
[host2 eth0] (192.168.121.161)
|
[docker bridge0]
|
weave2 container
Avahi's reflector should forward mDNS queries/responses between bridge0 and eth0.
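For reference, the reflector part of this setup boils down to a couple of lines in avahi-daemon.conf (the interface list matches the diagram above; an otherwise stock configuration is assumed):

```
[server]
allow-interfaces=eth0,bridge0

[reflector]
enable-reflector=yes
```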
Traffic at [bridge0 weave1] shows the local queries and responses from the local Weave daemon. No queries or responses from other Weaves are shown:
$ sudo tcpdump -i bridge0 -n "multicast and port mdns"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bridge0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:55:48.638939 IP 172.17.51.1.36227 > 224.0.0.251.5353: 672 A (QM)? testgroup.weave.local. (39)
16:55:48.639157 IP 172.17.51.2.49429 > 224.0.0.251.5353: 672 2/0/0 A 192.168.121.235, A 192.168.40.11 (113)
16:55:51.272786 IP 172.17.51.2.49877 > 224.0.0.251.5353: 48766 A (QM)? testgroup.weave.local. (39)
16:55:51.273046 IP 172.17.51.2.49429 > 224.0.0.251.5353: 48766 2/0/0 A 192.168.121.235, A 192.168.40.11 (113)
16:55:53.639096 IP 172.17.51.1.36227 > 224.0.0.251.5353: 674 A (QM)? testgroup.weave.local. (39)
16:55:53.639263 IP 172.17.51.2.49429 > 224.0.0.251.5353: 674 2/0/0 A 192.168.121.235, A 192.168.40.11 (113)
There is something strange here: queries sometimes come from 51.1 and other times from 51.2...
Capturing at [host1 eth0] we can see the queries from both Weaves:
$ sudo tcpdump -i eth0 -n "multicast and port mdns"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
17:13:41.311927 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1101 A (QM)? testgroup.weave.local. (39)
17:13:43.679386 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1092 A (QM)? testgroup.weave.local. (39)
17:13:46.312138 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1103 A (QM)? testgroup.weave.local. (39)
17:13:48.679592 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1094 A (QM)? testgroup.weave.local. (39)
17:13:51.312333 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1105 A (QM)? testgroup.weave.local. (39)
17:13:53.679789 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1096 A (QM)? testgroup.weave.local. (39)
17:13:56.312535 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1107 A (QM)? testgroup.weave.local. (39)
17:13:58.679991 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1098 A (QM)? testgroup.weave.local. (39)
17:14:01.312740 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1109 A (QM)? testgroup.weave.local. (39)
17:14:03.680210 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1100 A (QM)? testgroup.weave.local. (39)
17:14:06.312941 IP 192.168.121.235.36227 > 224.0.0.251.5353: 1111 A (QM)? testgroup.weave.local. (39)
17:14:08.680429 IP 192.168.121.161.44304 > 224.0.0.251.5353: 1102 A (QM)? testgroup.weave.local. (39)
I expected to see the queries from weave1, and seeing the queries from weave2 is a good sign too, but it is strange not to see replies, at least from weave1. Why are queries being forwarded to eth0 while replies are not? I've been playing with iptables with no luck (yet)...
And capturing on the link between host1 and host2 we can see the same: queries from both Weaves.
$ sudo tcpdump -i vnet0 "multicast and port mdns"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 262144 bytes
18:16:43.688113 IP 192.168.121.161.44304 > 224.0.0.251.mdns: 1164 A (QM)? testgroup.weave.local. (39)
18:16:46.321314 IP 192.168.121.235.36227 > 224.0.0.251.mdns: 1175 A (QM)? testgroup.weave.local. (39)
18:16:48.688234 IP 192.168.121.161.44304 > 224.0.0.251.mdns: 1166 A (QM)? testgroup.weave.local. (39)
18:16:51.321518 IP 192.168.121.235.36227 > 224.0.0.251.mdns: 1177 A (QM)? testgroup.weave.local. (39)
18:16:53.688438 IP 192.168.121.161.44304 > 224.0.0.251.mdns: 1168 A (QM)? testgroup.weave.local. (39)
So I'm puzzled by this scenario, where some multicast packets (the queries) are forwarded while others (the replies) are not... I will continue with this investigation.
Can you post the output of iptables-save or equivalent (multiple iptables -L) on host1 and host2 please? It may give some extra hints.
The iptables dump at [host1] is:
# Generated by iptables-save v1.4.21 on Mon Dec 22 17:09:55 2014
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.51.0/24 ! -o bridge0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A DOCKER ! -i bridge0 -p tcp -m tcp --dport 6783 -j DNAT --to-destination 172.17.51.2:6783
-A DOCKER ! -i bridge0 -p udp -m udp --dport 6783 -j DNAT --to-destination 172.17.51.2:6783
COMMIT
# Completed on Mon Dec 22 17:09:55 2014
# Generated by iptables-save v1.4.21 on Mon Dec 22 17:09:55 2014
*filter
:INPUT ACCEPT [98:6342]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [66:6118]
-A FORWARD -d 172.17.51.2/32 ! -i bridge0 -o bridge0 -p udp -m udp --dport 6783 -j ACCEPT
-A FORWARD -d 172.17.51.2/32 ! -i bridge0 -o bridge0 -p tcp -m tcp --dport 6783 -j ACCEPT
-A FORWARD -o bridge0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i bridge0 ! -o bridge0 -j ACCEPT
-A FORWARD -i bridge0 -o bridge0 -j ACCEPT
-A FORWARD -i weave -o weave -j ACCEPT
COMMIT
# Completed on Mon Dec 22 17:09:55 2014
Very similar to [host2]:
# Generated by iptables-save v1.4.21 on Mon Dec 22 17:13:59 2014
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER - [0:0]
:WEAVE - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.52.0/24 ! -o bridge0 -j MASQUERADE
-A POSTROUTING -j WEAVE
-A DOCKER ! -i bridge0 -p tcp -m tcp --dport 6783 -j DNAT --to-destination 172.17.52.2:6783
-A DOCKER ! -i bridge0 -p udp -m udp --dport 6783 -j DNAT --to-destination 172.17.52.2:6783
COMMIT
# Completed on Mon Dec 22 17:13:59 2014
# Generated by iptables-save v1.4.21 on Mon Dec 22 17:13:59 2014
*filter
:INPUT ACCEPT [393:33463]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [163:12543]
-A FORWARD -d 172.17.52.2/32 ! -i bridge0 -o bridge0 -p udp -m udp --dport 6783 -j ACCEPT
-A FORWARD -d 172.17.52.2/32 ! -i bridge0 -o bridge0 -p tcp -m tcp --dport 6783 -j ACCEPT
-A FORWARD -o bridge0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -i bridge0 ! -o bridge0 -j ACCEPT
-A FORWARD -i bridge0 -o bridge0 -j ACCEPT
-A FORWARD -i weave -o weave -j ACCEPT
COMMIT
# Completed on Mon Dec 22 17:13:59 2014
By the way, is there any particular reason for running the weaver inside a container? Maybe it would be more useful to run it directly on the host: packaging is very simple anyway (with no dependencies), and I see little benefit from keeping it isolated from the environment; in fact, running on the host would probably make some things easier (like this)...
The router runs in a container for three reasons:
We could conceivably, and as a last resort, run the router with --net=host. Would that help?
Thanks for the iptables dumps, @inercia, but sadly I can't see anything in them that would cause your symptoms.
Our 'smoke tests' include a simple 2-host DNS check: https://github.com/zettio/weave/blob/master/test/200_dns_test.sh; maybe if you got that to run you could work backwards to why your test isn't working?
Hi guys,
I've tried several things and the mDNS feature runs fine as long as I launch the weaver with --net=host, but I don't really like this solution: this kind of exception just for the mDNS case does not seem very elegant...
The other alternative would be to add the appropriate iptables rules, but this would probably require duplicating multicast packets: they cannot simply be routed, due to the TTL=1 of mDNS packets, and they cannot simply be redirected (they also need to reach the host machine). So doing iptables engineering, by adding more and more rules to the weave script, does not seem an elegant solution either: too much complexity, difficult to maintain, etc...
And this was the original reason for asking about moving the weaver from a container to the host!
In my opinion, the weave script contains too much logic, and it will probably grow in the near future, so it will have to deal with multiple OSes, toolsets, commands and so on, and doing networking with a bash script is not a lot of fun...
I also think this model imposes a somewhat rigid flow in the system, as some setup is done by the script and, once the weaver takes control, I don't think the system can modify some things (I must admit I'm not an expert here). I can imagine changes in the routing table, devices that go up or down, etc., and I'm not sure if Weave could react to this kind of thing in its current form. In conclusion, I don't know if it is really good to be so isolated from the host for something like Weave, and running a container with full host access does not make sense either...
I fully understand that deploying Weave as a container makes a lot of sense in a container world, but I also think that host packaging could solve some of Weave's current problems in a more robust way. Packaging is not so difficult these days with tools like fpm (even for multiple distributions), and I think it would provide much better version and dependency control.
the mDNS feature runs fine as long as I launch the weaver with --net=host, but I don't really like this solution: this kind of exception just for the mDNS case does not seem very elegant...
How about running the mDNS discovery as a separate container, with --net=host? That container could then tell the weave router what to connect to, via the HTTP call that is underneath weave connect.
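The split being suggested could be sketched as follows. This is a hypothetical illustration in Python: the /connect endpoint and the "peer" form field are assumptions based on what weave connect does, not a documented contract, and a stub HTTP server stands in for the router so the sketch is self-contained:

```python
# Sketch: a discovery sidecar telling the router about a found peer over HTTP.
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def tell_router(base_url, peer):
    """POST a discovered peer address to the router's (assumed) /connect endpoint."""
    data = urllib.parse.urlencode({"peer": peer}).encode()
    req = urllib.request.Request(base_url + "/connect", data=data)  # POST
    with urllib.request.urlopen(req) as resp:
        return resp.status

# --- tiny stand-in for the router's HTTP API, so this runs end to end ---
received = {}

class StubRouter(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        received["path"] = self.path
        received["peer"] = urllib.parse.parse_qs(body.decode())["peer"][0]
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), StubRouter)
threading.Thread(target=server.serve_forever, daemon=True).start()

status = tell_router("http://127.0.0.1:%d" % server.server_port,
                     "192.168.121.161")
server.shutdown()
```

The sidecar would run with --net=host (so mDNS works), while the router container stays isolated and merely receives connect requests on its local API.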
In my opinion, the weave script contains too much logic
Thanks for reminding me to file an issue for that.
Adding a new container could solve the problem, but I'm wondering about some problems that could arise, like:
are same-version rules required? who is going to be responsible for that container? who will keep it up and running?
Same answer as for the weave and weavedns containers :)
if some higher level software is responsible for controlling containers, could it be a problem to have so many weaver-initiated containers?
Don't think so. Nobody has reported problems with weave+weavedns so far in this regard. More generally, weave is definitely going to grow, and having multiple containers with distinct responsibilities is preferable to having a single uber-container. So whatever problems might be encountered in such a configuration will just have to be fixed.
how do we keep compatibility consistency between this new container and the weaver container?
As of #306 the script ensures that the images have the same version as itself.
I've tried several things and the mDNS feature runs fine as long as I launch the weaver with --net=host, but I don't really like this solution: this kind of exception just for the mDNS case does not seem very elegant...
FDP (#1438) has to start weave with --net=host, so the above is no longer an issue.
I wonder whether port scanning would work on AWS/GCE :) Though apparently that is against the T&Cs.
One other thing to consider: --init-peer-count... it would be lovely if a user didn't have to specify that, but let's not inflate this issue.
FDP (#1438) has to start weave with --net=host, so the above is no longer an issue.
Running things with --net=host would make things much easier.
I wonder whether port scanning would work on AWS/GCE :) Though apparently that is against the T&Cs.
I'm not sure this is the right path to follow now that we have Discovery. I think that peer discovery with the help of an external entity (i.e., the Swarm token thing) provides a more general solution. It is a pity Docker has not taken their token idea a step further and made it a NAT helper...
One other thing to consider: --init-peer-count... it would be lovely if a user didn't have to specify that, but let's not inflate this issue.
If it could be changed dynamically...
At present, in order to establish a weave network, new peers need to be told about at least one other peer, or vice versa.
We could introduce a mode of operation where weave peers discover each other using, say, mDNS. Underlying networks permitting.
The main question is how broadly supported the various rendezvous technologies are. Will it work on AWS? GCE? Across availability zones? In typical corporate data centres?
Suggested by @inercia.