weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

active/passive failover redundancy in weaver #630

Open abligh opened 9 years ago

abligh commented 9 years ago

I have a situation where I am using weaver to bridge a weave network onto a real VLAN, and I want to do so redundantly. I already have two 'bastion hosts' which negotiate master/slave configuration. The network looks like this (with apologies for markdown graphics):

peerW.....peerX.....peerY.....peerZ
 :       :             :       :
  :     :               :     :   full
   :   :                 :   :    mesh
    : :                   : :
    peerA................peerB
       |                  |
    ---+----+--------+----+--- VLAN
            |        |
        Host1        Host2

(For simplicity and to avoid markdown madness, I have not shown all links within the full mesh of peers)

The idea here is that the addresses of peerA and peerB are both given to peers W, X, Y and Z. Together these six peers form a weave network. peerA and peerB, however, are attached to another network (let's say, for the sake of argument, a VLAN) on which Host1 and Host2 are also located. The idea is to allow L2 connectivity between containers on W, X, Y and Z and the physical hosts Host1 and Host2, using a redundant pair of gateways A and B (hosting peerA and peerB).

The problem with this setup is as follows. If a broadcast packet egresses from (e.g.) W, it will be transmitted (directly or indirectly) to peerA, which will emit the packet towards the VLAN so it can reach Host1 and Host2. peerB will also receive the packet, and transmit the packet on, inter alia to peerA, which will repeat the process, causing a packet loop (remember this is L2, so no hop counts). Similarly, peerB will loop packets in the other direction. Removing the peering between peerA and peerB does not help as Weave will simply transmit the looping packets via X, Y or Z, as weave does not require a full mesh. Similarly, a broadcast packet received by peerA from Host1 will be transmitted inter alia to peerB, where it will be retransmitted onto the VLAN, for it to be received again by peerA; again, removing the direct peering does not help.

As peerA and peerB already negotiate a master/slave relationship between them, one possibility is to run weaver only on the master (i.e. on peerA or peerB, but not both). Whilst this solves the packet-loop problem, it is a poor solution in terms of failover. The most significant issue is time to failover. Assume peerA is the master and peerB the slave, and peerA fails, so peerB is elected master. Peers W, X, Y and Z will have held peerB's peer data for a long while and failed to contact it (as it was not running weaver whilst a slave), so peerB is unlikely to be contacted by W, X, Y or Z for a relatively long period. peerB may of course initiate contact, but if W, X, Y and Z are behind a NAT (as is likely in my scenario), such contact will fail; W, X, Y or Z must initiate contact to peerB, which may take several minutes (if I've understood how timeouts work). Once a peering is established, weave needs to update its internal topology, which may also take time. A second disadvantage is that in the normal condition (where peerA is master), peerB appears dead to W, X, Y and Z, so there is no way to know whether failover will work unless and until peerA actually dies.

A better option would be to run the slave peer (peerB) here in a slave mode. In slave mode, the weaver process would listen on the pcap interface, but discard incoming packets (on the assumption the master would handle them). It would thus learn nothing through the incoming pcap interface, and it would transmit nothing (not even unlearnt traffic, e.g. broadcasts or unknown MACs). In effect, the pcap interface would be 'switched off'. When transitioning to master, the pcap interface would be switched on. When transitioning to slave, the pcap interface would be switched off again, and (ideally) all the learnt data in weave's equivalent of a distributed CAM table would be 'forgotten'. The master/slave status could be initiated through a command-line option and changed in real-time through the JSON interface.

If this is a good idea, I am happy to code this up. The pcap bit seems easy enough. I'm not, however, sure how I might go about persuading weave to 'forget' where MAC addresses are on the transition from slave to master.
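As a language-neutral sketch of the proposal (Python, purely illustrative — weave itself is written in Go, and none of these names exist in its codebase), slave mode amounts to a small state machine around the pcap path:

```python
# Hypothetical sketch of the proposed master/slave gate around the pcap
# interface. In "slave" state the capture path neither learns nor forwards;
# promotion switches it on; demotion switches it off again and forgets the
# learnt MAC data (weave's CAM-table analogue).

class CaptureGate:
    def __init__(self, role="slave"):
        self.role = role          # "master" or "slave"
        self.mac_table = {}       # MAC address -> port, learnt via pcap

    def promote(self):
        """Slave -> master: switch the pcap interface 'on'."""
        self.role = "master"

    def demote(self):
        """Master -> slave: switch the pcap interface 'off' and forget."""
        self.role = "slave"
        self.mac_table.clear()

    def on_capture(self, src_mac, port):
        """Return True if the packet should be learnt from and forwarded."""
        if self.role != "master":
            return False          # discard: the master handles this packet
        self.mac_table[src_mac] = port
        return True
```

In a real implementation the role change would presumably arrive via a command-line option at startup and via the HTTP/JSON interface at runtime, as suggested above.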

rade commented 9 years ago

If a broadcast packet egresses from (e.g.) W, it will be transmitted (directly or indirectly) to peerA, which will emit the packet towards the VLAN so it can reach Host1 and Host2. peerB will also receive the packet, and transmit the packet on, inter alia to peerA.

Correction: The packet will reach peerA only once; weave's broadcast routing logic ensures that (except when the topology is in flux). However both peerA and peerB will inject the packet, which is what is causing the duplication you are seeing.
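The "reaches peerA only once" property can be illustrated with a toy flood over a spanning tree of the peer mesh (Python, illustrative only — weave's actual broadcast routing is in Go and considerably more sophisticated):

```python
# Illustrative: flooding along a spanning tree of the peer mesh delivers a
# broadcast to every peer exactly once (when the topology is stable), which
# is the invariant described above.

def broadcast(adjacency, origin):
    """Flood from origin over a spanning tree; return per-peer receive counts."""
    received = {origin: 1}
    frontier = [origin]
    while frontier:
        nxt = []
        for peer in frontier:
            for neighbour in adjacency[peer]:
                if neighbour not in received:   # tree edge: deliver once
                    received[neighbour] = 1
                    nxt.append(neighbour)
        frontier = nxt
    return received

# The topology from the issue (links within the W/X/Y/Z mesh omitted, as in
# the original diagram):
mesh = {
    "W": ["A", "B"], "X": ["A", "B"], "Y": ["A", "B"], "Z": ["A", "B"],
    "A": ["W", "X", "Y", "Z", "B"], "B": ["W", "X", "Y", "Z", "A"],
}
```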

slave mode [...] the pcap interface would be 'switched off'

...for both capture and inject. That is already possible, by starting the router with a blank -iface parameter. And it would be quite straightforward to start the capture/inject later. What's missing is a) a way to disable a running capture/inject (including clearing the MAC cache), and b) hooks to do all this dynamically via the http api.

Neither of which would be hard, though there are some challenges, e.g. packet capture is a blocking call (I suppose there is no harm in performing some check just after that, but all this is in the critical path performance wise, so we want to do minimal locking and channel interaction).

As to whether it's a good idea overall... it does strike me as rather a niche feature which, unlike the rest of weave, requires some additional coordinator / health-checker to be of any use.

abligh commented 9 years ago

Correction: The packet will reach peerA only once; weave's broadcast routing logic ensures that (except when the topology is in flux). However both peerA and peerB will inject the packet, which is what is causing the duplication you are seeing.

Weave's logic is not the issue. The packet reaches peerA first from W, then peerA sends it to the VLAN. peerB receives it from the VLAN, treats it as a completely new packet ingressing weave's network, and sends it, inter alia, to peerA again; hence there is a loop (not merely duplication). Weave's logic is not to blame here, as it doesn't expect one of its peers to be connected to another peer other than via weave. When the packet re-ingresses through peerB (from the VLAN), peerB has no way to distinguish it from any other packet ingressing from 'outside'; i.e. as far as weave is concerned, it's a second broadcast.
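The loop can be shown with a toy model (hypothetical, not weave code): a gateway that sees a broadcast on one side re-injects it on the other, and neither gateway can tell a re-ingressing copy from a fresh packet.

```python
# Toy model of the loop: two gateway peers both bridge the weave network
# onto the same VLAN. A single broadcast keeps bouncing between the two
# sides until the round cap stops the simulation; with one active gateway
# it is injected exactly once.

def count_injections(active_gateways, max_rounds=10):
    """Count VLAN injections of one broadcast; cap rounds to model the loop."""
    injections = 0
    on_mesh = True            # the broadcast starts inside the weave network
    on_vlan = False
    for _ in range(max_rounds):
        if on_mesh:           # every active gateway injects onto the VLAN
            injections += len(active_gateways)
            on_vlan, on_mesh = True, False
        elif on_vlan and len(active_gateways) > 1:
            # the *other* gateway captures it and re-ingresses it as a
            # brand-new broadcast
            on_mesh, on_vlan = True, False
        else:
            break             # one gateway: nothing captures its own injection
    return injections
```

With both peerA and peerB active, only the round cap terminates the model; with a single active gateway the broadcast is injected once and stops.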

...for both capture and inject. That is already possible, by starting the router with a blank -iface parameter. And it would be quite straightforward to start the capture/inject later. What's missing is a) a way to disable a running capture/inject (including clearing the MAC cache), and b) hooks to do all this dynamically via the http api.

Neither of which would be hard, though there are some challenges, e.g. packet capture is a blocking call (I suppose there is no harm in performing some check just after that, but all this is in the critical path performance wise, so we want to do minimal locking and channel interaction).

As to whether it's a good idea overall... it does strike me as rather a niche feature which, unlike the rest of weave, requires some additional coordinator / health-checker to be of any use.

Thanks. I'll think about that.

It gives me three further ideas:

1. Restart weaver when going from master to slave or vice versa; the topology would have to be rebuilt.
2. Leave both copies of weaver running normally but attempt to use some ebtables magic to cut off connectivity on the slave. Issues: MAC cache is not cleared; experimentation required to determine whether ebtables can be persuaded to block bidirectionally.
3. An off-the-top-of-my-head simple idea for an STP-like protocol (called ALONE - Avoid Loops On Networked Ethernet - as I can't immediately think of anything better).

Of course if there is already some shortest path algorithm that could be piggybacked, you could simply assume (if the protocol is switched on) an adjacency between every pcap interface in the system until something like the above proves otherwise.

rade commented 9 years ago

there is a loop (not merely duplication)

Got it. There's duplication because both peers inject onto the VLAN, and there is a loop because both peers will capture the packets the other injected.

restart weaver when going from master to slave or vice versa. [...] the topology would have to be rebuilt.

Right, and for the slave->master transition the restart, and associated topology recalculation, happens at the worst possible time.

Btw, I suspect that clearing weave's MAC cache when a peer is passivated isn't enough. Other peers will have MAC cache entries referencing that peer. Bouncing the peer would help here, since that will clear out those entries in most cases, especially in a fully connected mesh.

STP

That's a lot of work.

abligh commented 9 years ago

Btw, I suspect that clearing weave's MAC cache when a peer is passivated isn't enough.

Yeah I thought that might be the case. I sort of need to flood a 'forget this MAC' packet.

STP / STP-a-like

That's a lot of work.

STP-a-like - yes indeed. But not as much as doing the 'real' standards compliant fix. The 'real' fix would be treating the entire weave network as a bridge (i.e. a collection of half bridges), and running STP, then realising you need RSTP, deciding that didn't work because VLANs, then doing PVST/PVST+, throwing your hands up in horror as to the vendor interoperability and scaling issues, deciding TRILL is the solution, then finding the IS-IS component of that alone is 10 times the size of weave, and going back to a proprietary protocol. YMMV.

rade commented 9 years ago

I sort of need to flood a 'forget this MAC' packet.

Well, as I alluded to, peers do clear out entries from their MAC caches that refer to peers which are no longer part of the network. That's why bouncing a peer on the active->passive transition would work.

Perhaps this is less of an issue than I thought though. Given that what you are attempting to achieve here is redundancy, the active->passive transition would actually never occur. You'd have a peer failure, which does cause the MAC caches to clear as described above.
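The cache-clearing behaviour described above can be sketched as follows (hypothetical Python, for illustration only; weave's real MAC cache is a Go structure):

```python
# Hypothetical sketch: each peer's MAC cache maps MAC -> owning peer, and
# entries are dropped when their peer leaves the network. This is why a
# gateway peer failing (or being bounced) also flushes the stale MAC
# locations cached on the other peers.

class MacCache:
    def __init__(self):
        self.entries = {}                 # MAC address -> peer name

    def learn(self, mac, peer):
        self.entries[mac] = peer

    def forget_peer(self, departed):
        """Drop every entry referencing a peer no longer in the network."""
        self.entries = {m: p for m, p in self.entries.items()
                        if p != departed}
```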

rade commented 9 years ago

@abligh in your example, how are peerA and peerB connected to the VLAN? Could you have just one of them connected at a time?

abligh commented 9 years ago

@rade typically they'd be listening on the VLAN device or (more usefully here) a veth which shares a bridge with the VLAN device. Selectively disconnecting the slave from the VLAN is what I meant by:

Leave both copies of weaver running normally but attempt to use some ebtables magic to cut off connectivity on the slave. Issues: MAC cache is not cleared; experimentation required to determine whether ebtables can be persuaded to block bidirectionally.

Apologies for being cryptic.