weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0

delayed acceptance of remote entries due to unfortunate gossip ordering #2013

Open rade opened 8 years ago

rade commented 8 years ago

I've just come across a case where a peer (re)joining a network of other peers in a chain topology (enforced through --no-discover) took ~1 minute to accept the DNS entries from the farthest peer. Specifically, 'C' joined 'A <--> B', forming 'A <--> B <--> C', and it was A's entries that C was slow to accept.

This is odd, since when a new peer connects we send it all gossip. So B should have sent everything to C and vice versa. And the logs indicate that this did indeed happen. So why were A's DNS entries missing? Then I found this in Nameserver.receiveGossip:

    gossip.Entries.filter(func(e *Entry) bool {
        return n.isKnownPeer(e.Origin)
    })

This removes all entries from received gossip for which the peer is unknown to us.

Why would A be unknown to C? Because the "sending all gossip" step sends the gossip for all channels in a random order. So if the DNS gossip arrives before the topology gossip, then C won't have learnt about A yet when it encounters A's DNS entries, and the filter drops them. Well, that's my theory anyway.
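To make the suspected ordering problem concrete, here is a minimal standalone sketch. It is not Weave's code: `Node`, `ReceiveTopology`, `ReceiveDNS` and `Has` are hypothetical names, and the only behaviour borrowed from the snippet above is "drop entries whose origin is not a known peer".

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for a DNS entry; only Origin matters here.
type Entry struct {
	Origin   string // name of the peer that created the entry
	Hostname string
}

// Node is a hypothetical model of one peer's state, not Weave's Nameserver.
type Node struct {
	knownPeers map[string]bool
	entries    []Entry
}

// NewNode creates a node that knows only about itself.
func NewNode(self string) *Node {
	return &Node{knownPeers: map[string]bool{self: true}}
}

// ReceiveTopology records peers learnt from topology gossip.
func (n *Node) ReceiveTopology(peers ...string) {
	for _, p := range peers {
		n.knownPeers[p] = true
	}
}

// ReceiveDNS keeps only entries whose origin is already known,
// mirroring the filter quoted from receiveGossip above.
func (n *Node) ReceiveDNS(gossip []Entry) {
	for _, e := range gossip {
		if n.knownPeers[e.Origin] {
			n.entries = append(n.entries, e)
		}
	}
}

// Has reports whether the node accepted an entry for hostname.
func (n *Node) Has(hostname string) bool {
	for _, e := range n.entries {
		if e.Hostname == hostname {
			return true
		}
	}
	return false
}

func main() {
	gossip := []Entry{
		{Origin: "A", Hostname: "a.weave.local"},
		{Origin: "B", Hostname: "b.weave.local"},
	}

	// DNS gossip arrives first: C only knows its direct neighbour B.
	dnsFirst := NewNode("C")
	dnsFirst.ReceiveTopology("B")
	dnsFirst.ReceiveDNS(gossip)
	dnsFirst.ReceiveTopology("A") // too late, A's entry is already gone
	fmt.Println("DNS first, has A's entry:", dnsFirst.Has("a.weave.local")) // false

	// Topology gossip arrives first: A is known by the time DNS arrives.
	topoFirst := NewNode("C")
	topoFirst.ReceiveTopology("A", "B")
	topoFirst.ReceiveDNS(gossip)
	fmt.Println("Topology first, has A's entry:", topoFirst.Has("a.weave.local")) // true
}
```

In this model the dropped entry is never recovered on its own; C only gets it once the DNS gossip is delivered again after the topology gossip, which would fit the ~1 minute delay if something like a periodic re-gossip is what eventually fixes it (an assumption, not verified here).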

rade commented 8 years ago

I suspect the problem also exists in the other direction, i.e. A might receive DNS gossip containing entries for C before it has received topology gossip telling it about the existence of C.

So simply changing the ordering in "sending all gossip" to make topology come first won't solve the problem, since the gossip about C that A receives is the result of forwarding by B, not of "sending all gossip".
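A rough sketch of why the reordering only helps one side, under the same assumptions as the theory above. Everything here is hypothetical (`Message`, `Peer`, `Deliver`, `TopologyFirst`); it models the initial sync as a batch that can be sorted, and forwarded gossip as independent messages whose arrival order nothing controls.

```go
package main

import (
	"fmt"
	"sort"
)

// Message is a hypothetical gossip message on one of two channels.
type Message struct {
	Channel string   // "topology" or "dns"
	Peers   []string // peers announced (topology channel only)
	Origins []string // origins of the DNS entries carried (dns channel only)
}

// Peer is a hypothetical model of a receiving peer.
type Peer struct {
	known    map[string]bool // peers learnt so far
	accepted map[string]bool // origins whose DNS entries were accepted
}

// NewPeer creates a peer that already knows the given peer names.
func NewPeer(known ...string) *Peer {
	p := &Peer{known: map[string]bool{}, accepted: map[string]bool{}}
	for _, name := range known {
		p.known[name] = true
	}
	return p
}

// Deliver applies messages in the order given, dropping DNS entries
// whose origin is not yet known.
func (p *Peer) Deliver(msgs []Message) {
	for _, m := range msgs {
		switch m.Channel {
		case "topology":
			for _, name := range m.Peers {
				p.known[name] = true
			}
		case "dns":
			for _, origin := range m.Origins {
				if p.known[origin] {
					p.accepted[origin] = true
				}
			}
		}
	}
}

// TopologyFirst returns a copy of msgs with topology messages moved to
// the front. It models the "make topology come first" change.
func TopologyFirst(msgs []Message) []Message {
	out := append([]Message(nil), msgs...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Channel == "topology" && out[j].Channel != "topology"
	})
	return out
}

func main() {
	// Initial sync from B to C: one batch, so it can be reordered.
	c := NewPeer("C", "B")
	c.Deliver(TopologyFirst([]Message{
		{Channel: "dns", Origins: []string{"A", "B"}},
		{Channel: "topology", Peers: []string{"A", "B"}},
	}))
	fmt.Println("C accepted A's entries:", c.accepted["A"]) // true

	// Gossip about C forwarded by B to A: separate messages, not part
	// of any batch, here arriving DNS-first.
	a := NewPeer("A", "B")
	a.Deliver([]Message{
		{Channel: "dns", Origins: []string{"C"}},
		{Channel: "topology", Peers: []string{"C"}},
	})
	fmt.Println("A accepted C's entries:", a.accepted["C"]) // false
}
```

So in this model C is fine after the reordering, but A still drops C's entries whenever the forwarded DNS gossip wins the race.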