weaveworks / mesh

A tool for building distributed applications.
Apache License 2.0
880 stars 107 forks source link

Deal better with restarted peers #29

Closed bboreham closed 8 years ago

bboreham commented 8 years ago

Improves the behaviour reported at #28.

rade commented 8 years ago

I reckon the isConnectedPeer should be lifted out, i.e. the flag passed to doAddConnection should simply be an indication as to whether this is a "new" peer, where "new" includes "restarted".

That indicator should be calculated by registerRemote: a peer is "new" if a) it is not present in the topology, or b) is present and has a different UID, or c) is present and has no connections.

rade commented 8 years ago

c) is present and has no connections.

We need this is because we may, temporarily, have multiple connections in progress to the same peer, which will keep it in the topology even though it may not have any (established) connections.

Looking at this again, perhaps we should leave the code in this PR as is. After all, it does address the problem, and is quite straightforward.

bboreham commented 8 years ago

I have rebased and made the change suggested at https://github.com/weaveworks/mesh/pull/29#issuecomment-199783229.

A couple of aspects are now potentially hard to follow: the computation of isJustAddedPeer which relies on the behaviour of fetchWithDefault, and the absence of locking round len(conn.remote.connections), which (I think) relies on remote not being the local peer and connections never being modified after creation on a remote peer.

CircleCI tests have failed, which seems to be in the etcd code and unrelated.

rade commented 8 years ago

As discussed, the "has no connections" tests is probably too narrow. When two clusters come together, a 2nd in-progress connection between them may well find the other peer in the topology, with lots of connections to peers on its side of the cluster. We'd still need to send it the full gossip, so it can learn about our side of the cluster.

The symmetric unicast route check of the previous version is a conservative approximation of whether another peer is part of our cluster, which is kinda what we want here.

So I suggest we ditch the last commit.

bboreham commented 8 years ago

I removed the last commit, so now we are back to checking the unicast topology.

rade commented 8 years ago

LGTM. @peterbourgon what's up with the CI failures?

peterbourgon commented 8 years ago

@rade It's the vendoring nonsense. Needs a rebase on master to get this.

Oops, it was the Circle build cache. How annoying.