zhebrak / raftos

Asynchronous replication framework for distributed Python projects
MIT License
350 stars 58 forks

Consensus algorithm with leader-election only - possible danger #4

Open bakwc opened 8 years ago

bakwc commented 8 years ago

In case you only need consensus algorithm with leader election

There is an issue with using only leader election: there is no guarantee that there is only one leader at any given moment. Raft does not provide such a guarantee; it guarantees that there is only one leader per term, which is not the same thing. For example, when your cluster is split into several parts due to a network issue, the part containing the current leader will still think the leader has not changed, while another part can elect a new leader. The old leader will not be able to commit any transaction, but it will still have "leader" status. So I suggest either removing this paragraph from the readme or adding a warning there.

zhebrak commented 8 years ago

Thank you! Do you think there is a way to avoid a split leader in the election-only case? For example, the leader could check whether it still has a majority of followers at any given moment.

bakwc commented 8 years ago

Maybe. The first approach is to record the time of the last append-entries response from each follower, and if we don't have a majority of followers whose last response is newer than (currentTime - raftMinTimeout), fall back to the follower state. But what if a response packet from some dead node travels for a very long time (more than raftMinTimeout)? We need a unique identifier on every append-entries request so we can check that a response matches the latest request. I'm still not sure whether this is safe. BTW, I have a similar issue and am still thinking about the correct way to implement this.
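A minimal sketch of that lease-style check, using hypothetical names (`RAFT_MIN_TIMEOUT`, `LeaderLeaseCheck`) rather than anything in raftos:

```python
import time
import uuid

RAFT_MIN_TIMEOUT = 0.4  # hypothetical value, in seconds


class LeaderLeaseCheck:
    """Track the latest append-entries round so stale responses are ignored."""

    def __init__(self, follower_ids):
        self.follower_ids = list(follower_ids)
        self.current_request_id = None
        self.last_valid_response = {f: 0.0 for f in self.follower_ids}

    def start_append_entries_round(self):
        # A unique id lets us drop responses belonging to an older request,
        # e.g. one that travelled the network for longer than raftMinTimeout.
        self.current_request_id = uuid.uuid4().hex
        return self.current_request_id

    def on_append_entries_response(self, follower_id, request_id):
        if request_id != self.current_request_id:
            return  # stale response from an earlier round
        self.last_valid_response[follower_id] = time.monotonic()

    def still_has_quorum(self):
        # Count followers that answered within the last raftMinTimeout;
        # the leader counts itself toward the majority.
        cutoff = time.monotonic() - RAFT_MIN_TIMEOUT
        recent = sum(1 for t in self.last_valid_response.values() if t > cutoff)
        return recent + 1 > (len(self.follower_ids) + 1) // 2


# In the leader's heartbeat loop, something like:
#     if not lease.still_has_quorum():
#         step_down_to_follower()
```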

bakwc commented 8 years ago

Started a discussion: https://groups.google.com/forum/#!topic/raft-dev/j2_w7JDcTaY

zhebrak commented 8 years ago

We also have to make sure that the step-down timeout is lower than the minimal election timeout, since we don't want a situation where the former leader has not yet stepped down but the rest of the cluster has already elected a new one. The difference between the step-down timeout and the expected election timeout will be downtime without a leader, though. I guess the solution here is to step down after N heartbeats without a majority of responses and start the election timeout at N heartbeat intervals.

<step down timeout> = N * <heartbeat timeout>
<election timeout> = rand[N * <heartbeat timeout>, m * N * <heartbeat timeout>]

With m = 3 the expected election timeout is 2 * N * <heartbeat timeout>, so the expected downtime is N * <heartbeat timeout>.
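To make the relationship concrete, here is a small sketch with illustrative values (N, m and the heartbeat interval are not raftos settings):

```python
import random

HEARTBEAT_TIMEOUT = 0.1  # seconds, illustrative
N = 5                    # heartbeats without a majority before stepping down
m = 3

step_down_timeout = N * HEARTBEAT_TIMEOUT


def new_election_timeout():
    # rand[N * <heartbeat timeout>, m * N * <heartbeat timeout>]
    return random.uniform(N * HEARTBEAT_TIMEOUT, m * N * HEARTBEAT_TIMEOUT)


# The expected election timeout is (1 + m) / 2 * N * HEARTBEAT_TIMEOUT,
# i.e. 2 * N * HEARTBEAT_TIMEOUT for m = 3, so the expected window without
# a leader is roughly that minus step_down_timeout = N * HEARTBEAT_TIMEOUT.
```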

zhebrak commented 8 years ago

I've added a fix here but still have to test it thoroughly.

bakwc commented 8 years ago

I'm afraid it still does not provide a strong guarantee that there will be no second leader for a short period of time. For example, suppose some node loses connection to the other nodes. It becomes a candidate and starts leader elections repeatedly while the connection is down. When the connection comes back, that node can become the new leader very quickly (in less than the stepDownTimeout on the previous leader).

zhebrak commented 8 years ago

You are right. In this case the old leader that didn't lose connection will know that its term is outdated as soon as it receives a response from the converted node, i.e. after at most a heartbeat interval plus a round trip. Or it can be disconnected right away, and then we'll have two leaders for at most the step-down interval.

zhebrak commented 8 years ago

This doesn't look like a solvable issue with this election algorithm. We can introduce a silent period for a new leader, during which it won't answer client requests: it waits out the step-down timeout so that no other leader can still be around. Availability suffers slightly, though.
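A rough sketch of such a silent period, with made-up names (not the raftos API); the new leader keeps sending heartbeats but rejects client requests until any previous leader must have stepped down:

```python
import asyncio

STEP_DOWN_TIMEOUT = 0.5  # must match the old leader's step-down timeout


class SilentPeriodLeader:
    def __init__(self):
        self.serving = False

    async def on_election_won(self):
        # Heartbeats continue as usual, but client requests are refused
        # until the previous leader has certainly stepped down.
        self.serving = False
        await asyncio.sleep(STEP_DOWN_TIMEOUT)
        self.serving = True

    def handle_client_request(self, request):
        if not self.serving:
            raise RuntimeError("leader is in its silent period, retry later")
        # ... normal request handling ...
```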

bakwc commented 7 years ago

I'd suggest not using the leader check at all. Instead you can implement a distributed lock algorithm on top of Raft, see https://github.com/bakwc/PySyncObj#lock (it can be implemented on top of the KV storage: if some value is set, it means the lock is acquired).
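For illustration, here is roughly what the lock-over-KV idea looks like as a replicated state machine: every node applies acquire/release in the same log order, so the check-and-set is effectively atomic. The names and TTL handling are a sketch, not the actual PySyncObj API:

```python
import time

LOCK_TTL = 10.0  # seconds; the holder must refresh the lock more often than this


class LockStateMachine:
    """Applied identically on every node, in Raft log order."""

    def __init__(self):
        self.locks = {}  # lock name -> {"owner": node_id, "ts": acquire time}

    def acquire(self, name, node_id, now):
        entry = self.locks.get(name)
        if entry is None or now - entry["ts"] >= LOCK_TTL or entry["owner"] == node_id:
            # Free, expired, or a refresh by the current holder: "some value
            # is set" means the lock is acquired.
            self.locks[name] = {"owner": node_id, "ts": now}
            return True
        return False

    def release(self, name, node_id):
        entry = self.locks.get(name)
        if entry is not None and entry["owner"] == node_id:
            del self.locks[name]
```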