Unavailability - Githubissues

Unavailability can be caused by an unexpected failure to reach a node, or by a node being taken offline for maintenance.

Hinted handoff for temporary failures: If a node S goes offline temporarily, the replicas intended for S are forwarded to another node S'. S' is chosen from outside the preference list of N nodes, say the (N+1)st node. S' stores each such replica with a special marker (a hint) that names S as the intended owner: (S, replica). When S' detects that S is available again, it sends all of S's replicas back to S.
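
The handoff above can be sketched roughly as follows. This is a minimal in-memory illustration, not Dynamo's implementation: the `Node` class, `write`, `put_hinted`, and `hand_back` are hypothetical names, and the ordered list passed to `write` is assumed to already be the key's preference list followed by its fallback nodes.

```python
from collections import defaultdict


class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}                 # key -> value for replicas this node owns
        self.hints = defaultdict(dict)  # intended owner's name -> {key: value}

    def put(self, key, value):
        self.store[key] = value

    def put_hinted(self, intended, key, value):
        # Keep the replica apart from our own data, tagged with its real owner.
        self.hints[intended.name][key] = value

    def hand_back(self, intended):
        """Send hinted replicas back once `intended` is reachable again."""
        if not intended.alive:
            return 0
        pending = self.hints.pop(intended.name, {})
        for key, value in pending.items():
            intended.put(key, value)
        return len(pending)


def write(key, value, ordered_nodes, n):
    """Write to the first n nodes; a dead one is replaced by a node beyond the first n."""
    preferred, fallbacks = ordered_nodes[:n], iter(ordered_nodes[n:])
    for node in preferred:
        if node.alive:
            node.put(key, value)
            continue
        substitute = next((f for f in fallbacks if f.alive), None)
        if substitute is not None:      # otherwise the write stays under-replicated
            substitute.put_hinted(node, key, value)


a, b, c, d = (Node(name) for name in "ABCD")
b.alive = False                      # B is temporarily offline
write("k", "v", [a, b, c, d], n=3)   # D stands in for B and keeps a hint
b.alive = True                       # B comes back
returned = d.hand_back(b)            # D returns B's replica
```

Keeping hinted replicas in a separate map, rather than in the substitute's own store, is one simple way to make the later hand-back a single lookup by owner.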

Failures are detected with heartbeats ('are you alive?') plus gossip ('I just learned that node F has failed').
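
A minimal sketch of how the two mechanisms could combine, assuming a fixed heartbeat timeout and explicit timestamps passed in by the caller. The `Membership` class and its method names are hypothetical, and real systems typically add suspicion levels and message transport that are left out here.

```python
class Membership:
    """One node's local view of which peers are alive."""

    def __init__(self, name, peers, timeout=3.0):
        self.name = name
        self.timeout = timeout                     # assumed fixed, in seconds
        self.last_seen = {p: 0.0 for p in peers}   # peer -> time of last heartbeat reply
        self.failed = set()

    def on_heartbeat_reply(self, peer, now):
        # A direct reply is the strongest evidence that the peer is alive.
        self.last_seen[peer] = now
        self.failed.discard(peer)

    def check(self, now):
        """Mark peers whose last reply is older than the timeout; return the new ones."""
        newly = {p for p, t in self.last_seen.items()
                 if now - t > self.timeout and p not in self.failed}
        self.failed |= newly
        return newly

    def on_gossip(self, failed_peers):
        """Merge another node's failure reports into our own view."""
        self.failed |= set(failed_peers) & set(self.last_seen)


a = Membership("A", ["B", "C", "F"])
b = Membership("B", ["A", "C", "F"])
for peer in ("B", "C", "F"):
    a.on_heartbeat_reply(peer, now=10.0)
for peer in ("B", "C"):                 # F stops answering A's heartbeats
    a.on_heartbeat_reply(peer, now=12.0)
newly_failed = a.check(now=14.0)        # A notices F on its own
b.on_gossip(a.failed)                   # B learns about F from A without probing it
```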

Another consideration for failure recovery is efficiency. When a node fails, its data has to be redistributed to other node(s), so one way to characterize recovery efficiency is the load placed on the receiving node(s). For example, if a single node receives the failed node's entire load, that node might be overwhelmed, and requests for objects on it may be delayed. If the failed node's load is instead spread evenly across more nodes, each receiving node should experience only a minor delay.

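In consistent hashing, Dynamo maps each physical node to multiple virtual nodes to balance the load when the network topology changes because of unavailability.
In rendezvous hashing, it seems such load balancing for topology changes is already baked in: each key independently picks its highest-scoring node, so a failed node's keys fall back to their individual second choices rather than to a single neighbor.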
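
To make the comparison concrete, here is a small experiment sketch that counts which surviving nodes inherit a failed node's keys under three schemes: a ring with one token per node, a ring with virtual nodes, and rendezvous hashing. The node names, key names, the choice of 64 virtual nodes, and the SHA-256 based hash `h` are all assumptions for illustration, not Dynamo's actual parameters.

```python
import bisect
import hashlib
from collections import Counter


def h(text):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def ring_owner_fn(nodes, vnodes):
    """Consistent hashing: each node owns `vnodes` tokens on the ring."""
    tokens = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    points = [t for t, _ in tokens]

    def owner(key):
        # First token clockwise from the key's position, wrapping around.
        i = bisect.bisect_right(points, h(key)) % len(tokens)
        return tokens[i][1]
    return owner


def rendezvous_owner_fn(nodes):
    """Rendezvous (highest random weight) hashing: best-scoring node wins."""
    def owner(key):
        return max(nodes, key=lambda n: h(f"{n}|{key}"))
    return owner


def receivers(make_owner, nodes, failed, keys):
    """Per surviving node, how many of the failed node's keys it inherits."""
    before = make_owner(nodes)
    after = make_owner([n for n in nodes if n != failed])
    return Counter(after(k) for k in keys if before(k) == failed)


nodes = ["n0", "n1", "n2", "n3", "n4"]
keys = [f"key-{i}" for i in range(10_000)]

single = receivers(lambda ns: ring_owner_fn(ns, 1), nodes, "n2", keys)
virtual = receivers(lambda ns: ring_owner_fn(ns, 64), nodes, "n2", keys)
hrw = receivers(rendezvous_owner_fn, nodes, "n2", keys)
print(single, virtual, hrw, sep="\n")
```

With a single token per node, all of the failed node's keys land on its one clockwise successor. With virtual nodes or rendezvous hashing, they are expected to spread over several survivors, which is the load-balancing property discussed above.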

zheguang / resdb

Unavailability #9