
[DocDB] Raft leader should reject updates from a leader with same op index but different term #23453

Open Huqicheng opened 3 months ago

Huqicheng commented 3 months ago

Jira Link: DB-12375

Description

If a majority of the tablet peers go down at the same time, there are cases where Raft can run into a split-brain situation, leading to under-replicated tablets (which blocks load balancer activity). Even if that happens, Raft should ensure there is no ambiguity and resolve the leadership changes cleanly.

Consider an RF-3 configuration with N1 (leader), N2 (follower 1), and N3 (follower 2). If a majority of the nodes crash and come back up, and N2 (follower 1) becomes the new leader, then N1 (old leader) can accept updates from N2 (new leader) even though their WALs have diverged. This can cause the RocksDB on N1 (old leader) to become out of sync with the WAL on N1, so it is not safe to let N1 continue to be part of the quorum; it needs to be removed from the quorum. More details below:

  1. N1 (old leader) has 15.101 as the last committed op id; N2 (follower 1) and N3 (follower 2) have 15.101 only in memory.
    N1 (leader): 15.100 (disk) 15.101 (disk)
    N2 (follower 1): 15.100 (disk) 15.101 (memory)
    N3 (follower 2): 15.100 (disk) 15.101 (memory)
  2. N2 and N3 have not flushed op 15.101 to disk. They crash and restart, and may no longer have 15.101 because the OS flush might not have completed.
    N1 (leader): 15.100 (disk) 15.101 (disk)
    N2 (follower 1): 15.100 (disk) 15.101 (lost)
    N3 (follower 2): 15.100 (disk) 15.101 (lost)
  3. N2 and N3 can form a quorum with term 16. They both agree on 15.100 as the last op, and N2 becomes the new leader for term 16.
  4. N2 (new leader) replicates 16.101, the NoOp of term 16, to N3 (follower 2) and N1 (old leader). A new quorum is formed by the original followers, so new writes can happen.
    N1 (old leader): 15.100 (disk) 15.101 (disk)
    N2 (new leader): 15.100 (disk) 16.101 (NoOp) 16.102 (new write)
    N3 (follower 2): 15.100 (disk) 16.101 (NoOp) 16.102 (new write)
  5. The old leader accepts updates from the new leader, which can lead to a split-brain situation: N2 (new leader) has writes 15.100 and 16.102, while N1 (old leader) has writes 15.100, 15.101, and 16.102. If the old leader is elected leader again, users can read different results.
    N2 (new leader) =raft heartbeat=> N1 (old leader) with entries (16.101 16.102)
    N1 (old leader): 15.100 15.101 (16.101 deduplicated) 16.102 (new write)
    N2 (new leader): 15.100 (disk) 16.101 (NoOp) 16.102 (new write)
    N3 (follower 2): 15.100 (disk) 16.101 (NoOp) 16.102 (new write)

The problem described in step (5) should be prevented: when the old leader receives an op whose index it already holds but whose term differs, it should reject the update rather than deduplicating by index alone. A minimal sketch of such a check follows.
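
Below is a minimal sketch of that term-aware conflict check, assuming a simplified single-tablet log model. The types and names here (OpId, LogEntry, LocalLog, CheckAndAppend) are hypothetical illustrations, not the actual yb::consensus classes, and this is not a proposed patch.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct OpId {
  int64_t term = 0;
  int64_t index = 0;
};

struct LogEntry {
  OpId id;
  // Payload omitted; only the op id matters for this check.
};

// A toy stand-in for a peer's WAL.
class LocalLog {
 public:
  // Term of the entry the peer still holds at `index`, if any.
  std::optional<int64_t> TermAt(int64_t index) const {
    auto it = entries_.find(index);
    if (it == entries_.end()) return std::nullopt;
    return it->second.id.term;
  }

  int64_t committed_index() const { return committed_index_; }
  void set_committed_index(int64_t index) { committed_index_ = index; }

  void Append(const LogEntry& entry) { entries_[entry.id.index] = entry; }

 private:
  std::map<int64_t, LogEntry> entries_;
  int64_t committed_index_ = 0;
};

enum class AppendResult {
  kAccepted,
  // The peer already committed (and may have applied to RocksDB) an op at the
  // same index but a different term; its history has diverged from the new
  // leader's, so the update must be rejected and surfaced to the leader.
  kRejectedDivergedHistory,
};

AppendResult CheckAndAppend(LocalLog* log, const std::vector<LogEntry>& batch) {
  for (const LogEntry& entry : batch) {
    auto local_term = log->TermAt(entry.id.index);
    if (local_term.has_value() && *local_term != entry.id.term &&
        entry.id.index <= log->committed_index()) {
      // Same index, different term, and the local op is already committed:
      // accepting the batch would silently fork the committed history.
      return AppendResult::kRejectedDivergedHistory;
    }
    // (Truncation of an uncommitted conflicting suffix, as in standard Raft,
    // is elided for brevity.)
  }
  for (const LogEntry& entry : batch) {
    log->Append(entry);
  }
  return AppendResult::kAccepted;
}
```

In the scenario above, N1's log holds 15.100 and 15.101 with committed index 101; calling CheckAndAppend with the batch {16.101, 16.102} would return kRejectedDivergedHistory instead of silently deduplicating 16.101 by index.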

Issue Type

kind/bug


Huqicheng commented 1 month ago

We can prevent this from happening by enabling FLAGS_durable_wal_write (this might degrade performance and needs testing). Without this gflag enabled, the peer should be able to detect an op that it already received in a previous term, reject it, and surface the rejection to the leader.

Also, remote bootstrap (RBS) from this peer should be disallowed.
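
If the peer surfaces such a rejection, the leader side also needs to act on it. The sketch below only illustrates that idea, with hypothetical PeerState/PeerManager names; it does not correspond to YugabyteDB's actual consensus or remote bootstrap code.

```cpp
#include <set>
#include <string>

struct PeerState {
  std::string uuid;
  // Set when the peer reports a same-index, different-term conflict on an op
  // it has already committed.
  bool history_diverged = false;
};

class PeerManager {
 public:
  // Called when a follower rejects an update because its history diverged.
  void OnDivergedHistoryReported(PeerState* peer) {
    peer->history_diverged = true;
    diverged_peers_.insert(peer->uuid);
    // The peer cannot safely stay in the quorum: schedule a config change to
    // remove it so it can be re-added later via a fresh remote bootstrap.
    ScheduleRemoveFromQuorum(peer->uuid);
  }

  // Remote bootstrap (RBS) source selection: never bootstrap a new replica
  // from a peer whose WAL/RocksDB may have forked from the committed history.
  bool IsEligibleRbsSource(const std::string& peer_uuid) const {
    return diverged_peers_.count(peer_uuid) == 0;
  }

 private:
  void ScheduleRemoveFromQuorum(const std::string& peer_uuid) {
    // A real implementation would issue a Raft config change; omitted here.
    (void)peer_uuid;
  }

  std::set<std::string> diverged_peers_;
};
```

This keeps the diverged replica out of both the write quorum and the RBS source pool until it is rebuilt.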