Designing Data-Intensive Applications (Part 2)

yew11 commented 4 years ago

Replication

replication means keep a copy of the same data on multiple machines that are connected via a network.

The difficulty in replication is handling changes of replicated data.

Three Algorithms for replicating changes b/w nodes, almost all use one of these three:

single-leader
multi- leader
leaderless

Leaders and Followers

With multiple replicas, how do we ensure that all the data ends up on all the replica?

The most common approach:

when clients want to write to the db, they must send their requests to the leader. which first writes the new data to its local storage.
When the leader writes new data to its local storage, it also sends the data changes to all followers as part of a replication or change stream.
when clients want to read from db, it can query any, but writes are only accepted on the leader.

Synchronous vs. Asynchronous replication Screen Shot 2020-01-07 at 8 23 04 PM Follower 1 is synchronous, while the follower 2 is asynchronous.

About synchronous:

followers is guaranteed to have an up-to-date copy of the data. If the leader fails, the data is still available on the followers.
Disadvantage is write cannot proceed if one follower fails.

For above reason, it is impractical for all followers to be synchronous, any outage, would cause the whole system to grind to halt.

In practice, if you enable synchronous replication on a database, it usually means that one of the followers is synchronous, and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at least two nodes: the leader and one synchronous follower. This is called semi-synchronous

Often, leader-based replication is configured to be completely asynchronous, this means a write is not guaranteed to be durable. The advantage is the leader can continue processing writes, even if all of its followers have fallen behind.

Setting up new followers

Simply copying data files from one node to another not sufficient: clients are constantly writing to the db, the data is always in flux. So the process is:

Take a consistent snapshot of leader's db at some point in time.
Copy the snapshot to the new follower node
The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. This requires that the snapshot is associated with an exact position in the leader's replication log.
When the follower has processed the backlog of data changes since the snapshot, now it has caught up.

Handling Node outages

Goal: keep the system as a whole running despite individual node failures, and keep the impact of a node outage as small as possible.

Follower failure: Catch-up recovery On its local disk, each follower keeps a log of data changes it has received from the leader. If crashes, the follower can recover easily from its logs.

When the follower is online, it can requests data changes during the downtime.

Leader failure: Failover On of the followers needs to be promoted to be the new leader: clients need to write data to the new leader, and other followers need to start consuming data changes from the new leader. This is called failover

Determine if leader has failed: if the leader timed-out
Choose a new leader: the replica with the most up-to-date changes from the old leader.
Reconfigure the system to use the new leader. When the old leader comes back, it needs to be a follower.

Failover is fraught with things that can go wrong:

If asynchronous replication is used, the new leader may not have received all the writes from the old leader before it failed. The most common solution is for the old leader's unreplicated writes to be discarded.
Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents.
Split brain, both nodes can think they are the leader. So shut down one node if new leaders are detected.
What is the right timeout before the leader is declared dead?

Replication logs implementation

Statement-based replication

the leader logs every write request that it executes and sends that statement log to its followers (Every INSERT, UPDATE, DELETE).

problems:

any statement that calls a nondeterministic function, such as NOW() or RAND(), is likely to generate a different value.
If the statement use an auto incrementing column, or they depend on the existing data in the db, they must be executed in the same order
statement that have side effects (triggers, stored procedures, etc...), may result in different side effects.

It is possible the leader replace nondeterministic functions with a fixed value, but there are so many edges cases to consider.

Write-ahead log shipping Leader: log is an append-only sequence of bytes containing all writes to the db. When the follower process the log, it builds a copy of the exact same data structure as found on the leader.

problem: log describes the data on a very low level: a WAL contains details of which bytes were changed in which disk blocks: This makes replica closely coupled to the storage engine. Fucked if the db changes its storage format.

Logical (row-based) log replication

use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals.

A logical log for a relational db is usually a sequence of records at the granularity of a row.

For an inserted row, log contains the new values of all columns.
For a deleted row, contains information to uniquely identify the row (primary key or all the columns).
Updated row, new values of all columns.

Trigger-based replication A trigger lets you register custom application code that is automatically executed when a data changes occurs in a db system. But larger overhead and prone to bugs.

yew11 commented 4 years ago

Problems with Replication lag

Leader-based replication is great for read heavy applications.

If an application reads from an asynchronous follower, it may see outdated data. If we wait a while, all the follower will have the same data, Eventual Consistency

if the system is operating near capacity or if there is a problem in the network, the lag can easily increase.

Reading your own writes

Screen Shot 2020-01-09 at 8 13 53 PM

In this case, we need read-after-write consistency or read-your-writes consistency. If only guarantee for the current writer, not guaranteed for other users. How do we implement?

When reading something that user may have modified. Read it from the leader. otherwise, read it from the follower. So somehow know what have been modified by the user (user's profile, for example)
If most things are editable by the user, other criteria decide whether to read from the leader: Keep track of the time last updated. if within some time range, read from leader.
If replicas are distributed across multiple datacenters, Any request that needs to be served by the leader must be routed to the datacenter that contains the leader.

But, what if user have multiple devices, additional issues to consider.

remembering the timestamp becomes more difficult. code running on one device doesn't know updated happened to other device. This metadata will need to be centralized.
Replicas across different datacenters, different devices could be routed to different datacenter.

Monotonic Reads

When reading asynchronous followers a user can see things moving backward in time. This happens when a user make multiple reads from different replicas(say two).

If the first reads returns something just been written, but the second reads happens to read from a datacenter that hasn't been written to, it could be very confusing for the user.

Monotonic reads guarantee this does not happen, it's a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency. It make sure that each user always makes their reads from the same replica. Could be chosen based on a hash of the user ID.

Consistent Prefix reads

consistent prefix reads: if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.

This problem has in partitioned databases. In many distributed dbs, different partitions operate independently. No global ordering of writes. All write to the same partition? Not so efficient

yew11 commented 4 years ago

Multi-Leader Replication

master-master or active/active replication

Have a database with replicas in several different datacenters. One leader per datacenter.

Performance: In a single-leader configuration, every write must find the one leader, may increase latency to writes. In a multi-leader configuration, every write can be processed in the local datacenter and is replicated asynchronously to other datacenters. network delay is hidden from users.
Tolerance of datacenter outages: Single: A leader fails, failover can promote a follower in another datacenter to be leader. Multi: each datacenter can continue operate independently.

Downside for multi-leader: same data may be concurrently modified in two different datacenters, and those weri

yew11 / stuff-i-read