Reading Notes #8

Open yfractal opened 2 years ago

yfractal commented 2 years ago

I read, I forget, I record

yfractal commented 2 years ago

Overload Control for Scaling WeChat Microservices

Year: 2018, citation count: 49, from: Tencent Inc. + Columbia University + National University of Singapore

TLDR

  1. Problem: overload control
  2. Goal: provide the best user experience with limited resources (an economic consideration) while handling WeChat's huge scale
  3. How: adaptive service degradation (request shedding) based on service priority + user priority; queue delay for detecting overload, piggybacked to upstream servers

More Detail

  1. Problem: overload can be caused by a load surge or by internal reasons such as a network outage, system configuration changes, or software bugs. Note: to provide high availability we can't avoid failures, but we can make our systems recover fast when failures happen.

  2. Goal: WeChat's workload changes dynamically, and for such a workload it's uneconomic to over-provision physical machines. WeChat also wants to provide the best user experience, so requests are dropped based on business-function priority and user priority. Another thing that needs to be taken into consideration is that WeChat has a huge cluster (3000+ services), so the overload control has to be adaptive and service agnostic.

  3. How: use queue time to detect overload. Assign different priorities to different business functions, such as Login and Payment. Create priority slots for users, hash each user into a slot, and update the user's priority periodically. Piggyback control info to upstream services with responses (a rough sketch follows this list).
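A minimal sketch of the shedding side of this idea, not the paper's actual implementation: detect overload from the average queuing delay, then admit a request only if its compound (business priority, user priority) level is within an adaptive admission level. All names, constants, and the adjustment policy below are made up for illustration.

QUEUE_DELAY_LIMIT_MS = 20        # hypothetical overload threshold
USER_SUB_LEVELS = 128            # users are hashed into this many sub-levels
MAX_BUSINESS_PRIORITY = 10       # smaller number = more important


def compound_level(business_priority: int, user_id: int) -> int:
    # smaller level = more important; the user hash would be re-seeded periodically
    return business_priority * USER_SUB_LEVELS + hash(user_id) % USER_SUB_LEVELS


class AdmissionControl:
    def __init__(self):
        self.admit_level = MAX_BUSINESS_PRIORITY * USER_SUB_LEVELS  # admit everything at first

    def adjust(self, avg_queue_delay_ms: float) -> None:
        if avg_queue_delay_ms > QUEUE_DELAY_LIMIT_MS:
            # overloaded: lower the admission level to shed more traffic
            self.admit_level = max(0, self.admit_level - USER_SUB_LEVELS // 4)
        else:
            # recover slowly to avoid oscillation
            self.admit_level = min(MAX_BUSINESS_PRIORITY * USER_SUB_LEVELS, self.admit_level + 1)

    def admit(self, business_priority: int, user_id: int) -> bool:
        return compound_level(business_priority, user_id) <= self.admit_level


ac = AdmissionControl()
for _ in range(5):
    ac.adjust(avg_queue_delay_ms=35)                  # pretend queuing delay spiked for a while
print(ac.admit(business_priority=0, user_id=42))      # high-priority function (e.g. Payment): kept
print(ac.admit(business_priority=9, user_id=42))      # low-priority function: shed

In the real system the current admission level is also piggybacked back to upstream services, so they can stop sending requests that would be dropped anyway.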

Highlight

queue time for detecting overload, service agnostic, 3 Overload Scenarios

[screenshot]

business-function-based + user-based request dropping

Others

I'm really curious about the relationship between overload control in microservices architectures and congestion control in networks. Using queue time to detect overload is a good choice, but what about RTT, as TCP congestion control uses? From the author ordering, we can tell this research was led by industry rather than academia.

Maybe Next

yfractal commented 2 years ago

Scaling Memcache at Facebook

TL;DR

Goal

  1. huge throughput
  2. a constantly evolving product

Problem/Situation

  1. two orders of magnitude more reads than writes
  2. off-the-shelf memcached as the building block
  3. transient stale data is acceptable to users
  4. economic and latency constraints

Idea

  1. scale the cache
  2. trade consistency for performance
  3. handle problems at different scales

Detail

Single front-end cluster

[screenshots]

Multiple clusters

[screenshot]

Geo distributed

[screenshot]

Reading Resources

Authors' talk (slides + video) [1] => MIT 6.824 Lec 16 notes + video + FAQ [2] => the paper [3]

Thinking

Real-life problems are interesting because we have to handle them with limited resources.

We need enough skill, but we have to keep trade-offs in mind and keep asking what our real goals are.

Many industry solutions are not ideal, such as Facebook's memcache, the Google File System, or Google Dapper, but they tackled the problems (or most of the problems) and inspired others.

And a general idea is used many times in this paper: design so that efficiency increases with load rather than decreasing [4].

References

  1. Scaling Memcache at Facebook at NSDI '13 https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/nishtala
  2. MIT 6.824 Spring 2022 https://pdos.csail.mit.edu/6.824/schedule.html
  3. Scaling Memcache at Facebook https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf
  4. MIT 6.S081 Receive Livelock lecture note http://nil.csail.mit.edu/6.824/2020/notes/l-memcached.txt
yfractal commented 2 years ago

The SB-tree: An Index-Sequential Structure for High-Performance Sequential Access

Why Read This?

To understand the LSM-tree[1] paper and to learn cost analysis.

TL;DR

Goal

A B-tree variant that offers high-performance long-range retrieval.

Situation/Problem

A single multi-page disk access is very fast (by a factor of ten) compared with reading the same number of pages from different places.

Solution

A B-tree organizes and manages entries (key-value pairs) in one-page nodes.

An SB-tree organizes and manages the nodes themselves in multi-page blocks.

A multi-page block is a sequence of pages that can be accessed together.

Cost Analysis

To analyze the I/O costs of a B-tree, we need to consider the buffer.

The paper splits the tree into 3 parts: the levels above w, level w, and the levels below w.

The paper supposes that all nodes above level w are in the buffer and that some of the nodes on level w are in the buffer,

while all nodes below level w are not in the buffer.

The next step is calculating the probability P0 that any given node on level w is not in the buffer.

After we search for a key, all nodes along the search path can be cached in the buffer.

That is one set of nodes per key search.

And we can suppose the buffer can hold n such sets (the paper has a much more accurate definition, but it is not easy to understand, so I simplified it here).

Then the probability is P0 = (1 - 1/Z)^n, where Z is the number of nodes on level w.
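A quick numeric check of that formula, with made-up values for Z and n (not the paper's exact example):

def miss_probability(z_nodes_on_level_w: int, n_cached_search_paths: int) -> float:
    # probability that a given level-w node is in none of the n cached search paths
    return (1 - 1 / z_nodes_on_level_w) ** n_cached_search_paths

# e.g. 500 nodes on level w and a buffer holding roughly 300 search paths
print(miss_probability(500, 300))   # ~0.55: about half the level-w lookups still hit disk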

One example is given by the paper: the buffer size is 1000, leaf and internal nodes are 70% full, and index separators are 8 bytes long.

[screenshots]

Thinking

A storage layer that does something similar should be interesting: the layer could roughly keep adjacent nodes in a multi-page disk block, so the upper layer doesn't need to know how the layer works. I'm not sure whether DBs work this way.

References

  1. P. O'Neil et al. The Log-Structured Merge-Tree (LSM-Tree). 1996
yfractal commented 2 years ago

Slack’s Incident on 2-22-22 Reading Note

https://github.com/yfractal/blog/blob/master/blog/2022-05-03-slack-incident-reading-note.md

yfractal commented 2 years ago

The Google File System Reading Note

GFS is not a perfect system, but it is great in that it demonstrates how to make trade-offs and design choices, and how to achieve high availability with limited resources.

Highlight

Simplicity is the first design goal.

They focus only on the important goals, such as designing for huge sequential access rather than for random access.

A single master, a weak consistency model, and co-design of the client and master.

A huge chunk size, and separation of control flow from data flow to remove the single-master bottleneck.

References

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 2003

yfractal commented 2 years ago

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases Reading Note

TL;DR

Goal

Reduce sync cost and increase sync performance.

Problem/Situation

Traditional database sync costs too much in a cloud environment.

Idea

Decouple storage from compute (ship only the log, do the rest asynchronously) and use quorums for fault tolerance.

More Detail

AWS EC2 is not ideal for databases: it is not easy to scale, physical machines die, disks die, and recovery is slow.

Instead of using EC2 local storage, AWS can provide EBS for database storage, so that storage can fail independently.

But traditional database sync then costs too much, as the data has to go over the network, and it can't provide enough fault tolerance.

So AWS designed Aurora: decouple storage from compute (replicate the log only, with asynchronous operations) and use quorums for fault tolerance.

Decoupling storage from compute

[screenshot]

The mirrored MySQL model has two problems: it needs to replicate too much data (such as data pages), and most of the writes are sequential and synchronous.

[screenshot]

Aurora, by contrast, just sends the log and metadata to storage; everything else is the storage layer's responsibility.

[screenshot]

And after the log has been written to the storage disk (steps 1 and 2 in the figure above), all remaining steps are asynchronous.

Quorum read/write technique

The quorum read/write technique is used to guarantee that we read the latest data even if some replicas fail.

For a read to be aware of the most recent write, we need R (read nodes) + W (write nodes) > N (total nodes).

And for a write to be aware of the most recent write, so as to avoid conflicting writes, we need W > N/2.

AWS wants to allow reads even if one AZ fails plus one additional node fails, which means 3 replicas are not enough.

So they use 6 replicas, 2 per AZ, and set W = 4 and R = 3. With such a setup, clients can still write even after losing an entire AZ (2 nodes) and can still read the latest data even after losing an AZ plus one additional node.

These are two independent failure modes: an entire AZ failing and a single node failing.
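A small sanity check of that configuration, using the numbers above (the constant names are mine):

# N = 6 replicas (2 per AZ), write quorum W = 4, read quorum R = 3
N, W, R = 6, 4, 3
REPLICAS_PER_AZ = 2

assert R + W > N                      # any read quorum overlaps any write quorum
assert W > N / 2                      # any two write quorums overlap, avoiding conflicting writes
assert N - REPLICAS_PER_AZ >= W       # losing a whole AZ still leaves a write quorum
assert N - REPLICAS_PER_AZ - 1 >= R   # losing an AZ plus one more node still leaves a read quorum

print("6/4/3 tolerates AZ failure for writes and AZ+1 failure for reads")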

AWS thinks that, at some point, it becomes hard to further reduce the probability of failure, so they focus on reducing MTTR (mean time to repair) instead.

The Google File System (GFS) uses many inexpensive commodity components to build a distributed file system, and component failures are the norm rather than the exception.

Google uses a fast-recovery strategy to achieve high availability.

In my understanding, we can't guarantee that a service never fails, because it is infeasible to solve all possible problems.

But we can guarantee fast recovery, as that is a concrete problem and not hard to tackle.

The trick of fast recovery is to convert unknown problems into concrete problems.

References

  1. Alexandre Verbitski. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases, 2017
  2. MIT 6.824: Distributed Systems Spring 2020 LEC 10
  3. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System, 2003
yfractal commented 2 years ago

Improving Class Firewall Regression Test Selection by Removing the Class Firewall Reading Note

Agenda

TL;DR

Abstract

The change-based regression technique is a variant of the class firewall technique; it found the same defects, selected fewer tests, and required a simpler, less costly analysis.

Idea

The class firewall technique builds a class relation diagram (CRD) and uses the diagram to select tests.

The CRD covers all dependency relationships, but the tests may not exercise all of them.

The change-based regression technique builds the dependencies by running all the test cases, which lets it select fewer test cases.

When a test case runs, it executes some set of classes, and a change to any of those classes may cause the test to produce a different result.

More Detail

The class firewall technique builds a class relation diagram (CRD); the diagram is the whole picture of the program.

But each test case may rely on only part of that dependency information.

The change-based regression technique builds the dependencies per test case, which allows it to select fewer test cases.

When a test case is executed, only the classes it touches can affect the test case.

Class firewall regression test selection definition:

For a changed class C, a program P, and a set of test cases for P, class firewall regression test selection, denoted CFTS(C), selects the subset of the test cases whose touch sets, when executed on P, include a class in the class firewall of C.

Change-based regression test selection technique definition:

For a changed class C in a program P and a set of test cases for P, change-based regression test selection, denoted CTS(C), selects the subset of the test cases whose touch sets, when executed on P, include C.
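A minimal sketch of the two selection rules, using per-test "touch sets" (the classes each test executes). The test names, classes, and firewall below are made up:

# classes executed by each test case (its touch set)
touch_sets = {
    "test_login":  {"Session", "User"},
    "test_logout": {"Session"},          # uses Session but never touches User
    "test_search": {"Index", "Query"},
}

def firewall_selection(class_firewall):
    # class firewall selection: any test whose touch set intersects the firewall
    return {t for t, touched in touch_sets.items() if touched & class_firewall}

def change_based_selection(changed_class):
    # change-based selection: only tests that actually touch the changed class
    return {t for t, touched in touch_sets.items() if changed_class in touched}

# suppose class User changed, and Session depends on User, so the firewall is {User, Session}
print(firewall_selection({"User", "Session"}))   # {'test_login', 'test_logout'}
print(change_based_selection("User"))            # {'test_login'} -- fewer tests selected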

References

  1. M. Skoglund and P. Runeson. Improving class firewall regression test selection by removing the class firewall. 2007
yfractal commented 1 year ago

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Year 2012, Ref: 5662, NSDI, UC Berkeley

TL;DR

  1. Problem/Goal: handle iterative algorithms efficiently

  2. Idea: keep data in memory instead of on disk (compared with MapReduce); use data locality to reduce network waste; handle abnormal cases (slow nodes or failures) efficiently

  3. How: generalize the problem into a dataflow; build a complete view of the dataflow; use immutable data (atomicity and multiple versions)

More Detail

The Problems and the Ideas

Spark is designed for large-scale data analytics.

For data analytics we need to consider:

  1. a task needs a lot of computation power
  2. a single machine is limited, which means we have to split a big task into small ones and send those small tasks to different machines
  3. we have to handle slow tasks/nodes and task/node failures

Before Spark, MapReduce didn't support iterative algorithms well.

But iterative algorithms are common in data analytics, so Spark wants to solve this problem.

The first idea of Spark is: using memory to store data instead of disk.

MapReduce needs to save data into GFS[2] between jobs.

For example, if a task needs to map an array and then do filtering, the map result will be saved in GFS

and the filter needs to read it back from GFS.

If we save the data in memory and reuse it between jobs, that can speed up the tasks a lot.

The next question is how we should represent the data.

One way is using Distributed Shared Memory(DSM) and making data mutable.

But mutable data is hard to manage, for example for sharing, passing around, and recovery.

Let's see a simple example.

The input is an array of numbers and we want to add 1 to each element:

input: data0 [data00, data01, data02 ....]
operation: (map (fn [x] (+ x 1)))
output: data1 [data10, data11, data12 ....]

If we do this by updating the data in place (mutably):

i = 0

while i < data.length
  data[i] = data[i] + 1
  i += 1
end

during the execution, the data in memory will be a mix of updated and original elements:

[data10, data11, data12, data03, data04, .... data0n], i.e. [data1x, data0x].

It seems fine for normal cases.

But for a large cluster, we need to consider failures: slow jobs and crashing nodes[2].

When we find a slow job, we'd like to migrate the job to other nodes, but it's hard to migrate half-updated data like [data1x, data0x].

But if the data is immutable, we can do this easily.

Since data0 is in memory and will not be changed, we just need to send it to other nodes.

Immutability makes state changes clear and manageable (Erlang doesn't allow variables to be reassigned, which helps a lot when writing complex concurrent programs).

Or we can think of this in DBMS terms: it provides atomicity and multi-version data.
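To contrast with the in-place loop above, here is the same transformation done immutably; this is just an illustration, not Spark code:

data0 = [1, 2, 3, 4]

# produce a new list instead of mutating the input; data0 stays intact,
# so a slow or failed task can simply be re-executed anywhere from the same input
data1 = [x + 1 for x in data0]

assert data0 == [1, 2, 3, 4]
assert data1 == [2, 3, 4, 5]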

How it works

Partition

Spark splits a job into smaller tasks and lets different nodes handle them.

[screenshot]

Lineage

Spark uses lineage to represent the whole dataflow and performs operations (optimization and recovery) based on that flow.

Let's see the map example again.

input: data0 [data00, data01, data02 ....]
operation: (map (fn [x] (+ x 1)))

If the operation fails and we still have data0, we can re-run the code.

Immutability allows us to access the input data0, and since we know the operation, we can re-run it whenever we need to.

That helps users do recovery easily (data1 is simply the result of applying operation1 to data0).

Let's see a more complex example.

input: data0 [data00, data01, data02 ....]
operation0: (map (fn [x] (+ x 1)))
operation1: distinct

After we finish operation0, if we know the next operation is distinct, we can order our data and partition it.

Then the distinct just needs to pull its partition.

[screenshot]

From the above examples, we can see that if we know the whole picture, it can help us with optimization and recovery.

So when Spark loads a job, it builds this picture first; the picture is called the lineage.

A similar thing is called lazy evaluation in functional programming, and ActiveRecord[3] is an elegant example of its use.
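Here is a toy sketch of these two ideas together: an RDD-like object that records its lineage (parent + operation) and only computes when asked, so a lost partition could be recomputed by walking the same lineage. The class and method names are made up, not Spark's API:

class ToyRDD:
    def __init__(self, parent=None, op=None, source=None):
        self.parent, self.op, self.source = parent, op, source

    def map(self, fn):
        # building the graph is cheap; nothing runs yet
        return ToyRDD(parent=self, op=lambda rows: [fn(x) for x in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [x for x in rows if pred(x)])

    def collect(self):
        # walk the lineage back to the source, then apply the operations forward;
        # recovery of a lost partition would do the same walk
        if self.parent is None:
            return list(self.source)
        return self.op(self.parent.collect())

data0 = ToyRDD(source=[1, 2, 3, 4])
data1 = data0.map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
print(data1.collect())   # [2, 4]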

Narrow dependencies and wide dependencies

When each partition of a parent RDD is used by at most one partition of the child RDD, it is a narrow dependency.

For example, the map

[screenshot]

For wide dependencies, multiple child partitions may depend on one parent partition, e.g. the distinct operation.

[screenshot]

This difference affects how we schedule a job and how we handle recovery.

For narrow dependencies only, we can let the partition run all the way to the end.

[screenshot]

But for wide dependencies, we need to wait for all the parent partitions to finish.

[screenshot]

During recovery, narrow dependencies are easy because a lost partition only depends on one parent partition.

[screenshot]

But for wide dependencies, recovery may require a complete re-execution.

[screenshot]

Summary

Spark is very successful because it identifies and solves a real problem in data analytics.

For me, I'd like to think of Spark as an application of functional programming in a distributed-system context.

A complete view of the dataflow helps with performance optimization and failure recovery.

Reusing data in memory gives a lot of performance benefits.

References

  1. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of NSDI, pages 15–28, 2012. https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf.
  2. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In 19th Symposium on Operating Systems Principles, pages 29–43, Lake George, New York, 2003.
  3. ActiveRecord. https://github.com/rails/rails/tree/main/activerecord
yfractal commented 1 year ago

COPS Reading Note

Title: Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS SOSP 2011, Ref: 801, UC Berkeley, Princeton University, Intel Labs, Carnegie Mellon University.

TL;DR

  1. Problem/Goal: provide availability, low latency, partition tolerance, and high scalability for geo-distributed data stores.

  2. Idea: consider the performance requirements first, then provide the best consistency model possible: causal+ consistency.

  3. How: asynchronously replicate data between data centers.

     Clients maintain dependency info and pass it to the datacenter.

More detail

Causal+ consistency

Causality describes how one event may affect another in a distributed system.

In the physical world, we say event a may affect event b if event a happens before event b.

We need to describe this relationship in distributed systems too.

In a distributed system, there are many processes and each process is independent.

Within a process, we can know the events' order easily: if event a occurs before event b, then event a may affect event b, and we say event a causally affects event b.

Between processes, the only way to share information is message passing.

What we do know is that if a is the sending of a message by one process and b is the receipt of the same message by another process, then a happens before b, and we say event a causally affects event b.

If event a causally affects event b and event b causally affects event c, then event a causally affects event c; the relation is transitive.

That means two processes can work independently if they do not send messages to each other.

For example:

process a:

create_book1
create_book2
get_books_count # return 2

process b:

create_book3
get_books_count # return 1

Each process only sees part of the whole system.

It's helpful because, from each process' perspective, its view is reasonable, and most of the time we do not need global information.

Why causal consistency is useful

Easy to program

Even though eventual consistency can offer good performance, it's too hard for developers to handle the consistency themselves.

Take a photo manager as an example:

c1 and c2 are in different data centers:

c1 uploads the photo and then the reference to the list:
   put(photo)
   put(list)

c2 reads:
   get(list)
   get(photo)

For eventual consistency, the data centers can apply the two writes in any order -- update the list first or create the photo first.

When a user reads the data, they may see that the new list has the photo's reference but still fail to read the new photo.

Causal consistency can avoid such things.

When c2 reads the list, it can read an old list that doesn't have the new photo reference, or it can read the new list and also read the new photo.

That's because put(photo) causally affects put(list).

So when c2 reads the list via get(list) and the list contains the new photo's reference, the local cluster must already have applied put(photo) to keep causal consistency, and get(photo) will return the photo.

Good performance

For strong consistency, we need to maintain a global order over all events, but for causal consistency, we only maintain partial orders.

That means we can achieve much more concurrency if we only guarantee causal consistency.

So causal consistency can provide better performance than strong consistency.

Conflict

Let's see an example:

process a:
set(x, 1) # event1
send_msg_to_process_b # event2

process b:
set(x, 2) # event3
receive_msg_from_process_a #event4

If process a doesn't send a message to process b, everything works smoothly: all events can happen in any order.

But when process b receives the message, it needs to determine whether x equals 1 or 2.

That means we need some conflict-solving mechanisms.

We need to consider process b receiving a message from process a, and also process a receiving a message from process b.

To avoid diverging forever, both process a and process b need to get the same result after executing the conflict-solving mechanism.

That is

Convergent conflict handling requires that all conflicting puts be handled in the same manner at all replicas, using a handler function h.[1]

Causal consistency plus convergent conflict handling is the causal+ consistency[1] of the paper.

Read-Only Transaction

We have seen that causal consistency can avoid some kinds of anomalies, but it's not enough.

Let's see the paper's example: a photo list with an ACL (Access Control List).

1 c1:
2 put(ACL)
3 put(picture)
4
5 c2:
6 read(ACL)
7 get(picture)

At line 2, c1 updates the ACL to remove c2 from the access list, and at line 3 it adds a sensitive photo to the list.

Then c2 reads the access list, but it may read from a different data center where the ACL hasn't been updated yet, so c2 still has access to the list of photos.

And when c2 then reads the list of photos, the list has already been updated, so it can see the sensitive photo.

This can happen because causal consistency only guarantees ordering, not consistent versions across keys.

To avoid such things, we need read-only transactions.

How it works

Architecture

[screenshots]

Datacenters are geo-distributed across countries, and each datacenter contains clients and a data store cluster.

A client connects to users, manages its dependencies, and sends requests to its local data store cluster.

A client reads and writes data only through the local data store cluster, and the data store clusters replicate data between one another asynchronously.

The clients here are COPS clients (backend servers), not browsers or apps.

How Simple Write and Read Works

Let's see the photo manager example again:

c1 uploads the photo and then the reference to the list:
   put(photo)
   put(list)

The write requests go to the client's local cluster and then propagate asynchronously to the remote clusters.

Writing to the local cluster: put(photo) sends put_after(key=photo_key, val='some-value', nearest=nil, vers=0) to the local store node; after the node commits the data, it sends a reply to the client.

After the client receives the reply, the client can update the list with put_after(key='list_key', val=with_the_photo_ref, nearest=[{key=photo_key, vers=0}]).
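A minimal sketch of how a client library might track the dependencies it attaches as nearest, based on the calls above; the cluster here is a fake in-memory stand-in and all details are simplified, not COPS code:

class FakeCluster:
    def __init__(self):
        self.store, self.version = {}, 0

    def put_after(self, key, value, nearest):
        # a real cluster commits locally, then replicates asynchronously;
        # remote clusters would check `nearest` (via check_deps) before applying
        self.version += 1
        self.store[key] = (value, self.version, nearest)
        return self.version

    def get(self, key):
        value, version, _ = self.store[key]
        return value, version


class CopsClient:
    def __init__(self, local_cluster):
        self.local_cluster = local_cluster
        self.deps = []                  # (key, version) pairs this client depends on

    def put(self, key, value):
        version = self.local_cluster.put_after(key, value, nearest=list(self.deps))
        self.deps = [(key, version)]    # the new write subsumes the earlier dependencies
        return version

    def get(self, key):
        value, version = self.local_cluster.get(key)
        self.deps.append((key, version))   # reads create dependencies too
        return value


c1 = CopsClient(FakeCluster())
photo_version = c1.put("photo:123", "...jpeg bytes...")
c1.put("list:albums", ["photo:123"])    # sent with nearest=[("photo:123", photo_version)]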

Write replication between clusters

After the Photo Node commits the data, it sends requests to the remote clusters. The Photo List Node does the same thing.

The remote clusters can receive those two requests in any order.

When the remote Photo List Node receives the request, it first sends a check_deps request to its local Photo Node to check the dependencies;

if the Photo Node replies OK, the Photo List Node commits the data.

[screenshot]

Solving Conflict

Causal+ consistency only captures part of the order, so we need mechanisms to solve conflicts.

COPS uses Lamport clocks to solve conflicts by default and allows the application to configure its own conflict handler.

A Lamport clock is a "logical clock". A simple implementation: each process maintains a counter, and whenever an event happens, it increments the counter:

clock += 1

When sending a message to another process, the sender passes its counter (the clock) along with the message.

And when a process receives the message, it advances its own clock:

clock = max(current_clock, message.clock) + 1

[screenshot]

For more detail, you can check Erlang Mnesia's implementation[2] or Lamport's paper[3].
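A minimal Lamport clock following the rules above (a sketch; the class and method names are made up):

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # a local event happened
        self.time += 1
        return self.time

    def send(self):                  # stamp an outgoing message
        return self.tick()

    def receive(self, msg_time):     # advance past the sender's clock
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()           # a's clock: 1
b.tick(); b.tick()     # b's clock: 2
print(b.receive(t))    # 3 -- greater than both the send and b's previous events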

Read-only transactions

The Read-Only Transaction section above explained why we need read-only transactions; now let's see how they are implemented.

get_trans is used to read multiple keys in one transaction, and the client library does this in 2 rounds.

In the first round, the client library requests the n keys concurrently.

The keys may belong to different shards, so we can't get them in one request.

And as the requests are sent concurrently, the results may belong to different "snapshots".

For example, {ACL0, version=10} <- [] and {Photo1, version=0} <- [{ACL1, version=11}]: the photo we read depends on a newer version of the ACL than the one we read.

To fix that, we compute the newest version of each key implied by the returned dependencies and then fetch those versions.

Notice that a dependency is a kind of order: if a depends on b, then b can't depend on a, and COPS has already resolved any conflicts.

So in the second round, after we fetch those newest versions, we get a consistent "snapshot".
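A minimal sketch of the two-round logic; the versioned store, the dependency metadata, and the stale first-round read are all simulated here, this is not COPS code:

store = {
    "ACL":   {10: "alice,bob", 11: "alice"},   # key -> {version: value}
    "photo": {0: "sensitive.jpg"},
}
deps = {                                       # (key, version) -> its dependencies
    ("ACL", 10): [], ("ACL", 11): [],
    ("photo", 0): [("ACL", 11)],
}

def get_trans(keys, first_round_versions):
    # round 1: concurrent reads; each replica returns whatever version it had
    results = {k: (store[k][v], v, deps[(k, v)]) for k, v in first_round_versions.items()}
    # newest version of each key implied by the returned dependencies
    needed = dict(first_round_versions)
    for _, _, dep_list in results.values():
        for dep_key, dep_version in dep_list:
            if dep_key in needed:
                needed[dep_key] = max(needed[dep_key], dep_version)
    # round 2: re-read any key whose round-1 version is older than required
    for k in keys:
        if results[k][1] < needed[k]:
            results[k] = (store[k][needed[k]], needed[k], deps[(k, needed[k])])
    return {k: results[k][0] for k in keys}

# round 1 happened to see the old ACL (v10) together with a photo that depends on ACL v11
print(get_trans(["ACL", "photo"], {"ACL": 10, "photo": 0}))
# -> {'ACL': 'alice', 'photo': 'sensitive.jpg'}: a snapshot consistent with ACL v11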

Summary

I'm curious how much effort is needed to use causal+ consistency in a real application and how we would solve conflicts there.

This consistency model is helpful and inspiring but not easy to understand; in industry, people may prefer a more straightforward solution.

References

  1. Wyatt Lloyd. Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS, SOSP 2011
  2. Erlang Mnesia Implementation https://github.com/erlang/otp/blob/40922798411c2d23ee8a99456f96d6637c62b762/lib/mnesia/doc/misc/implementation.txt
  3. L. Lamport. Time, clocks, and the ordering of events in a distributed system. 1978.
yfractal commented 1 year ago

Google Spanner Reading Note

Spanner: Google’s Globally-Distributed Database

OSDI 12, Ref: 801, Google, Inc.

TL;DR

  1. Problem/Goal: strong consistency, geographically distributed, fast reads

  2. Idea: external consistency via TrueTime; 2PC + Paxos; snapshot isolation

More Detail

External Consistency + TrueTime

We need a strong consistency model that works for geographically distributed databases.

It should allow users to write data in different data centers, and the data centers can be in different regions or even countries.

The external consistency guarantees: if T1 completes before T2 starts, T2 must see T1's writes.

The "before" refers to real or physical time.

In a distributed system we can't read the exact real time, so Google uses TrueTime plus the Start Rule and the Commit Wait Rule to determine such an order.

A TrueTime server returns an interval, and the server guarantees that the interval contains the relevant real time.

For example, if a client calls TT.now() at real time t0, the time server will return [earliest, latest], which guarantees earliest < t0 < latest.

Now let's see how to use the TrueTime to guarantee External Consistency.

External Consistency

There are two kinds of relationships between events: concurrent and non-concurrent.

For concurrent events, we can assign them an arbitrary order.

For non-concurrent events, we may need to keep their order to preserve correctness[2]; the Start Rule and the Commit Wait Rule are used for this purpose.

The Start Rule requires us to use the latest bound of TrueTime to assign the transaction timestamp: ts = TT.now().latest.

The Commit Wait Rule is: when we want to commit a transaction, we must wait until the time has definitely passed ts, i.e. TT.now().earliest > ts.

Suppose we have two transactions, tw and tr, tw is a write transaction, tr is a read transaction.

And tw completing before tr starts (external consistency's assumption) means abs(tw) < abs(tr), where abs() denotes absolute (real) time.

For the write transaction tw, the ts is assigned by the Start Rule, tw.ts = ts0.

To obey the Commit Wait Rule, before we commit we need to wait until ts0 has definitely passed, which means tw.ts = ts0 < abs(tw).

For the read transaction tr, we obey the Start Rule, which means tr.ts = TT.now().latest, and therefore abs(tr) < tr.ts.

And since we suppose abs(tw) < abs(tr), we get tw.ts < abs(tw) < abs(tr) < tr.ts, i.e. tw.ts < tr.ts, which is exactly external consistency's requirement.
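A small simulation of the Start Rule and Commit Wait using a fixed clock uncertainty; this is illustrative only, not Spanner code, and the simulated tt_now() uses the local clock as a stand-in for real time:

import time

EPSILON = 0.005   # assumed +/- 5 ms clock uncertainty

def tt_now():
    # TrueTime returns an interval guaranteed to contain the real time
    t = time.time()
    return t - EPSILON, t + EPSILON   # (earliest, latest)

def start_rule():
    # assign the transaction timestamp from the latest bound
    return tt_now()[1]

def commit_wait(ts):
    # wait until the timestamp has definitely passed: TT.now().earliest > ts
    while tt_now()[0] <= ts:
        time.sleep(EPSILON / 10)

tw_ts = start_rule()
commit_wait(tw_ts)        # after this wait, tw_ts < absolute commit time
tr_ts = start_rule()      # any transaction that starts later gets a larger timestamp
assert tw_ts < tr_ts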

2PC + Paxos -- How Read-Write Transactions Work

[screenshot]

Data is split into different shards; the shards (replicas) that hold the same data form a Paxos group, and each Paxos group has a leader.

Before starting 2PC, the client sends read requests to the shards' leaders and buffers the writes locally.

Each written shard's leader acquires locks and replicates the locks, the 2PC messages, and the data to its replicas.

The coordinator follows the Start Rule and the Commit Wait Rule.

Snapshot Isolation

For geographically distributed databases, reading from local can reduce latency and increase throughput.

Spanner chooses snapshot isolation to achieve those goals.

Snapshot isolation allows reading as of a chosen timestamp and does not block subsequent transactions.

References

  1. James C. Corbett. Spanner: Google’s Globally-Distributed Database, OSDI 2012
  2. M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 1990