zxs / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

Minimize data loss window when master replicators fail #715

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
1. To which tool/application/daemon will this feature apply?

Tungsten Replicator

2. Describe the feature in general

Tungsten Replicator operates asynchronously, which means that we are prone to 
losing data when a master fails before it replicates to a slave.  On MySQL (and 
other DBMS servers) there are two principal cases: 

a.) A transaction was committed to the DBMS but Tungsten did not extract it to 
the THL yet. 
b.) A transaction was extracted into the THL but not transferred to the slave. 

In the case where a master node fails completely, these cases are impossible to 
mitigate without introducing synchronous replication, hence forcing applications 
to take a latency hit.  Instead, we would like to eliminate, or at least 
minimize, data loss in any case where the replicator is still alive and able to 
read server logs and distribute THL records to clients. 

Specifically we would like to certify the following replicator behavior:  

a.) If the replicator has access to server logs in any form, it should extract 
all events from these logs. 
b.) Similarly, the replicator should continue to serve up THL records to 
clients. 
c.) There should be a mechanism for management clients to get the last 
available transaction from the master so that they can wait for it to appear on 
slaves. 

3. Describe the feature interface

Replicator pipelines will be extended to support continued extraction and 
serving of THL records after a DBMS failure.  

We should also consider an extension to the 'flush()' JMX call to return the 
last available transaction on the master.  This will allow admin clients to 
fetch the last seqno and wait for it to appear on slaves, after which they 
can complete failover. 
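The failover sequence described above can be sketched roughly as follows. This is an illustrative assumption only: the class and method names (FailoverSketch, flushLastSeqno, hasReachedSeqno) are hypothetical stand-ins, not the actual Tungsten Replicator API, and a real implementation would go through the replicator's JMX interface.

```java
// Hypothetical sketch of the proposed failover flow: the extended flush()
// call returns the last seqno available in the master's THL, and failover
// proceeds only once a slave has applied that seqno.  All names here are
// illustrative assumptions, not real Tungsten Replicator APIs.
public class FailoverSketch {

    // Stand-in for the extended flush() JMX call: on a master whose DBMS
    // has failed but whose replicator is still alive, this would return
    // the seqno of the last transaction extracted into the THL.
    static long flushLastSeqno(long lastExtractedSeqno) {
        return lastExtractedSeqno;
    }

    // Stand-in for the admin client's wait: true once the slave has
    // applied the target seqno and failover can safely complete.
    static boolean hasReachedSeqno(long slaveAppliedSeqno, long target) {
        return slaveAppliedSeqno >= target;
    }

    public static void main(String[] args) {
        long target = flushLastSeqno(1042);   // last THL record on the failed master
        System.out.println("slave at 1042 ready: " + hasReachedSeqno(1042, target));
        System.out.println("slave at 1039 ready: " + hasReachedSeqno(1039, target));
    }
}
```

In practice the admin client would poll (or block on) the slave's applied seqno rather than make a single check, which is where the race conditions discussed below come in.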

4. Give an idea (if applicable) of a possible implementation

This requires a proper design, which will be added later.  There are numerous 
race conditions here that require careful consideration.  

The design must also include a very strong set of tests, including unit tests, 
to ensure that we winkle out timing problems at the lowest possible level. 

5. Describe pros and cons of this feature.

5a. Why the world will be a better place with this feature.

Data loss is an unfortunate side effect of failures on busy systems.  
Eliminating the obvious cases will eliminate a number of administrative 
procedures and simplify recovery from failures.  It will also avoid 
inconvenience to users. 

5b. What hardship will the human race have to endure if this feature is
implemented.

The implementation must be designed carefully.  This type of problem carries a 
risk of deadlocks arising as side effects of checks on replication progress. 

6. Notes

Original issue reported on code.google.com by robert.h...@continuent.com on 24 Sep 2013 at 10:56

GoogleCodeExporter commented 9 years ago
This issue supersedes Issue 502, which is a duplicate. 

Original comment by robert.h...@continuent.com on 30 Sep 2013 at 12:29

GoogleCodeExporter commented 9 years ago

Original comment by robert.h...@continuent.com on 30 Sep 2013 at 12:30

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 23 Dec 2013 at 9:51

GoogleCodeExporter commented 9 years ago

Original comment by robert.h...@continuent.com on 5 May 2014 at 11:27

GoogleCodeExporter commented 9 years ago
Will no longer use the third version digit for normal releases. It will only be 
incremented for maintenance ones.

Original comment by linas.vi...@continuent.com on 26 May 2014 at 5:01

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 19 Jan 2015 at 2:18