Enable slave replicator to begin replication automatically at current position of master

GoogleCodeExporter commented 9 years ago

1. To which tool/application/daemon will this feature apply?

Tungsten Replicator

2. Describe the feature in general

When replicating to Hadoop or any other type of store that treats updates as a 
feed, it is very useful to allow the slave service to come up and start 
replication from the current master position, whatever that happens to be.  
Tungsten Replicator needs an option like the following to do this: 

trepctl online -from-current

Here is an example of how this would be used to provision a set of tables into 
Hadoop and also catch all updates thereafter. 

a.) Set up replication on the master. 
b.) Configure the slave service and start up with the -from-current option.  
The slave begins applying all changes into HDFS. 
c.) Run sqoop to fetch data from those same tables and provision to base tables 
in HDFS. 
d.) Run map/reduce on Hadoop to merge incoming updates with the base table data 
from sqoop to create a point-in-time materialized view.  
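
To make step d.) concrete, here is a rough in-memory sketch of the merge. It is 
not the actual map/reduce job. The change record layout `(seqno, op, key, row)`, 
the op codes `I`/`U`/`D`, and the function name are all made up for 
illustration, and it assumes each change carries a full row image so that 
re-applying a change the sqoop dump already contains is harmless. 

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical change record: (seqno, op, key, row).
# op is "I" (insert), "U" (update) or "D" (delete); row is None for deletes.
Change = Tuple[int, str, str, Optional[dict]]


def merge_snapshot(
    base: Dict[str, dict],
    changes: Iterable[Change],
    up_to_seqno: Optional[int] = None,
) -> Dict[str, dict]:
    """Apply replicated changes, in seqno order, on top of the base table rows.

    If up_to_seqno is given, changes after that seqno are ignored, which
    yields the view as of that point in time.
    """
    view = dict(base)  # do not modify the caller's base table
    for seqno, op, key, row in sorted(changes, key=lambda c: c[0]):
        if up_to_seqno is not None and seqno > up_to_seqno:
            break
        if op == "D":
            view.pop(key, None)  # deleting a missing key is a no-op
        else:
            view[key] = row  # inserts and updates both overwrite the full row
    return view
```

Because inserts and updates overwrite the whole row and deletes tolerate 
missing keys, the overlap between the sqoop dump and the early part of the 
feed does not need special handling in this sketch. 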

3. Describe the feature interface

The basic feature interface is to add a "from-current" option when going 
online.  This will be interpreted as follows: 

On a Tungsten master - start replication from the current point in the DBMS 
log.  This is already the default behavior for Oracle and MySQL extraction. 

On a Tungsten slave using RemoteTHLExtractor to receive updates - start 
replication at the latest position in the master THL.

4. Give an idea (if applicable) of a possible implementation

The from-current behavior could be a pipeline property so that if we have a 
proper restart position we obey that (of course) but otherwise start at the 
current position of whatever the pipeline is extracting from.  
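
The decision described above could look roughly like the following. This is 
only a sketch of the rule, not replicator code: the function and parameter 
names are hypothetical, positions are modeled as plain integers, and the 
fallback to the earliest available position when the property is off is an 
assumption. 

```python
from typing import Optional


def choose_start_position(
    restart_position: Optional[int],
    source_current: int,
    source_earliest: int,
    from_current: bool,
) -> int:
    """Pick where a pipeline should begin extracting.

    restart_position: stored restart point, or None if there is none.
    source_current:   current (latest) position of the upstream source.
    source_earliest:  earliest position the upstream source still has.
    from_current:     the proposed pipeline property.
    """
    if restart_position is not None:
        # A proper restart position always wins.
        return restart_position
    if from_current:
        # No restart point: begin at whatever the source is at right now.
        return source_current
    # Assumed fallback when the property is not set.
    return source_earliest
```

For example, `choose_start_position(None, 500, 0, True)` picks 500, while any 
stored restart position is returned unchanged regardless of the property. 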

5. Describe pros and cons of this feature.

5a. Why the world will be a better place with this feature.

It makes setting up any service that treats replication updates as a feed very 
easy.  This is the case for Hadoop as illustrated above but applies to many 
other types of replication, for example if a replicator is used to feed 
notifications of changes from a DBMS into another system like Salesforce. 

5b. What hardship will the human race have to endure if this feature is
implemented.

We need to ensure it does not somehow get triggered by accident, which would 
cause major problems.  One possible protection is to install a lock on the THL 
that rejects event threads that "skip" seqno values without an intervening 
filtered event.  To get replication to work in this case you would need to 
clear the slave log and its restart point. 
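
A minimal sketch of such a lock, to show the intended behavior. The class and 
method names are hypothetical, and it assumes a filtered event can be modeled 
as a single event that covers a range of seqno values (`filtered_to` is the 
last seqno in that range). 

```python
from typing import Optional


class SeqnoGapError(Exception):
    """Raised when an event would leave a hole in the log."""


class GapGuard:
    """Rejects events that skip seqno values, unless the log is empty."""

    def __init__(self, last_seqno: Optional[int] = None) -> None:
        # None means the log and restart point have been cleared.
        self.last_seqno = last_seqno

    def accept(self, seqno: int, filtered_to: Optional[int] = None) -> None:
        """Admit an event, or raise SeqnoGapError if it skips seqno values.

        filtered_to marks a filtered event covering seqno..filtered_to.
        """
        if filtered_to is not None and filtered_to < seqno:
            raise ValueError("filtered_to must not be less than seqno")
        if self.last_seqno is not None and seqno != self.last_seqno + 1:
            raise SeqnoGapError(
                f"expected seqno {self.last_seqno + 1}, got {seqno}"
            )
        # A filtered event advances the log to the end of its range.
        self.last_seqno = seqno if filtered_to is None else filtered_to
```

An empty guard accepts any first seqno, which matches the idea that you must 
clear the slave log and restart point before starting from the current master 
position. After that, a jump such as 510 to 600 is rejected unless a filtered 
event covering 511 through 599 arrives in between. 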

6. Notes

Original issue reported on code.google.com by robert.h...@continuent.com on 5 Mar 2014 at 6:10

Blocking: #356

GoogleCodeExporter commented 9 years ago

Based on team review, we have decided to include the following option 
improvements: 

* Rename `online -base-seqno X` to `online -master-start-seqno X` -- The 
current name is very confusing to users. 

* Add `online -from-seqno X` to have the slave fetch a particular sequence 
number.  It will probably require a 'force' option to suppress the epoch number 
check.  Otherwise there is a risk of corrupting slaves. 

* Add `online -from-current` to have the slave start from the latest seqno in 
the master.

Original comment by robert.h...@continuent.com on 6 Mar 2014 at 5:45

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 2 Jun 2014 at 5:53

Now blocking: #356

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 19 Dec 2014 at 7:03

Added labels: FixedIn-3.1.0
Removed labels: FixedIn-3.0.0

GoogleCodeExporter commented 9 years ago

Original comment by linas.vi...@continuent.com on 19 Jan 2015 at 2:18

Added labels: FixedIn-4.0.1
Removed labels: FixedIn-3.1.0
