zxs / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

Support custom CSV string output formats #844

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
1. To which tool/application/daemon will this feature apply?

Tungsten Replicator

2. Describe the feature in general

CSV string formatting is important when loading to Hadoop because tools like 
Hive use the string representation directly.  There are two main places where 
this causes a problem: 

a.) String formats in CSV result in values that are hard to compare with 
originating values on MySQL and Oracle
b.) Other tools that load data to Hadoop such as sqoop may use different 
formatting, resulting in inconsistent values. 

The replicator needs to add support for pluggable string formatting when 
writing to CSV. 

3. Describe the feature interface

Data sources will have a property to support a pluggable string formatter for 
writing to CSV.  This will be a property of the data source as opposed to 
classes like SimpleBatchApplier that may use the formatter. 

Users can set the property directly.  Otherwise it will default to a reasonable 
value for the data source.  For Hadoop this should be to use formats that are 
consistent with sqoop. 

4. Give an idea (if applicable) of a possible implementation

String formatting classes will implement a general interface called 
CsvStringFormatter.  There will be multiple implementations.  
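
As a rough illustration of the idea, the sketch below shows one possible shape for such an interface with two implementations. Only the name CsvStringFormatter comes from this proposal; the format() method, both implementation class names, and the date pattern are assumptions made for the example and are not taken from the replicator code.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical interface: the name is from the proposal, the method is assumed.
interface CsvStringFormatter {
    // Returns the CSV string for a value, or null for a null value.
    String format(Object value);
}

// Assumed implementation 1: rely on each object's own toString().
class PlainCsvStringFormatter implements CsvStringFormatter {
    public String format(Object value) {
        return value == null ? null : value.toString();
    }
}

// Assumed implementation 2: render dates with a fixed pattern and time zone
// so that output does not depend on the host's default settings.
class FixedDateCsvStringFormatter implements CsvStringFormatter {
    private final SimpleDateFormat dateFormat;

    FixedDateCsvStringFormatter(String pattern, TimeZone timeZone) {
        dateFormat = new SimpleDateFormat(pattern, Locale.ROOT);
        dateFormat.setTimeZone(timeZone);
    }

    public String format(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof Date) {
            // java.sql.Timestamp and java.sql.Date also extend java.util.Date.
            return dateFormat.format((Date) value);
        }
        return value.toString();
    }
}
```

A writer would then hold a CsvStringFormatter reference and call format() for each column value, so swapping implementations changes the output without touching the writer itself.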

5. Describe pros and cons of this feature.

5a. Why the world will be a better place with this feature.

This feature will enable smoother integration with other Hadoop loading tools 
as well as better comparison against source systems.  Also it will make CSV 
formatting more flexible so that we can add new data source types more easily. 

5b. What hardship will the human race have to endure if this feature is
implemented.

A little effort to implement and test the feature. 

6. Notes

Original issue reported on code.google.com by robert.h...@continuent.com on 4 Mar 2014 at 4:29

GoogleCodeExporter commented 9 years ago
This feature has been implemented as follows.  

1. Data sources now contain the name of a "formatter" class, which is 
responsible for converting Java objects to a string representation suitable 
for CSV.  The property setting is shown below: 

# CSV data formatter.  This is the class responsible for translating
# from Java objects to CSV strings.  The data format can vary independently
# from the CSV type based on where data are extracted from or the types of 
# tools that will process data. 
replicator.datasource.applier.csvFormatter=com.continuent.tungsten.replicator.csv.DefaultCsvDataFormat

2. The current implementation formats data so that values are simple to 
compare with MySQL sources.  Users can provide their own formatters by 
implementing the following interface: 

  com.continuent.tungsten.replicator.csv.CsvDataFormat

The implementation in 
com.continuent.tungsten.replicator.csv.DefaultCsvDataFormat provides a good 
example of how to create your own.  

3. Once further implementations are available, users can include them in the 
configuration using tpm as shown in the following update example: 

   tpm update myservice --property=replicator.datasource.applier.csvFormatter=com.continuent.tungsten.replicator.csv.DefaultCsvDataFormat

4. Internally all replicator CSV classes are in a single package named 
com.continuent.tungsten.replicator.csv with a corresponding unit test.  Any 
further additions to CSV processing belong in this package.  
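
For anyone writing their own formatter, here is a minimal sketch of how a class named in the csvFormatter property could be loaded and used. It is not the replicator's actual code: SketchCsvDataFormat is a stand-in for the real CsvDataFormat interface (whose methods may differ), and BooleanAsIntFormat and FormatterLoader are hypothetical names invented for the example. Only the property name is taken from the setting above.

```java
import java.util.Properties;

// Stand-in for com.continuent.tungsten.replicator.csv.CsvDataFormat.
// The real interface may declare different methods; this one is assumed.
interface SketchCsvDataFormat {
    String csvString(Object value);
}

// Hypothetical custom formatter: writes booleans as 1/0 (an assumed
// convention for this example), everything else via toString().
class BooleanAsIntFormat implements SketchCsvDataFormat {
    public String csvString(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof Boolean) {
            return ((Boolean) value) ? "1" : "0";
        }
        return value.toString();
    }
}

// Hypothetical loader: instantiates the formatter class named in the property.
class FormatterLoader {
    static final String PROPERTY = "replicator.datasource.applier.csvFormatter";

    static SketchCsvDataFormat load(Properties props) {
        String className = props.getProperty(PROPERTY);
        if (className == null) {
            throw new IllegalArgumentException("Missing property: " + PROPERTY);
        }
        try {
            // Requires a no-argument constructor on the formatter class.
            return Class.forName(className)
                    .asSubclass(SketchCsvDataFormat.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load formatter " + className, e);
        }
    }
}
```

Setting the property to BooleanAsIntFormat.class.getName() and calling FormatterLoader.load(props) returns an instance of the custom class. Because the class is resolved by name at run time, a new formatter only needs to be on the classpath and named in the property.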

Original comment by robert.h...@continuent.com on 8 Mar 2014 at 3:24

GoogleCodeExporter commented 9 years ago

Original comment by robert.h...@continuent.com on 8 Mar 2014 at 3:38

GoogleCodeExporter commented 9 years ago

Original comment by robert.h...@continuent.com on 8 Mar 2014 at 3:51

GoogleCodeExporter commented 9 years ago
This feature was implemented successfully. 
Every aspect of the CSV is configurable either by changing the datasource 
values or from the command line.
In fact, this flexibility was exploited by tungsten-sandbox to deploy the 
'fileapplier' topology with custom CSV.

Original comment by g.maxia on 18 Sep 2014 at 3:36

GoogleCodeExporter commented 9 years ago
There are already a few tests for this feature in the regression suite.

Original comment by g.maxia on 18 Sep 2014 at 3:37

GoogleCodeExporter commented 9 years ago
A note has been added to the 3.0 release notes.

Original comment by mc.br...@continuent.com on 13 Oct 2014 at 9:19