zxs / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator

Additional columns/fields for Hadoop loading #864


GoogleCodeExporter commented 9 years ago
1. To which tool/application/daemon will this feature apply?

CSV Batch loader

2. Describe the feature in general

Two new columns should be added to the CSV content so that it can be processed downstream:

+ Schema (useful for cross-schema loading and merging)
+ Source name or hostname

In addition, the CSV filenames should contain the source hostname to prevent 
collisions when multiple hosts write into the same schema directory 
within HDFS.
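A minimal sketch of the filename idea (the helper name and naming scheme are hypothetical, not the applier's actual API): embedding the source hostname in the staged CSV filename makes concurrent writes from different hosts collision-free.

```javascript
// Hypothetical helper: build a per-source staging filename so that two
// replicators feeding the same schema directory never overwrite each other.
function stagedCsvName(schema, table, sourceHost, seqno) {
  // e.g. <schema>/<table>-<host>-<seqno>.csv
  return schema + "/" + table + "-" + sourceHost + "-" + seqno + ".csv";
}

console.log(stagedCsvName("sales", "orders", "db1.example.com", 42));
// sales/orders-db1.example.com-42.csv
```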

3. Describe the feature interface

These should probably be standard columns, as the information is lightweight.

4. Give an idea (if applicable) of a possible implementation

+ Add the two columns in the main CSV applier.
+ Expose and/or add the source name in the JS script so that the filenames 
can be built properly.
+ Update the main GitHub tools to support the additional columns during 
processing.
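The column-addition step above can be sketched as follows (illustrative only; the row shape and function name are assumptions, not the applier's real interface): the schema name and source hostname are simply prepended to each CSV row before staging.

```javascript
// Hypothetical sketch: prepend the two proposed metadata columns
// (schema name, source hostname) to each CSV row before it is staged.
function addMetadataColumns(row, schemaName, sourceHost) {
  return [schemaName, sourceHost].concat(row);
}

var row = ["1001", "2014-04-09", "shipped"];
console.log(addMetadataColumns(row, "sales", "db1.example.com").join(","));
// sales,db1.example.com,1001,2014-04-09,shipped
```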

5. Describe pros and cons of this feature.

5a. Why the world will be a better place with this feature.

This will enable parallel loading from multiple hosts into a single Hadoop 
instance without collisions, and allow for selective querying/merging during 
materialisation.

5b. What hardship will the human race have to endure if this feature is
implemented.

Slightly larger file sizes.

6. Notes

Original issue reported on code.google.com by mc.br...@continuent.com on 9 Apr 2014 at 4:40

GoogleCodeExporter commented 9 years ago
Column addition could perhaps be addressed using a filter that adds columns at 
specific locations in the table.  It is also possible to extend the batch 
loader, but I believe this type of information is valuable any time users are 
aggregating data, and filters will work anywhere. 

Adding source names to files could be addressed using the existing partition-by 
support (see Issue 840) and partitioning by source.  In this case the names 
would actually be directories.  
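The directory-per-source layout suggested here might look like the following (a sketch under assumed naming conventions; the base path and `source=` prefix are illustrative, not taken from the partition-by implementation in Issue 840):

```javascript
// Hypothetical layout: one subdirectory per source host, so files from
// different replicators land in distinct partitions of the same table path.
function partitionedPath(baseDir, schema, table, sourceHost, fileName) {
  return [baseDir, schema, table, "source=" + sourceHost, fileName].join("/");
}

console.log(partitionedPath("/user/tungsten/staging", "sales", "orders",
  "db1.example.com", "orders-42.csv"));
// /user/tungsten/staging/sales/orders/source=db1.example.com/orders-42.csv
```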

One final consideration: if we add this information dynamically, it should 
somehow flow back to schema generation.  There is an argument for putting 
schema information into the replication flow and letting filters operate not 
only on row changes but also on metadata as it flows by.  The appliers can then 
be extended to generate Hive schema (for example) at apply time or somewhere along the way. 
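Generating Hive schema at apply time could be sketched roughly as below (the column-metadata shape and function name are assumptions for illustration, not the replicator's internal types):

```javascript
// Hypothetical sketch: turn replicated column metadata into a Hive
// CREATE TABLE statement, including the two proposed metadata columns.
function hiveDdl(schema, table, columns) {
  var cols = columns.map(function (c) {
    return "  `" + c.name + "` " + c.hiveType;
  }).join(",\n");
  return "CREATE TABLE IF NOT EXISTS `" + schema + "`.`" + table + "` (\n" +
         cols + "\n) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','";
}

console.log(hiveDdl("sales", "orders", [
  { name: "schema_name", hiveType: "STRING" },
  { name: "source_host", hiveType: "STRING" },
  { name: "order_id",    hiveType: "BIGINT" }
]));
```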

Original comment by robert.h...@continuent.com on 9 Apr 2014 at 9:11