Additional columns/fields for Hadoop loading

zxs / tungsten-replicator

Automatically exported from code.google.com/p/tungsten-replicator


1. To which tool/application/daemon will this feature apply?

CSV Batch loader

2. Describe the feature in general

Add two new columns to the CSV content so that each row can be processed according to its origin:

+ Schema (useful for cross-schema loading and merging)
+ Source name or hostname

In addition, the CSV filenames should contain the source hostname to prevent 
collisions when writing from multiple hosts into the same schema directory 
within HDFS.
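
As a rough illustration of the request above, the following Python sketch appends the two proposed columns to each row and embeds the source hostname in the file name. It is not the replicator's code: the function names, the column order (new columns appended last), and the `schema/table/table-host-seqno.csv` naming scheme are all assumptions made for the example.

```python
import csv
import io


def add_source_columns(rows, schema, source_host):
    """Append the schema and source host to every row.

    Assumption: the new columns go at the end of the row.
    """
    return [list(row) + [schema, source_host] for row in rows]


def csv_filename(schema, table, source_host, seqno):
    """Build a file name that embeds the source host.

    Hypothetical naming scheme; the real loader may lay files out differently.
    """
    return f"{schema}/{table}/{table}-{source_host}-{seqno}.csv"


def to_csv(rows):
    """Serialise rows to CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    rows = add_source_columns([[1, "widget"]], "sales", "host1")
    print(csv_filename("sales", "orders", "host1", 42))
    print(to_csv(rows), end="")
```

Because the hostname is part of the name, two hosts writing the same table and sequence number into the same schema directory produce different files.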

3. Describe the feature interface

These should probably be standard columns, as the info is lightweight.

4. Give an idea (if applicable) of a possible implementation

+ Add the two columns to the main CSV applier.
+ Expose and/or add the source name to the JS script so that it can build the 
  filenames properly.
+ Update the main GitHub tools to support the additional columns for 
  processing.

5. Describe pros and cons of this feature.

5a. Why the world will be a better place with this feature.

Will enable parallel loading from multiple hosts into one Hadoop instance 
without collisions, and allow for selective querying/merging during 
materialisation.

5b. What hardship will the human race have to endure if this feature is
implemented.

Slightly larger file sizes.

6. Notes

Original issue reported on code.google.com by mc.br...@continuent.com on 9 Apr 2014 at 4:40

Column addition could perhaps be addressed using a filter that adds columns at specific locations in the table. It is also possible to extend the batch loader, but I believe this type of information is valuable any time users are aggregating data, and filters will work anywhere.

Adding source names to files could be addressed using the existing partition-by support (see Issue 840) and partitioning by source. In this case the names would actually be directories.

One final consideration: if we add this information dynamically, it should somehow flow back to schema generation. There is an argument for putting schema information into the replication flow and letting filters operate not only on row changes but also on metadata as they flow by. The appliers can be extended to generate Hive schema (for example) at apply time or somewhere along the way.
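
To make the filter and partition-by-source ideas in this comment more concrete, here is a small Python sketch. It only illustrates the data transformation, not the replicator's filter API: `insert_columns`, `partition_dir`, the `source=` directory prefix, and the base path are hypothetical names chosen for the example.

```python
def insert_columns(row, additions):
    """Return a copy of row with extra values inserted at given positions.

    additions is a list of (index, value) pairs. Indexes refer to positions
    in the resulting row and are applied in ascending order.
    """
    out = list(row)
    for index, value in sorted(additions):
        out.insert(index, value)
    return out


def partition_dir(base, schema, table, source):
    """Build a per-source directory, so the source name is a directory
    rather than part of the file name (hypothetical layout)."""
    return f"{base}/{schema}/{table}/source={source}"


if __name__ == "__main__":
    # Put the schema first and the source host last.
    print(insert_columns([1, "widget"], [(0, "sales"), (3, "host1")]))
    print(partition_dir("/staging", "sales", "orders", "host1"))
```

With this layout each source writes into its own directory, which avoids file name collisions without changing how individual files are named.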


Additional columns/fields for Hadoop loading #864