wix-incubator / kafka-connect-s3

A Kafka-Connect Sink for S3 with no Hadoop dependencies.

Add bzip2 support #21

Open robvadai opened 7 years ago

robvadai commented 7 years ago

bzip2 support is needed because Hadoop/Spark systems cannot distribute a job across workers when the data is loaded from GZip files: GZip is not a 'splittable' format. In Spark, for example, a GZip file is read as a single partition, so one has to repartition the RDD after loading it to spread the lines across workers. With the bzip2 format, which is splittable, this split happens automatically.
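
To illustrate what 'splittable' means here, below is a minimal sketch using only the Python standard library. It is not code from this connector or from Hadoop. It compresses some made-up records with bzip2 and scans the output for the 48-bit marker that starts every bzip2 block. A reader can search for such markers to find points where decompression can begin mid-file, which GZip does not offer. The function name `find_block_offsets` and the sample data are hypothetical.

```python
import bz2

# Every bzip2 block starts with this 48-bit marker (the BCD digits of pi).
BLOCK_MAGIC = bytes.fromhex("314159265359")


def find_block_offsets(compressed: bytes) -> list[int]:
    """Return the bit offsets at which the block marker appears."""
    n = len(compressed)
    as_int = int.from_bytes(compressed, "big")
    mask = (1 << (8 * n)) - 1
    offsets = []
    # Blocks are bit-aligned, not byte-aligned, so try all eight alignments.
    for shift in range(8):
        shifted = ((as_int << shift) & mask).to_bytes(n, "big")
        pos = shifted.find(BLOCK_MAGIC)
        while pos != -1:
            offsets.append(pos * 8 + shift)
            pos = shifted.find(BLOCK_MAGIC, pos + 1)
    return sorted(offsets)


# Made-up sample data: about 500 KB of newline-delimited records.
lines = "".join(f"record {i}\n" for i in range(40000)).encode()

# compresslevel=1 selects 100 KB blocks, so this input spans several blocks.
compressed = bz2.compress(lines, compresslevel=1)
offsets = find_block_offsets(compressed)

print(f"{len(lines)} bytes in, {len(compressed)} bytes out, "
      f"{len(offsets)} block markers found")
```

Each marker found is a candidate split boundary. In principle a 48-bit pattern could also occur by chance inside compressed data, so a real reader would need to validate each candidate rather than trust the scan alone.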

S3 is a common data source for Hadoop/Spark jobs (a straightforward use case with AWS EMR), so having bzip2 support would be essential. Other data ingestion tools, such as Apache Flume, already support bzip2 compression.