wdm0006 / DummyRDD

A pure python mock of pyspark's RDD
http://wdm0006.github.io/DummyRDD/
BSD 3-Clause "New" or "Revised" License

NewAPIHadoopRDD Implementation #24

Closed wdm0006 closed 8 years ago

wdm0006 commented 8 years ago

cc: @htssouza

I use the newAPIHadoopRDD method on the Spark context for the Elasticsearch-Hadoop connector, and I'm working out how to mock it here. I see three ways to do it:

  1. Actually implement a Java caller with py4j or something similar
  2. Hack in support for the ES-Hadoop connector functionality in pure Python, serving just my needs
  3. Add a way to pass in a custom implementation of newAPIHadoopRDD as a patch, so users can set that functionality at runtime in their applications

Option 1 is definitely the best, but also the most difficult. Option 2 is a bit too specific to what I'm using this for. Option 3 is really flexible, but it breaks the drop-in compatibility with pyspark that we've maintained thus far.

I'm thinking of doing a plain string comparison on the input classes: if "elasticsearch" appears in them, I'll use the option 2 implementation, and otherwise raise a not-implemented error, with a method for setting a patched implementation as a fallback. If people are using any other classes or drivers, we can try to implement them the same way.
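A rough sketch of that dispatch, to make the proposal concrete. The parameter list mirrors pyspark's newAPIHadoopRDD signature, but MockSparkContext, set_hadoop_rdd_fallback, and _es_hadoop_rdd are hypothetical names used only for illustration, not existing DummyRDD code, and the Elasticsearch branch is a placeholder that returns an empty list. It assumes the patched fallback, when set, is tried before the error is raised.

```python
class MockSparkContext:
    """Hypothetical stand-in for the mocked context (not DummyRDD's real class)."""

    # Hypothetical hook: a user-supplied implementation set at runtime.
    _hadoop_rdd_fallback = None

    @classmethod
    def set_hadoop_rdd_fallback(cls, func):
        """Register (or clear, with None) a patched implementation."""
        cls._hadoop_rdd_fallback = func

    def _es_hadoop_rdd(self, conf):
        # Placeholder for the pure-Python ES-Hadoop emulation;
        # a real version would return a mocked RDD of (id, document) pairs.
        return []

    def newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass,
                        keyConverter=None, valueConverter=None,
                        conf=None, batchSize=0):
        class_names = (inputFormatClass, keyClass, valueClass)

        # Plain string comparison on the input classes.
        if any('elasticsearch' in name.lower() for name in class_names):
            return self._es_hadoop_rdd(conf or {})

        # Look the hook up on the class so it is not bound to the instance.
        fallback = type(self)._hadoop_rdd_fallback
        if fallback is not None:
            return fallback(inputFormatClass, keyClass, valueClass,
                            keyConverter, valueConverter, conf, batchSize)

        raise NotImplementedError(
            'newAPIHadoopRDD is not mocked for %s' % inputFormatClass)
```

With this shape, code that only reads through ES-Hadoop stays drop-in compatible, and anything else has to opt in explicitly by calling the hypothetical set_hadoop_rdd_fallback.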

Thoughts?

htssouza commented 8 years ago

I think number 2 is the best option at the moment. It would be similar to the hack you have already implemented to support S3 files in the textFile method.