I use the newAPIHadoopRDD method on the SparkContext for the Elasticsearch-Hadoop connector, and I'm working out how to mock that here. My thinking is that there are three ways to do it:
1. Actually implement a Java caller in py4j or something similar.
2. Hack in support for the ES-Hadoop connector functionality in pure Python, serving just my needs.
3. Add a way to pass a custom implementation of newAPIHadoopRDD as a patch, so users can set that functionality at runtime in their applications.
Option 1 is definitely the best but also the most difficult. Option 2 is a bit too specific to my use case. Option 3 is really flexible, but it breaks the drop-in compatibility with PySpark we've maintained thus far.
I'm thinking of basically just doing a string comparison on the input classes: if "elasticsearch" appears in them, I'll implement option 2; otherwise I'll raise a NotImplementedError. I'd also add a method for registering a patched implementation as a fallback. If there are any other classes/drivers people are using, we can try to implement them similarly.
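A minimal sketch of what that dispatch could look like. All names here (MockSparkContext, register_hadoop_rdd_impl, _es_rdd) are hypothetical, and the method signature is simplified relative to the real PySpark one; this is just to illustrate combining option 2 with the option-3 fallback:

```python
# Hypothetical sketch: string-comparison dispatch inside a mock
# SparkContext's newAPIHadoopRDD, with a runtime-patchable fallback.
# None of these names come from the real project; they are assumptions.

class MockSparkContext:
    # user-registered implementations, keyed by input format class name (option 3)
    _custom_impls = {}

    @classmethod
    def register_hadoop_rdd_impl(cls, input_format_class, impl):
        """Let users patch in their own implementation at runtime."""
        cls._custom_impls[input_format_class] = impl

    def newAPIHadoopRDD(self, inputFormatClass, keyClass, valueClass,
                        conf=None, **kwargs):
        # a user-supplied patch takes precedence over built-in handling
        if inputFormatClass in self._custom_impls:
            return self._custom_impls[inputFormatClass](self, conf, **kwargs)
        # option 2: recognize the ES-Hadoop connector by string comparison
        if "elasticsearch" in inputFormatClass.lower():
            return self._es_rdd(conf, **kwargs)
        raise NotImplementedError(
            "No mock implementation for %s; register one with "
            "register_hadoop_rdd_impl()" % inputFormatClass)

    def _es_rdd(self, conf, **kwargs):
        # placeholder for the pure-Python ES reading logic (option 2)
        return []
```

The fallback check running before the string comparison means users can also override the built-in Elasticsearch handling if it doesn't fit their setup.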
I think number 2 is the best option at this moment. It would be similar to the hack you have already implemented to support S3 files in the textFile method.
cc: @htssouza
Thoughts?