zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0

Does geni support reading directly from an HDFS path? #228

Closed aaelony-catasys closed 4 years ago

aaelony-catasys commented 4 years ago

Does geni support reading directly from an HDFS path?

Is there something akin to the following?

(def df (read-from-hdfs "/some/path/on/hdfs/to/a/subdir/"))

... where /some/path/on/hdfs/to/a/subdir/ is a path on hdfs that contains many files?

thanks in advance.

aaelony-catasys commented 4 years ago

Actually, perhaps

(def df (read-csv! "hdfs://some/path/on/hdfs/to/a/subdir/one-of-the-files.csv"))

is what I need, if I can resolve an error message of:

Execution error (UnknownHostException) at org.apache.hadoop.security.SecurityUtil/buildTokenService (SecurityUtil.java:378).
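
One possible cause worth ruling out, independent of any security setup: in a URI of the form `hdfs://some/path/...`, the first segment after `//` is parsed as the host (the namenode), not as part of the path. The sketch below uses only `java.net.URI` to show how the string above is split. The helper name `uri-parts` and the `namenode.example.com:8020` address are made up for illustration, and whether this is the actual cause of the error here is an assumption.

```clojure
;; Illustration only: how java.net.URI splits an hdfs:// string.
;; `uri-parts` is a hypothetical helper, not part of geni or Spark.
(import 'java.net.URI)

(defn uri-parts
  "Returns the host and path components of a URI string."
  [s]
  (let [uri (URI. s)]
    {:host (.getHost uri)
     :path (.getPath uri)}))

;; Two slashes: "some" is treated as the host, and the path starts at /path.
(uri-parts "hdfs://some/path/on/hdfs/to/a/subdir/one-of-the-files.csv")

;; Three slashes: empty authority, so there is no host and the full path survives.
(uri-parts "hdfs:///some/path/on/hdfs/to/a/subdir/one-of-the-files.csv")

;; Explicit namenode (placeholder host and port).
(uri-parts "hdfs://namenode.example.com:8020/some/path/on/hdfs/to/a/subdir/one-of-the-files.csv")
```

If the host turns out to be the problem, passing either the three-slash form (which, as far as I know, makes Hadoop fall back to the configured `fs.defaultFS`) or a fully qualified namenode address to `read-csv!` may be worth trying.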

Nice project!

behrica commented 4 years ago

I got it working for "wasb://" URLs to read from Azure Blob Storage.

This required quite some digging into which jars to add and which configuration options to pass. I suppose that working with HDFS might be similar.

anthony-khong commented 4 years ago

Hi @aaelony-catasys, thank you for raising the issue! I believe this should be possible. Geni is just handling Spark objects and calling Spark methods. If you see this article, it seems doable. I'm not sure whether you are running into this issue, though.

@behrica, that sounds awesome! Would you mind sharing what config options to pass, and we can add it to the docs or the README? 😄

aaelony-catasys commented 4 years ago

Actually, I think the issue is Kerberos related. I'm trying to identify how to properly kinit from within geni.

anthony-khong commented 4 years ago

Hi @aaelony-catasys, are you still having issues with reading from an HDFS path?

aaelony-catasys commented 4 years ago

Hi @anthony-khong, it is a Kerberos issue that I haven't had the chance to look into in depth. I did find a few URLs to research here and here, but I don't know anything about Kerberos, so it might be a while before I can go the geni route; I need to get this resolved first.

aaelony-catasys commented 4 years ago

Hi @anthony-khong, it will take me some time to get up to speed on Kerberos, Spark, Docker ports, etc. and how they interrelate. Unfortunately, I don't have spare cycles to devote to this in the near term. You might wish to close this ticket for the time being.

Best regards

gnarroway commented 3 years ago

Just making a note that I got this working today in the REPL against a kerberized HDFS, so it is definitely possible. I’ll compile some notes after I make sure everything works as a deployed job too.