Actually, perhaps
(def df (read-csv! "hdfs://some/path/on/hdfs/to/a/subdir/one-of-the-files.csv"))
is what I need, if I can resolve this error message:
Execution error (UnknownHostException) at org.apache.hadoop.security.SecurityUtil/buildTokenService (SecurityUtil.java:378).
Nice project!
I got it working for "wasb://" URLs to read from Azure Blob Storage.
This required quite a bit of digging into which jars to add and which configuration options to pass. I suppose working with HDFS might be similar.
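For reference, a rough sketch of the kind of setup this involves (not the exact config used here): the hadoop-azure and azure-storage jars are assumed to be on the classpath, geni's create-spark-session is assumed to accept a :configs map, and the account, container, and key below are placeholders.

(require '[zero-one.geni.core :as g])

;; spark.hadoop.* entries are copied into the underlying Hadoop configuration,
;; which is where the wasb:// filesystem looks up the storage-account key
(def spark
  (g/create-spark-session
    {:app-name "azure-blob-example"
     :configs  {:spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net
                "<storage-account-key>"}}))

;; the explicit-session arity of read-csv! is assumed here
(def df
  (g/read-csv! spark "wasb://mycontainer@myaccount.blob.core.windows.net/path/to/file.csv" {}))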
Hi @aaelony-catasys, thank you for raising the issue! I believe this should be possible. Geni is just handling Spark objects and calling Spark methods. If you see this article, it seems doable. Not sure if you are running into this issue, though.
@behrica, that sounds awesome! Would you mind sharing what config options to pass, and we can add it to the docs or the README? 😄
Actually, I think the issue is Kerberos-related. Trying to identify how to properly kinit from within Geni.
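One common approach (a sketch only, not confirmed as the fix for this particular issue) is to skip an external kinit entirely and log in from a keytab programmatically before any HDFS access; the principal and keytab path below are placeholders.

(import '(org.apache.hadoop.conf Configuration)
        '(org.apache.hadoop.security UserGroupInformation))

;; tell the Hadoop client to use Kerberos, then log in from a keytab
(let [conf (doto (Configuration.)
             (.set "hadoop.security.authentication" "kerberos"))]
  (UserGroupInformation/setConfiguration conf)
  (UserGroupInformation/loginUserFromKeytab
    "someuser@EXAMPLE.COM"
    "/path/to/someuser.keytab"))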
Hi @aaelony-catasys, are you still having issues with reading from an HDFS path?
Hi @anthony-khong, it will take me some time to get up to speed on Kerberos, Spark, Docker ports, etc., and how they interrelate. Unfortunately, I don't have spare cycles to devote to this in the near term. You might wish to close this ticket for the time being.
Best regards
Just making a note that I got this working today in the REPL against a Kerberized HDFS, so it is definitely possible. I'll compile some notes after I make sure everything works as a deployed job too.
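For the deployed-job side, the usual config-driven route (sketched here with placeholder values, and assuming geni's create-spark-session accepts a :configs map) is to hand Spark a principal and keytab so it can obtain and renew delegation tokens itself. The keys are spark.kerberos.principal / spark.kerberos.keytab in Spark 3.x (spark.yarn.* in earlier versions), and they can equally be passed via spark-submit's --principal and --keytab flags.

(require '[zero-one.geni.core :as g])

;; placeholder principal and keytab path
(def spark
  (g/create-spark-session
    {:app-name "hdfs-example"
     :configs  {:spark.kerberos.principal "someuser@EXAMPLE.COM"
                :spark.kerberos.keytab    "/path/to/someuser.keytab"}}))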
Does geni support reading directly from an HDFS path?
Is there something akin to the following?
... where
/some/path/on/hdfs/to/a/subdir/
is a path on HDFS that contains many files? Thanks in advance.
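For reference, a minimal sketch of the kind of call being asked about (mirroring the read-csv! call earlier in the thread; the namenode host and port are placeholders, and the cluster's core-site.xml/hdfs-site.xml are assumed to be on the classpath):

(require '[zero-one.geni.core :as g])

;; Spark's CSV reader accepts a directory and reads every file under it
(def df (g/read-csv! "hdfs://namenode:8020/some/path/on/hdfs/to/a/subdir/"))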