zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0

extend geni cli to "transform" data to arrow ? #286

Open behrica opened 3 years ago

behrica commented 3 years ago

One of the use cases I have in mind for Geni, and the reason I developed #284 as well, is to use Geni/Spark as a first step to transform "arbitrary data" into Arrow files (mainly for use in TMD).

Ideally I would have a CLI tool for this that performs the following operation:

(->
 (g/read-xxx! path)         ; xxx -> "parquet" or "csv" or ...
 (g/repartition n)
 (g/collect-as-arrow m dir))

Maybe the "geni" CLI could become this tool.

So it would be run as "geni repl", as it is now,

or alternatively like this:

"geni to-arrow xxxx.csv 10 50000 /tmp"

I would hope that this "simple" case is enough for most situations. Eventually the "transform" needs to be extended to allow two more things:

The first would require extending #284 so that it can write several Arrow files partitioned by group. I am not sure whether this is even possible while assuming big data and therefore limited heap space.

And to make it really useful, TMD needs "multi-file dataset support" for Arrow files in some form: https://github.com/techascent/tech.ml.dataset/issues/145

behrica commented 3 years ago

Maybe an easier pathway to the above is:

Maybe in this case #284 is not needed at all.

anthony-khong commented 3 years ago

I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see!

As for your CLI tool for Arrow conversion, I think it'd be straightforward to bake it into the current Geni CLI, so that geni gives you the REPL and geni :to-arrow $SOURCE_PATH $DESTINATION_PATH does the Arrow conversion. We could also add a number of built-in, frequently used mini apps like that to the CLI.
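To make the dispatch idea concrete, here is a minimal sketch in plain Clojure of how the two invocations might be told apart. It is only an illustration under stated assumptions: `parse-cli-args`, the namespace, and the keys of the returned map are hypothetical names that do not exist in Geni, and the actual Spark read and Arrow write are deliberately left out.

```clojure
(ns geni-cli-sketch)

(defn parse-cli-args
  "Hypothetical helper, not part of Geni: maps raw CLI arguments
  to a command map that a real entry point could dispatch on."
  [args]
  (let [[cmd & more] args]
    (cond
      ;; No arguments: keep the current behaviour and start the REPL.
      (nil? cmd)
      {:command :repl}

      ;; geni :to-arrow $SOURCE_PATH $DESTINATION_PATH
      (and (= cmd ":to-arrow") (= 2 (count more)))
      {:command     :to-arrow
       :source      (first more)
       :destination (second more)}

      ;; Anything else is reported back rather than guessed at.
      :else
      {:command :unknown :args (vec args)})))

(defn -main [& args]
  ;; A real CLI would start the REPL or run the conversion here;
  ;; this sketch only prints the parsed command.
  (println (parse-cli-args args)))
```

Returning a plain map keeps the parsing step separate from the Spark work, so further mini apps could be added as extra branches without touching the REPL path.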

behrica commented 3 years ago

Maybe it helps to think about Geni + TMD in three scenarios:

1) I want to use TMD interactively but I have initial big data:

2) I have big data, but I don't know how to filter it yet. Exploration of big data plus interactive work is needed.

3) I write a complex ETL job, starting from big data, but then I want to continue in TMD or other Clojure "in-memory" libraries.

behrica commented 3 years ago

A "geni repl" CLI tool would only support 1).

2) and 3) require other forms of integration between Geni and TMD:

behrica commented 3 years ago

I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see!

Bindings? Or making the Tablecloth API work with Geni?