extend geni cli to "transform" data to arrow ?

behrica commented 3 years ago

One of the use cases I have in mind for "geni", and the reason I developed #284, is to use geni/spark as a first step to transform "arbitrary data" into arrow files (mainly for use in TMD).

Ideally I would have a cli tool for this, which performs the following operations:

(-> (g/read-xxx! path)            ; xxx -> "parquet" or "csv" or ...
    (g/repartition n)
    (g/collect-as-arrow m dir))   ; collect-as-arrow is the function proposed in #284

Maybe "geni" cli could become this tool.

So it would be run as "geni repl", as it is now,

or alternatively like this:

geni to-arrow xxxx.csv 10 50000 /tmp
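To make the proposed invocation concrete, here is a minimal sketch of how its four positional arguments could be parsed. It assumes they line up with the pipeline above in order (source file, n for g/repartition, m and dir for g/collect-as-arrow). The function name and the map keys are hypothetical, and the thread does not say what m stands for, so it is kept opaque.

```clojure
(require '[clojure.string :as str])

(defn parse-to-arrow-args
  "Turns the positional arguments of the proposed `to-arrow` command into a map.
  Key names are hypothetical; the thread only fixes the argument order."
  [[source n m dir :as args]]
  (when-not (= 4 (count args))
    (throw (ex-info "usage: geni to-arrow SOURCE N M DIR" {:args args})))
  {:source     source
   ;; file extension, used to pick the reader ("csv", "parquet", ...)
   :format     (some-> (re-find #"\.([^./]+)$" source) second str/lower-case)
   :partitions (Long/parseLong n)   ; n, handed to g/repartition
   :m          (Long/parseLong m)   ; m, handed to g/collect-as-arrow
   :dir        dir})                ; output directory for the arrow files
```

Calling `(parse-to-arrow-args ["xxxx.csv" "10" "50000" "/tmp"])` yields a map that a real implementation could pass on to the Geni pipeline.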

I would hope that this "simple" case is enough for most uses. Eventually the "transform" would need to be extended to allow two more things:

specify group-by columns and write arrow files partitioned
specify arbitrary "filter" criteria to shrink the data

The first would require extending #284 so that it can write several arrow files, partitioned by the groups. I am not sure whether this is even possible while assuming big data and therefore "limited heap space".

And for this to be really useful, TMD needs "multi-file dataset support" for arrow files in some form: https://github.com/techascent/tech.ml.dataset/issues/145

behrica commented 3 years ago

Maybe an easier pathway to the above is:

let geni/spark do everything and let it write parquet files to disk
write a cli tool which can convert a "directory of parquet" files into a "directory of arrow files"

Maybe in this case #284 is not needed at all.
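The second step of this pathway, turning a directory of parquet files into a directory of arrow files, starts with deciding which output file each input maps to. The sketch below covers only that planning step and leaves the actual conversion out. The function names and the ".arrow" extension are assumptions for illustration, not part of Geni or TMD.

```clojure
(require '[clojure.string :as str])

(defn arrow-target
  "Path of the arrow file to write for one parquet file name."
  [dst-dir parquet-name]
  (str dst-dir "/" (str/replace parquet-name #"\.parquet$" ".arrow")))

(defn plan-conversion
  "Pairs every *.parquet file name with its arrow target.
  Other entries, such as Spark's _SUCCESS marker, are skipped."
  [file-names dst-dir]
  (->> file-names
       (filter #(str/ends-with? % ".parquet"))
       sort
       (mapv (fn [f] {:from f :to (arrow-target dst-dir f)}))))
```

Because each input file maps to exactly one output file, a tool built this way could convert the files one at a time, so memory use would be bounded by the largest single parquet file.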

anthony-khong commented 3 years ago

I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see!

As for your CLI tool for Arrow conversion, I think it'd be straightforward to bake it into the current Geni CLI, so that geni gives you the REPL and geni :to-arrow $SOURCE_PATH $DESTINATION_PATH does the Arrow conversion. We could also add a number of built-in, frequently used mini apps like that to the CLI.
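One way to picture this is a small dispatch table keyed by the first argument. The sketch below is not the actual Geni CLI code. The names `mini-apps`, `to-arrow-app` and `dispatch` are hypothetical, and the `:to-arrow` handler only describes the job instead of running a conversion.

```clojure
(defn to-arrow-app
  "Placeholder for the Arrow conversion; it only describes the requested job."
  [[source destination]]
  {:command :to-arrow :source source :destination destination})

(def mini-apps
  "Registry of built-in sub-commands, keyed by the first CLI argument."
  {":to-arrow" to-arrow-app})

(defn dispatch
  "No arguments means the REPL; otherwise look the sub-command up in the registry."
  [[command & more]]
  (cond
    (nil? command)                {:command :repl}
    (contains? mini-apps command) ((mini-apps command) more)
    :else (throw (ex-info (str "Unknown command: " command)
                          {:known (keys mini-apps)}))))
```

With this shape, `(dispatch [])` selects the REPL and `(dispatch [":to-arrow" "in" "out"])` selects the conversion. Adding another mini app would mean adding one entry to the registry.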

behrica commented 3 years ago

Maybe it helps to think about Geni + TMD in three scenarios:

1) I want to use TMD interactively but I have initial big data:

I know perfectly how to filter my big dataset (no exploration of big data needed)

2) I have big data, but I don't know how to filter it yet. Exploration of big data + interactive work is needed

3) I write a complex ETL job, starting from big data, but then I want to continue in TMD or other Clojure "in-memory" libraries

behrica commented 3 years ago

A "geni repl" cli tool would only support 1).

2) and 3) require other forms of integration between Geni and TMD:

exchange of parquet files on disk
collect-to-arrow
collect-to-TMD (as recently implemented by Chris in TMD)

behrica commented 3 years ago

> I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see!

Bindings? Or making the Tablecloth API work with Geni?

zero-one-group / geni

extend geni cli to "transform" data to arrow ? #286