Open behrica opened 3 years ago
Maybe an easier pathway to the above is:
Maybe in this case #284 is not needed at all.
I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see!
As for your CLI tool for Arrow conversion, I think it'd be straightforward to bake it into the current Geni CLI, so that `geni` gives you the REPL and `geni :to-arrow $SOURCE_PATH $DESTINATION_PATH` does the Arrow conversion. And we could add a number of built-in, frequently used mini apps like that to the CLI.
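To make the proposed command shape concrete, here is a minimal sketch of the dispatch logic only. Everything in it is an assumption for illustration: `geni_dispatch` is a hypothetical function name, the real Geni CLI is not implemented as a shell script, and the sketch prints the action it would take rather than starting Spark or writing Arrow files.

```shell
#!/bin/sh
# Hypothetical dispatcher illustrating the proposed CLI shape.
# It only echoes the chosen action; no REPL or conversion is started.
geni_dispatch() {
  if [ "$#" -eq 0 ]; then
    # No arguments: fall through to the REPL, as the CLI does today.
    echo "repl"
    return 0
  fi
  case "$1" in
    :to-arrow)
      # The conversion mini app needs exactly a source and a destination.
      if [ "$#" -ne 3 ]; then
        echo "usage: geni :to-arrow SOURCE_PATH DESTINATION_PATH" >&2
        return 2
      fi
      echo "to-arrow $2 $3"
      ;;
    *)
      # Any other keyword is rejected until a mini app is registered for it.
      echo "unknown command: $1" >&2
      return 2
      ;;
  esac
}
```

Further built-in mini apps would then be additional `case` branches, which keeps the bare `geni` invocation unchanged for existing users.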
Maybe it helps to think about Geni + TMD in three scenarios:
1) I want to use TMD interactively but I have initial big data:
2) I have big data, but I don't know how to filter it yet. Exploration of big data + interactive work is needed
3) I write a complex ETL job, starting from big data, but then I want to continue in TMD or other Clojure "in-memory" libraries
A "geni repl" CLI tool would only support 1).
2) and 3) require other forms of integration of Geni and TMD:
I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see!
Bindings? Or making the Tablecloth API work with Geni?
One of the use cases I have in mind for "geni", and the reason I developed #284, is to use Geni/Spark as a first step to transform "arbitrary data" into Arrow files (mainly for use in TMD).
Ideally I would have a CLI tool for this, which does the following operation:
Maybe the "geni" CLI could become this tool.
So it gets run as "geni repl" -> as now
or alternatively like this:
I would hope that this "simple" case is enough for most cases. Eventually the "transform" needs to be extended to allow two more things:
The first would require extending #284 to write several Arrow files, partitioned by group. I am not sure whether this is even possible while assuming big data and therefore "limited heap space".
And to make it really useful, TMD needs "multi-file dataset support" for Arrow files in some form: https://github.com/techascent/tech.ml.dataset/issues/145