zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0
281 stars 28 forks source link

Should Geni have support for Delta Lake or should that be a separate library? #314

Open erp12 opened 3 years ago

erp12 commented 3 years ago

Delta Lake brings a lot of crucial features into the Spark ecosystem. Some of the highlights include:

  1. ACID transactions.
  2. Time travel between data versions.
  3. Safe in-place updates, inserts, and upserts.
  4. Storage optimization.

In some toy projects, I have begun using Delta with Geni using Scala interop. I haven't worked out all the issues yet, but I think I am getting close to something that could be useful to others.

It would be great to get people's thoughts on the best way to manage Delta support. My specific question is: Would it be a good idea to add Delta Lake as an optional dep (like we do with xgboost) and implement the Delta Lake API directly in Geni?

If yes, I can move my code into a PR. If no, I can create a geni compatible Clojure lib specifically for Delta lake.

Some things to consider:

Personally, I think option 1 is best (integrate into Geni) but I don't want to create more work for other people if I am one of the only people who would benefit. :)

anthony-khong commented 3 years ago

Hi @erp12, thank you for the issue! Sorry for the delayed reply, I only just realised that you wrote this issue. It must have fallen between the cracks!

I'm not super familiar with Delta Lake, but it honestly sounds like a really good idea and slots it nicely with the general Spark API, so it makes sense to support it as an optional dependency!

As for with-dynamic-import, on suggestion would be to develop it without making it optional, then at the last minute, just add the with-dynamic-import. But you still run into the same problem when maintaining it.

I also wonder if it's possible to add an optional parent namespace, and have the dynamic import in another namespace..? 🤔 For instance, instead of having the zero-one.geni.ml.xgb namespace define the functions, we have zero-one.geni.optional.xgb do the regular import and functions, then have zero-one.geni.ml.xgb require zero-one.geni.optional.xgb wrapped in dynamic import. Not sure if this is even possible, because importing is quite funky.

erp12 commented 3 years ago

I have finished a branch that provides support of Delta. Unfortunately, Delta 0.8 is incompatible with Spark 3.1.x which prevents me from testing some of the core functionality without downgrading geni to spark 3.0.2.

I'm not sure what the best way forward is. We could revisit this after the next Delta release, or we could disable the incompatible functionality and leave a todo for re-enabling in the future.

anthony-khong commented 3 years ago

Hmm, honestly I'm not so sure myself if Geni should stay toe-to-toe with the latest version of Spark. I wonder if we should keep tabs on the latest, most compatible version of Spark with the ecosystem. @behrica brought this up as well previously with potential incompatibility with Spark NLP before they supported Spark 3.

So, I'm not against downgrading to Spark 3.0.2 if that means unlocking awesome other features from the ecosystem. I'd be very interested in what you and other Geni users think about this!

dakra commented 3 years ago

Very recently I started looking at switching our parquet lake to delta, so I appreciate this change very much.

Maybe it makes sense to have the Spark version the highest possible that works with all of the features that Geni provides (I guess mainly Spark NLP and Delta then?!) ? So, only if all libs can work with Spark 3.1 we upgrade.

Not too relevant for Geni but AWS EMR currently ships with max Spark 3.0.1 and since most of my other coworkers use pyspark we mainly develop and test against that.

erp12 commented 3 years ago

There are definitely trade-offs, and I'm not confident enough in any particular approach to make a concrete pitch.

In my opinion, the appropriate strategy is different depending on the scope of Geni. What problems is Geni trying to solve vs not trying to solve? I will share my interpretation based on my reading of the design goals, the "why" page, and the Scicloj interview with @anthony-khong.

Geni is:

Geni is not:

Feel free to correct me on this interpretation! Assuming I am correct that the primary goal of Geni is to do data science, I would say that Geni does not need to stay in sync with Spark, as long as it is built against a Spark version that enables its data science features. Furthermore, it might make sense to keep the Spark version as stable as possible so that data scientists can safely upgrade Geni without needing to upgrade their infrastructure.

Alternatively, if the primary goal of Geni is simply providing "Spark for Clojure" then it is important to be agnostic to the specific Spark version. At a minimum, the Spark versions that have a decent market share should be supported.


Stepping back from the question of Spark versions, I think there is a related topic that could use some more clarity as the project matures. Data science is a broad field, and it means different things to different people. In addition to stating the kind of person Geni is going to help (data scientists), can we describe the specific kinds of tasks that Geni is trying to help people do?

For example, when I first opened this issue I wasn't sure if the functionality that Delta provides makes sense for Geni or not. I know that Delta (and the similar Iceberg project) have become important to the Spark ecosystem (especially for data engineering and DataOps), but do the problems that are solved by these technologies overlap with the problems Geni is trying to solve?

It seems like "should Geni do ___?" is a common question. In addition to this Delta discussion, we have open discussions for GraphFrames (#321) and Spark NLP (#323). Also, Geni has already been integrated with other with non-spark tools like tech.ml.dataset, Excel, and XGBoost.

Most tools pick a narrower, less fuzzy, scope than "data science" in general. Even within the Spark project they decompose into independent modules based on problem domain (SparkSQL, mllib, streaming, graphx). That said, Geni doesn't have to do the same. I don't think it is bad thing if Geni is an all-in-one, batteries included, kitchen-sink of a platform that provides a bunch of stuff that is typically useful to people who call themselves data scientists. There are lots of choices with lots of trade-offs that will inform future development.

@anthony-khong I would be interested to hear what your vision for Geni's scope. What problems is Geni trying to solve, and what problems is Geni not trying to solve? I know that might seem like a vague question, and I can provide more specific probing questions if it would be helpful. :)

dakra commented 3 years ago

jFYI Delta has released version 1.0 (https://delta.io/news/delta-lake-1-0-0-released/) with support for Spark 3.1.