zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0
281 stars, 28 forks

Adds partial Delta Lake support #328

Open erp12 opened 3 years ago

erp12 commented 3 years ago

Adds support for the Delta Lake storage format. This PR is best reviewed one commit at a time.

Currently, some core functionality of Delta is disabled due to a known incompatibility between Spark 3.1 and Delta 0.7+. Once this incompatibility is addressed in the next Delta release, the commented-out code in this PR should be uncommented. All commented-out code has been tested on Spark 3.0.2 + Delta 0.8.

I chose to implement the Delta API using with-dynamic-import, but I am not confident in this decision and would love feedback. My editor fails to recognize any symbol defined within with-dynamic-import, which hinders navigation and refactoring.

Some things I would like to do before calling this PR done:

dakra commented 2 years ago

@erp12 Can you update the code for Delta 1.0.0?

I use this PR sporadically for ad-hoc analysis and it works well. Thanks for it :)

dakra commented 2 years ago

Just want to mention that Delta 1.1 is already out, and it supports Spark 3.2.

dakra commented 1 year ago

@erp12 @anthony-khong Delta is now at version 2.1.1, which supports Spark 3.3.

It's obviously difficult to keep everything up-to-date and compatible with each other, especially when even minor version increases apparently break things. FWIW, at least Delta and AWS EMR support Spark 3.3, which was released in June.

Are there plans to upgrade the geni Spark version to 3.3? Once that's done, we could consider merging this PR as well. I'm using it pretty regularly for personal queries and debugging, and it works pretty well.

Thanks, Daniel