zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0
283 stars 28 forks source link

added collect-to-arrow support #284

Closed behrica closed 3 years ago

behrica commented 3 years ago

refactored

more data types and refactored

more data types and refactored

make use of type hints

added docstring

clean up

adapted to latest upstream

coomit auto-changing file

codecov-io commented 3 years ago

Codecov Report

Merging #284 into develop will decrease coverage by 0.221%. The diff coverage is 93.750%.

Impacted file tree graph

@@              Coverage Diff               @@
##            develop      #284       +/-   ##
==============================================
- Coverage   100.000%   99.778%   -0.222%     
==============================================
  Files            36        37        +1     
  Lines          3050      3162      +112     
  Branches          0         7        +7     
==============================================
+ Hits           3050      3155      +105     
- Partials          0         7        +7     
Impacted Files Coverage Δ
src/clojure/zero_one/geni/arrow.clj 93.693% <93.693%> (ø)
src/clojure/zero_one/geni/core.clj 100.000% <100.000%> (ø)
anthony-khong commented 3 years ago

Thank you for the PR! This looks great! Let me review the code 😄

behrica commented 3 years ago

I addressed most of your points, thanks for comments.

I left function typed-action as-is. I will have a look, on how to address duplication while preserving type hints.

behrica commented 3 years ago

I merged manually all your cosmetic changes. So this branch should have them all.

behrica commented 3 years ago

Please have a look and lt me know, if fine. I can the rebase it as well to squash all commits

anthony-khong commented 3 years ago

Hi @behrica, I believe this is good to merge! Just one final thing to pass the CI jobs:

lein cljfmt fix src test doc

Then it should be good! Thank you for the awesome PR, this is a great addition to the library!

behrica commented 3 years ago

Should I write some lines here:

https://github.com/zero-one-group/geni/blame/6c540e4b19a0e888faecb8da4b5f8c52fc437c74/docs/cookbook/part_01_reading_and_writing_datasets.md#L122

to explain the different options Geni gives to "collect" data to the driver ? (or in an other place)

We could mention as well, that there is now tight integration with TMD, as an other form to "collect" data to work with it further.

behrica commented 3 years ago

I added some docu , please have a look

anthony-khong commented 3 years ago

Great! It's looking good! I'll merge as soon as the pipeline passes.

As for the docs, I suppose it's good to have. When we add TMD support in Geni, we should be able to do TMD-Spark interop from within Geni (i.e. no extra requires). When that happens, we just change the doc!

Again, thank you so much for this PR!!