zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0
281 stars 28 forks source link

Allow structs in g/->dataset #335

Closed WaqasAliAbbasi closed 2 years ago

WaqasAliAbbasi commented 2 years ago

Info

Info Value
Geni Version 0.0.38

Problem / Steps to reproduce

Cannot use structs or list of structs in g/->dataset, example:

(-> (g/->dataset [{:id "bob"   :skills ["javascript" "clojure"] :preference {:remote true}  :experience [{:name "CompanyA" :years 2}
                                                                                                           {:name "CompanyB" :years 1}]}
                    {:id "alice" :skills ["python" "c++"]         :experience []}])
      g/show)

leads to

; Execution error (ClassCastException) at org.apache.spark.sql.catalyst.expressions.CastBase/buildCast (Cast.scala:295).
; class clojure.lang.Keyword cannot be cast to class [B (clojure.lang.Keyword is in unnamed module of loader 'app'; [B is in module java.base of loader 'bootstrap')

Running g/print-schema right before g/show above shows incorrect schema:

root
 |-- id: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- preference: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: binary (containsNull = true)
 |-- experience: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: array (containsNull = true)
 |    |    |    |-- element: binary (containsNull = true)

Expected Result

g/show should output:

+-----+---------------------+----------+------------------------------+
|id   |skills               |preference|experience                    |
+-----+---------------------+----------+------------------------------+
|bob  |[javascript, clojure]|{true}    |[{CompanyA, 2}, {CompanyB, 1}]|
|alice|[python, c++]        |null      |[]                            |
+-----+---------------------+----------+------------------------------+

g/print-schema should output:

root
 |-- id: string (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- preference: struct (nullable = true)
 |    |-- remote: boolean (nullable = true)
 |-- experience: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- years: long (nullable = true)

Proposed Solution

  1. Check for map? in infer-spark-type at https://github.com/zero-one-group/geni/blob/123aec55351adad893907ea4284e5b2e0ce43f45/src/clojure/zero_one/geni/core/dataset_creation.clj#L130
  2. If any maps exist in table, transform to array of values https://github.com/zero-one-group/geni/blob/123aec55351adad893907ea4284e5b2e0ce43f45/src/clojure/zero_one/geni/core/dataset_creation.clj#L165
WaqasAliAbbasi commented 2 years ago

I have managed to implement the solution locally, just need to add some tests.

anthony-khong commented 2 years ago

Hi @WaqasAliAbbasi, thanks for this! If I recall correctly, I did not think about this case. I'll check out your PR!

Are you using Geni as a dependency? Will you need a new release for these changes?

WaqasAliAbbasi commented 2 years ago

Hi @WaqasAliAbbasi, thanks for this! If I recall correctly, I did not think about this case. I'll check out your PR!

Are you using Geni as a dependency? Will you need a new release for these changes?

Thanks for looking into the PR, will update it soon.

Yes we use geni heavily in our team (its such a neat library, thanks for your work 😃) Will appreciate it if you push a release once this change is in.

anthony-khong commented 2 years ago

Thank you very much for the kind words. I'm really happy to hear that!

Please let me know once you're happy with the PR, I'll merge it and create a release!