jriekhof opened this issue 3 years ago
Huh... That's very odd... Thank you for flagging this up! I've added this to my TODO list. I think I'll have time to look into it next week. I hope that's okay!
Hi Anthony,
thank you for your quick reply! I figured out that I ran into this because of a small glitch in Cookbook chapter 4, where the initial CSV is read without {:kebab-columns true}. Just adding that option "fixed" it and is also consistent with the chapter's sample output. Still, it is interesting that it does not work without the kebab special-character mangling; I believe it should.
I just started with Spark and geni the other day, and I can already say it is one of the best APIs I have used in years, congrats! Such a lot of fun! I will certainly continue to use it. It also seems to be quite fast, even on my local machine.
Hope you find the cause of this quickly, have a good day!
Ciao
...Jochen
Hi Jochen, thank you for the kind words!! Glad you’re enjoying it. Please let me know if you have any feature requests.
On this issue, I think this is related: https://mungingdata.com/pyspark/avoid-dots-periods-column-names/
Basically, Spark doesn't like column names with dots. One thing we could do is auto-escape them, but I'm not sure that's the best solution, because you lose the correspondence with Spark. I'm leaning towards a 'safe-mode' that is on by default in the read functions and basically scans the column names for dots and replaces them with underscores 🤔
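Something quick along these lines, maybe (just a sketch; `dots->underscores` is a made-up helper, not anything in geni):

```clojure
(require '[clojure.string :as str]
         '[zero-one.geni.interop :as interop])

;; Hypothetical helper, not geni API: replace every dot in the column
;; names with an underscore and rebuild the Dataset with the new names.
(defn dots->underscores [dataset]
  (->> (.columns dataset)
       (map #(str/replace % "." "_"))
       interop/->scala-seq
       (.toDF dataset)))
```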
Hi Anthony...
thank you for digging into this! Yes, I think you are right; it does look like the issue described there.
My feeling is to just document this in the geni docs and recommend kebab-case for such cases. It fixes the Cookbook example very well :-).
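For reference, the Cookbook fix I mentioned, assuming the :kebab-columns read option behaves as in chapter 4 (the path is the one from my repro below):

```clojure
(require '[zero-one.geni.core :as g])

;; Reading with :kebab-columns normalises the headers, so the dotted
;; "Precip. Amount (mm)" becomes a plain "precip-amount-mm".
(def weather
  (g/read-csv! "data/cookbook/weather/weather-2012-3.csv"
               {:kebab-columns true}))

(g/select weather "precip-amount-mm") ; no backticks needed
```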
Ciao
...Jochen
Hi Anthony...
just a quick follow-up on our recent discussion regarding the dot/backtick column-name issue.
I read up a little on this in the Spark docs. In the Spark SQL reference I found Spark to be really restrictive, allowing just letters (a-zA-Z), digits, and underscores in regular identifiers: https://spark.apache.org/docs/latest/sql-ref-identifier.html
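In other words (just a small check I wrote down from that rule, nothing from geni):

```clojure
;; True when the name is a regular Spark SQL identifier: only letters,
;; digits, and underscores, per the identifier reference linked above.
(defn spark-regular-identifier? [column-name]
  (some? (re-matches #"[a-zA-Z0-9_]+" column-name)))

(spark-regular-identifier? "precip_amount_mm")    ;; => true
(spark-regular-identifier? "Precip. Amount (mm)") ;; => false
```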
So your :kebab-columns option handles this perfectly for Clojure, I think. An issue could remain for people exporting parquet files for use in other languages that prefer underscores.
Instead of :kebab-columns true, one could offer something like :snake-columns true. Using {:convert-column-names :kebab | :snake} might be a bit more elegant, but that would change the API :-).
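Purely to illustrate the shape of that option (hypothetical, not the current API; it reuses the ->normalized-columns helper from my REPL code below):

```clojure
;; Hypothetical dispatch for a :convert-column-names option; the styles
;; map onto the camel-snake-kebab converters used in the REPL code below.
(defn convert-column-names [dataset style]
  (case style
    :kebab (->normalized-columns dataset camel-snake-kebab.core/->kebab-case)
    :snake (->normalized-columns dataset camel-snake-kebab.core/->snake_case)
    dataset))
```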
I appended my REPL code (reusing your data-sources code) in case you would like to play with it?!
Have a good day Anthony!
Ciao
...Jochen
```clojure
(require '[zero-one.geni.core.data-sources :as ds]
         '[zero-one.geni.interop :as interop]
         'camel-snake-kebab.core)

(defn ->normalized-columns
  "Returns a new Dataset with all columns renamed using the passed rename-fn."
  [dataset rename-fn]
  (let [remove-punctuations #'ds/remove-punctuations ; access private vars
        deaccent            #'ds/deaccent
        new-columns         (->> dataset
                                 .columns
                                 (map remove-punctuations)
                                 (map deaccent)
                                 (map rename-fn))]
    (.toDF dataset (interop/->scala-seq new-columns))))

(comment
  ;; plain
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (g/select "Precip. Amount (mm)"))

  ;; kebab-columns
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (->normalized-columns camel-snake-kebab.core/->kebab-case)
      (g/select "precip-amount-mm"))

  ;; snake-columns
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (->normalized-columns camel-snake-kebab.core/->snake_case)
      (g/select "precip_amount_mm")))
```
Info
Problem / Steps to reproduce
Standard lein new geni …, the bitnami/spark 3.0.2 Docker image, then the code from geni Cookbook chapter 4.
The code from Cookbook chapter 4 fails with an ArityException.
A manual select of the column named "Precip. Amount (mm)" does not work either. It seems the column names internally carry backticks around them.
I also tried to rename all the columns, but the problem persists; the backticks are still there. Crashes:
```clojure
(g/select raw-weather-mar-2012 "Precip. Amount (mm)")
```

Works:

```clojure
(g/select raw-weather-mar-2012 "`Precip. Amount (mm)`")
```

Yet

```clojure
(g/column-names (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`"))
```

yields "Precip. Amount (mm)" without backticks. This led me to believe that there is some issue in geni or Spark with these column names.