zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0

spark-session cannot be changed (any more ?) #301

Open behrica opened 3 years ago

behrica commented 3 years ago

Following the minikube guide: https://github.com/zero-one-group/geni/blame/develop/docs/kubernetes_basic.md

the verification step at line 118 of the guide fails.

It seems that I cannot change the Spark session by calling g/create-spark-session.

I am pretty sure that it worked at some point.

behrica commented 3 years ago

I just saw that the default session is no longer a Delay.

So the SparkSession is initialized as soon as we require the namespace, and I suppose it cannot be changed after that.

So it cannot be re-configured.
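
To illustrate the difference described above, here is a minimal sketch in plain Clojure. It does not use Geni or Spark: `make-session` is a hypothetical stand-in for an expensive SparkSession constructor, and the counter exists only to make the evaluation time observable.

```clojure
;; Counts how many times the (hypothetical) session constructor has run.
(def init-count (atom 0))

;; Hypothetical stand-in for building a SparkSession from a config map.
(defn make-session [config]
  (swap! init-count inc)
  {:session-config config})

;; Eager: the body runs as soon as this form is loaded,
;; i.e. when the namespace is required.
(def eager-session (make-session {:master "local[*]"}))

;; Lazy: the body runs only on the first deref, and the result is cached.
(def lazy-session (delay (make-session {:master "local[*]"})))
```

After loading this code, `@init-count` is 1 and `(realized? lazy-session)` is false. The first `@lazy-session` runs the constructor once more; later derefs reuse the cached value. With a plain `def`, there is no point after the `require` at which a caller could still influence the configuration.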

anthony-khong commented 3 years ago

I see... I think we can change it back to a delay. Would you like to make a PR for that?

behrica commented 3 years ago

Maybe there is a better way.

Maybe the default "configuration map" for the session (https://github.com/behrica/geni/blob/482c4b934f037d32b849916211b509c94d89800e/src/clojure/zero_one/geni/defaults.clj#L5) could become an atom, which can be changed if needed before requiring the default namespace 'zero-one.geni.core'.

I think that the current feature to potentially change the session itself is not super useful, because Spark does not really support this cleanly, correct? If I have read it right, the Spark session is meant to be instantiated once in the lifetime of a JVM. I can try this out to see if it works.
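
A rough sketch of the atom idea, in plain Clojure with no Geni or Spark dependency. All names and keys here are hypothetical, not Geni's actual API, and the `delay` only stands in for the later moment at which the default namespace would be loaded and the session built.

```clojure
;; Hypothetical mutable default configuration (illustrative keys only).
(def session-config
  (atom {:master   "local[*]"
         :app-name "Geni App"}))

;; Hypothetical stand-in for building a SparkSession from a config map.
(defn build-session [config]
  {:built-with config})

;; The atom is read when the session is first needed, so changes made
;; before that point are picked up; later changes are not.
(def default-session
  (delay (build-session @session-config)))

;; A user overrides a setting before anything touches the session.
(swap! session-config assoc :master "k8s://https://example.invalid:8443")
```

Here `@default-session` is built with the overridden `:master`. Any `swap!` after the first deref has no effect on the session, which matches the constraint that the session is created once per JVM.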

behrica commented 3 years ago

I think it could work this way.

The issue would be keeping the "fully automatic" session configuration of the Geni CLI. My opinion is that the current way the Geni CLI initializes the session is brittle: it does not always work, and it depends on the order in which namespaces are required and functions are used.

I think we have three options for this:

  1. Do not make it fully automatic, but provide a function that needs to be called (init-default-spark or similar) -> this could then allow changing config settings

  2. Allow the Spark session configuration to be changed from outside the REPL by either:

    • reading a config file
    • taking config options in geni.sh

I still like the overall idea of the Geni CLI as a quick, user-friendly entry point, but it needs to allow arbitrary session configs (or we do not allow any custom session config for the Geni CLI and treat it as a "demo"). The other Spark shells can be fully configured from the command line (and they do not allow changing the session from inside either).
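
The options above could be combined roughly as follows. This is only a sketch in plain Clojure: `init-default-spark` is the hypothetical name from option 1, the config keys are illustrative, and the precedence (defaults, then config file, then explicit overrides such as command-line options) is an assumption rather than anything Geni does today.

```clojure
(require '[clojure.edn :as edn])

;; Illustrative defaults; the keys are assumptions, not Geni's actual map.
(def default-config
  {:master   "local[*]"
   :app-name "Geni App"})

(defn load-overrides
  "Reads an EDN map from the file at path, or returns {} when path is nil."
  [path]
  (if path
    (edn/read-string (slurp path))
    {}))

(defn init-default-spark
  "Hypothetical explicit initializer. Merges defaults, an optional EDN
  config file, and explicit overrides, in that order of precedence."
  [{:keys [config-file overrides]}]
  (let [config (merge default-config
                      (load-overrides config-file)
                      overrides)]
    ;; A real implementation would build the SparkSession here.
    {:session-config config}))
```

Calling `(init-default-spark {})` yields the defaults, while `(init-default-spark {:config-file "session.edn" :overrides {:app-name "My App"}})` would layer the file and then the overrides on top. Note that `merge` is shallow, so nested maps would be replaced wholesale rather than combined.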

erp12 commented 3 years ago

Here is a link to our previous discussions for reference

> I think that the current feature to potentially change the session itself is not super useful, because Spark does not really support this cleanly, correct? If I have read it right, the Spark session is meant to be instantiated once in the lifetime of a JVM.

You are correct. Typically, a user's Spark session settings would be set during the call to spark-submit. The default session settings in Geni will only be applied if no call to spark-submit is made (i.e. when running locally).

Most Spark usage (across all languages) happens by launching a Spark "application" (for example, a Geni REPL) on an existing Spark cluster. It is not expected that the Spark application creates its own cluster, and thus the session config is supplied when the .jar and main class are specified.

I'm not too familiar with Kubernetes, so I am having trouble following the guide. It looks like the Geni CLI is being started outside of spark-submit. I think the more traditional pattern would be to call spark-submit in the container for the cluster's driver and pass an uberjar of Geni and --class zero-one.geni.main along with any other spark session config you want.

I have had success with starting Geni REPLs on flintrock clusters using spark-submit.