zero-one-group / geni

A Clojure dataframe library that runs on Spark
Apache License 2.0

Document Azure Blob Storage support #256

Open behrica opened 3 years ago

behrica commented 3 years ago

> @behrica, that sounds awesome! Would you mind sharing what config options to pass, and we can add it to the docs or the README? 😄

Originally posted by @anthony-khong in https://github.com/zero-one-group/geni/issues/228#issuecomment-703344915

behrica commented 3 years ago

In which form should we address this?

I would think about a continuation of https://github.com/zero-one-group/geni/blob/develop/docs/kubernetes_basic.md

At the end of the Kubernetes setup, the next natural question is: how do I read data files?

There are quite a few options:

  1. Copy them "somehow" to the same place on all pods of the cluster (and the driver) and access them as local files
  2. Mount Azure blob storage
  3. Potentially many other options, which I have not tried:
    • hdfs
    • copy files on Kubernetes nodes and mount into pods
    • Use Azure Data Lake

All of these are a bit complex and depend on the concrete setup. It might be necessary to change and rebuild Docker images to add dependencies, and to worry about many other Spark/Kubernetes/Azure-specific details.

And they have little to do with "geni" itself, and are documented elsewhere. From a "geni" point of view, only 2 things change:

  1. The options passed to g/create-spark-session
  2. The "url" passed to (g/read...)

Maybe the best approach is just to point this out at the end of kubernetes_basic.md, without giving a solution (because there are so many).

behrica commented 3 years ago

I added a chapter into the Kubernetes documentation accordingly:

https://github.com/behrica/geni/blob/develop/docs/kubernetes_basic.md

anthony-khong commented 3 years ago

Hi @behrica, thank you again for bringing this up.

> I would think about a continuation of https://github.com/zero-one-group/geni/blob/develop/docs/kubernetes_basic.md

Yes, I think that's probably a good place to put it. I think some docs do get a bit long, which is fine.

> And they have little to do with "geni" itself, and are documented elsewhere.

In which case, we can perhaps link to the documents?

I still think it's good to have an example of a working version. I'm happy to try out one of your examples and try to get it working!

> I added a chapter into the Kubernetes documentation accordingly:

Would you like to make a PR? I believe there are some typos - I hope you're okay with it being reviewed 😄

behrica commented 3 years ago

I made a PR. I am not a native English speaker, so feel free to fix it.

behrica commented 3 years ago

This does not yet add anything about reading the files from storage. I am still figuring out how to write an example that is neither too simplistic nor too complex.

All realistic examples would need to assume the existence of some form of "cloud storage" for the data.

anthony-khong commented 3 years ago

> I am not a native English speaker, so feel free to fix it.

I think it's great! I'm just reviewing the styling, so that it's a bit more consistent throughout the repo 😄

> All realistic examples would need to assume the existence of some form of "cloud storage" for the data.

Ah I see, in which case it may become a bit too involved to set up. Could we work with some public Azure Storage files? I'm not sure what's available out there, though.

behrica commented 3 years ago

I did a complete walk-through, which starts from "zero" and goes up to analysing a 10 GB CSV file stored in a newly created Azure File Storage with Geni on an AKS cluster: https://github.com/behrica/geni/blob/develop/docs/kubernetes_azureStorage.md

All commands are there, but I did not write any text yet.

@anthony-khong Do you have a way to try it out and give me some feedback?

If you copy and paste all the commands one after another, it should all work.

Starting from scratch made it rather long, but this way it is easily reproducible.

anthony-khong commented 3 years ago

Hi @behrica, absolutely, I'll give it a go in the coming days and report back to you! It looks really neat!

anthony-khong commented 3 years ago

Hi @behrica, I've given it a go, and, as before, it works as expected! And I've never used Azure before, so that's great! Really looking forward to merging this.

I've got some comments and feedback.

Please let me know if you'd like me to chip in on some of these. I'd be very happy to work on it!

I think this is a really cool guide. I would love to make a lein template that has everything here in it. All you do is make install and lein run, and boom, you're doing stuff on an Azure cluster.

behrica commented 3 years ago

Thanks for the feedback, I will take it on board when writing the text. There is one piece possibly missing; not sure what you think.

Working in a "kubectl exec" terminal is not the most comfortable experience in the world.

So what I personally do is start an nREPL server in the driver node instead of shelling into it.

I can then connect to it remotely from Emacs or any other nREPL client. This works securely via "kubectl port-forward".

I could add this as an optional step at the end.
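
A minimal sketch of that workflow, under stated assumptions: it assumes the driver image has Leiningen available (the thread does not say which tool starts the nREPL server), and it reuses port 12345 from the port-forward command given later in this thread. The function names and the LEIN and NREPL_PORT variables are hypothetical helpers for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: Leiningen as the nREPL launcher is an assumption.
LEIN="${LEIN:-lein}"               # overridable, e.g. LEIN=echo for a dry run
NREPL_PORT="${NREPL_PORT:-12345}"  # fixed port, so it can be forwarded later

# Run inside the driver pod (for example as part of the container command):
# starts an nREPL server on a fixed port without an interactive prompt.
start_nrepl_server() {
  $LEIN repl :headless :port "$NREPL_PORT"
}

# Run on the laptop once "kubectl port-forward" maps the same port to localhost.
connect_nrepl_client() {
  $LEIN repl :connect "localhost:${NREPL_PORT}"
}
```

Because "kubectl port-forward" tunnels to the pod's loopback interface, the server in this sketch does not need to bind to an externally reachable address. Any nREPL client (Emacs/CIDER, for instance) can take the place of connect_nrepl_client.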

anthony-khong commented 3 years ago

I think that's a really good point. Copying and pasting to the terminal is not the end of the world, but it would absolutely be a deal breaker for some people. I think if the kubectl port-forward approach works, we should add it in!

I think we can have a main-like script that gets executed during startup - it can look like Geni's main, but with a different [init-eval](https://github.com/zero-one-group/geni/blob/develop/src/clojure/zero_one/geni/main.clj#L24) where we create a SparkSession that connects to an Azure cluster, we fix the port instead of picking a random one, then on a separate terminal instance we do kubectl port-forward? In any case, it is just a suggestion, please do whatever you think is best 😄

behrica commented 3 years ago

> I think that's a really good point. Copying and pasting to the terminal is not the end of the world, but it would absolutely be a deal breaker for some people. I think if the kubectl port-forward approach works, we should add it in!

> I think we can have a main-like script that gets executed during startup - it can look like Geni's main, but with a different [init-eval](https://github.com/zero-one-group/geni/blob/develop/src/clojure/zero_one/geni/main.clj#L24) where we create a SparkSession that connects to an Azure cluster, we fix the port instead of picking a random one, then on a separate terminal instance we do kubectl port-forward? In any case, it is just a suggestion, please do whatever you think is best

I am still not sure whether I want to go the Geni CLI way. I can clearly see its advantages, being a single executable, especially for newcomers.

But how long will it take until I want to add dependencies to it?

behrica commented 3 years ago

> I think that's a really good point. Copying and pasting to the terminal is not the end of the world, but it would absolutely be a deal breaker for some people. I think if the kubectl port-forward approach works, we should add it in!

> I think we can have a main-like script that gets executed during startup - it can look like Geni's main, but with a different [init-eval](https://github.com/zero-one-group/geni/blob/develop/src/clojure/zero_one/geni/main.clj#L24) where we create a SparkSession that connects to an Azure cluster, we fix the port instead of picking a random one, then on a separate terminal instance we do kubectl port-forward? In any case, it is just a suggestion, please do whatever you think is best

I did it without using the Geni CLI, just by adding the nREPL start to the command used by the string container.

This is also a potentially realistic usage scenario, in which the :

This may still be an enterprise scenario, as the Kubernetes cluster costs money for as long as it exists. But it can be configured to autoscale down to 1 or 2 nodes, and then it costs little.
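
One way to configure the autoscaling mentioned above is the Azure CLI's "az aks update" command. The following is a sketch, assuming a single-node-pool AKS cluster; the function name, the AZ override variable, and the resource group and cluster names in the usage note are hypothetical, and the walk-through itself may set this up differently.

```shell
#!/usr/bin/env bash
# Sketch only: resource names are placeholders supplied by the caller.
AZ="${AZ:-az}"   # overridable, e.g. AZ=echo for a dry run

# Enable the cluster autoscaler so the cluster can shrink to a few nodes when idle.
# Arguments: resource group, cluster name, minimum nodes (default 1), maximum nodes (default 3).
enable_autoscaler() {
  local resource_group="$1" cluster="$2" min_nodes="${3:-1}" max_nodes="${4:-3}"
  $AZ aks update \
    --resource-group "$resource_group" \
    --name "$cluster" \
    --enable-cluster-autoscaler \
    --min-count "$min_nodes" \
    --max-count "$max_nodes"
}
```

For example, enable_autoscaler my-resource-group my-aks-cluster 1 2 would let the cluster scale between one and two nodes. Clusters with several node pools are configured per pool instead, which this sketch does not cover.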

behrica commented 3 years ago

We could potentially make a bash script, which does the whole setup "on keypress".

Including copying a data file into the blob storage (this can be the most time-consuming part).

This would be more attractive for users who don't want to have a long-running Kubernetes cluster / blob storage.

anthony-khong commented 3 years ago

> I did it without using the Geni CLI, just by adding the nREPL start to the command used by the string container.

Yes, this is exactly what I meant! I agree with you, Geni CLI serves a simple use case to get started up and running quickly (and most realistically on a local machine).

Instead of a bash script, what do you think about making a lein template where you could just do lein setup-azure and lein repl, which starts the nREPL server ready to be port-forwarded and connected to your text editor on your laptop? Not sure if lein is divisive now though, because many people have moved to Clojure CLI tools.

What's the most effective way for me to help you with this? It sounds like a great addition to the library, and I would love to chip in here.

behrica commented 3 years ago

> > I did it without using the Geni CLI, just by adding the nREPL start to the command used by the string container.

> Yes, this is exactly what I meant! I agree with you, Geni CLI serves a simple use case to get started up and running quickly (and most realistically on a local machine).

> Instead of a bash script, what do you think about making a lein template where you could just do lein setup-azure and lein repl, which starts the nREPL server ready to be port-forwarded and connected to your text editor on your laptop? Not sure if lein is divisive now though, because many people have moved to Clojure CLI tools.

I thought about this. The setup script is mostly calls to "az", which we could easily shell out to, or we could even use the proper Java/Clojure client. But it also has calls to "docker", and it seems strange not to do those in "bash"...

I have a running bash script ready; I will share it with you so you can have a look. Then we can discuss whether we should put it into another form.

behrica commented 3 years ago

Please find here the working setup script:

https://github.com/behrica/geni/blob/azure_storage_doc/docs/azureSetup/setupKubernetes.sh

It does all the tasks from here: https://github.com/behrica/geni/blob/azure_storage_doc/docs/kubernetes_azureStorage.md

in one go. At the beginning there are some parameters to set, if needed. It also downloads a 10 GB file at the end. This can take quite a while, 30 minutes or more.

After it finishes, you can do 2 port forwards (in different shells):

kubectl port-forward pod/geni 12345:12345 -n spark

and

kubectl port-forward pod/geni 4040:4040 -n spark

to have the nREPL port and the Spark web UI proxied on your local machine.
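
If two separate shells are inconvenient, both forwards can also be started from one script. This is a sketch rather than part of the setup script: the pod name, namespace, and ports come from the two commands above, while the forward_ports function and the KUBECTL override variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the two port-forward commands above in one helper.
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. KUBECTL=echo for a dry run
NAMESPACE="spark"
POD="pod/geni"

# Forward each given port from the pod to the same port on localhost.
forward_ports() {
  local port
  for port in "$@"; do
    # Each forward runs in the background so several can be active at once.
    $KUBECTL port-forward "$POD" "${port}:${port}" -n "$NAMESPACE" &
  done
  # Background jobs in a script ignore Ctrl-C, so stop them explicitly.
  trap 'kill $(jobs -p) 2>/dev/null' INT TERM
  # Block until the forwards exit or the script is interrupted.
  wait
}
```

Calling forward_ports 12345 4040 at the end of the script would then proxy the nREPL port and the Spark web UI together until interrupted.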

behrica commented 3 years ago

I am not sure in which form the script could be reused: either "as is" or with some modifications.

The concrete setup has so many "moving parts" that a user may want to do differently. Maybe it is best to see it as an addition to the docs: https://github.com/behrica/geni/blob/azure_storage_doc/docs/kubernetes_azureStorage.md so that a user can just execute it in one go, without copying and pasting the commands from the docs into a shell.