wfau / gaia-dmp

Gaia data analysis platform
GNU General Public License v3.0

Document and Automate the Setup & management of the Spark local directory #112

Open stvoutsin opened 4 years ago

stvoutsin commented 4 years ago

We need to document and automate how to set up, monitor and manage the Spark local directory. See https://spark.apache.org/docs/latest/configuration.html

spark.local.dir: "Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system."

This is where Zeppelin and Spark store intermediate files when running in client mode. If it is set to a temporary directory and that directory is cleared, attempts to run Zeppelin jobs that were started before the directory was cleared fail with exceptions.
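As a rough illustration of the setup step, here is a minimal Python sketch that creates a scratch directory and renders the matching line for spark-defaults.conf. The function name prepare_local_dir and the owner-only permissions are assumptions made for this example, not something we have agreed on. Note that the Spark documentation says spark.local.dir is overridden by the SPARK_LOCAL_DIRS (Standalone) or LOCAL_DIRS (YARN) environment variables when the cluster manager sets them.

```python
import os
from pathlib import Path


def prepare_local_dir(path: str) -> str:
    """Create the scratch directory and return a spark-defaults.conf line for it.

    The name and behaviour of this helper are hypothetical; it only
    illustrates the setup step discussed in this issue.
    """
    scratch = Path(path)
    # Create the directory (and any missing parents) if it is not there yet.
    scratch.mkdir(parents=True, exist_ok=True)
    # Assumption: restrict the directory to the account that runs Spark.
    os.chmod(scratch, 0o700)
    # spark-defaults.conf uses whitespace-separated "key value" lines.
    return f"spark.local.dir {scratch}"
```

The returned line could be appended to spark-defaults.conf by whatever deployment tooling we end up using.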

We know that setting the directory to a non-temporary location stops issues like https://github.com/wfau/aglais/issues/108 from being triggered. A possible solution is therefore to point this value at a directory that we manage ourselves. That directory will have to be cleared at regular intervals, because it can grow quickly. What we still need to figure out is how to clear this cache without reintroducing the problem we had with temporary directories, where Zeppelin expects to find intermediate files in this directory for open notebooks.
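One possible shape for the periodic clean-up, sketched in Python under a strong assumption: that "nothing under this entry has been modified for a while" is a usable proxy for "no open notebook still needs it". That assumption is untested here, and the function names and the age threshold are hypothetical.

```python
import shutil
import time
from pathlib import Path


def newest_mtime(entry):
    """Return the most recent modification time of an entry or anything beneath it."""
    latest = entry.lstat().st_mtime
    if entry.is_dir() and not entry.is_symlink():
        for child in entry.rglob("*"):
            latest = max(latest, child.lstat().st_mtime)
    return latest


def clear_stale_entries(local_dir, max_age_seconds, now=None):
    """Delete top-level entries that have had no modification within max_age_seconds.

    Returns the sorted names of the entries that were removed.
    Assumption: recent modification means the entry may still be in use.
    """
    now = time.time() if now is None else now
    removed = []
    for entry in Path(local_dir).iterdir():
        if now - newest_mtime(entry) > max_age_seconds:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

Something like this could run from a scheduled job, for example `clear_stale_entries("/path/to/spark/local", max_age_seconds=24 * 3600)`, where the path and threshold are placeholders. It would not protect a notebook that has been idle for longer than the threshold, so it does not fully answer the open question.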

Zarquan commented 4 years ago

If we implement the session booking system currently being discussed in the design document, we can link the /temp space to the lifetime of the booked session. When the session expires, we delete the /temp space?
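To make the idea concrete, a minimal Python sketch of scratch space tied to a session lifetime. The booking system does not exist yet, so the SessionScratch class, its method names, and the session ID format are all hypothetical.

```python
import re
import shutil
from pathlib import Path


class SessionScratch:
    """Map hypothetical session IDs to scratch directories under one root."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path_for(self, session_id):
        # Assumption: IDs are simple tokens; reject anything that could
        # escape the root directory (slashes, "..", empty strings).
        if not re.fullmatch(r"[A-Za-z0-9_-]+", session_id):
            raise ValueError(f"invalid session id: {session_id!r}")
        return self.root / session_id

    def open(self, session_id):
        """Create (or reuse) the scratch directory for a booked session."""
        path = self._path_for(session_id)
        path.mkdir(exist_ok=True)
        return path

    def expire(self, session_id):
        """Delete the scratch directory when the session expires.

        Returns True if a directory was removed, False if there was none.
        """
        path = self._path_for(session_id)
        if path.is_dir():
            shutil.rmtree(path)
            return True
        return False
```

The directory returned by `open()` would be the value handed to spark.local.dir for that session, and `expire()` would be called by whatever ends the booking.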

Zarquan commented 4 years ago

Should this be allocated once per user, once per session, or once per notebook?