workflows-community / june_2023_hackathon

Work with other Nextflow developers on your own work projects or community challenges.

Sanger nf-core config #1

Open priyanka-surana opened 1 year ago

priyanka-surana commented 1 year ago

Currently, there is a limited nf-core config for the Sanger farm – https://github.com/nf-core/configs/blob/master/conf/sanger.config.

This ticket is to update the config to include the different queues and add compatibility for GPUs.

Information on Queues:

priyanka-surana commented 1 year ago

Information about GPUs from Martin Prete, Cellular Genetics

Farm5 has the following GPU queues:

QUEUE          MEMLIMIT   RUNLIMIT
gpu-normal     683.5 G    720.0 min
gpu-basement   683.5 G    20160.0 min
gpu-huge       683.5 G    720.0 min
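For illustration, a minimal sketch of how these queues might be selected in the Nextflow config; the 'use_gpu' label and the runtime-based queue choice are assumptions for this sketch, not part of the existing Sanger config:

// Sketch only: route GPU tasks to a queue based on the requested time limit.
// Queue names come from the table above; the 'use_gpu' label is hypothetical.
process {
    withLabel: 'use_gpu' {
        queue = { task.time > 12.h ? 'gpu-basement' : 'gpu-normal' }
    }
}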

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility built on top of the NVIDIA Management Library (NVML), and it's a quick way of getting stats for GPUs. Think of it as the ps or top command, but for GPUs. The following examples are based on that command and the Docker image nvidia/cuda:11.3.1-runtime-ubuntu20.04, so you'll need to pull that first (or adapt it to your Nextflow setup for testing):

singularity pull nvidia-cuda-11.3.1-runtime-ubuntu20.04.sif docker://nvidia/cuda:11.3.1-runtime-ubuntu20.04

Singularity containerOptions: Singularity needs the [--nv](https://docs.sylabs.io/guides/3.8/user-guide/gpu.html) option to use GPUs and to set up the container's environment with the basic CUDA libraries (mostly to talk to the GPU driver). It's easy to miss and then wonder "why am I not using them?". My guess is that it should be added to the container options. As far as I know, if you run Singularity with --nv on a host without GPUs you won't get an error, just the info message "INFO: Could not find any nv files on this host!". If you've used GPUs with Docker containers in the past, think of --nv as Docker's --gpus all. Example:

singularity run --nv nvidia-cuda-11.3.1-runtime-ubuntu20.04.sif nvidia-smi

If you get "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.", that means the host doesn't have GPUs (or the NVIDIA driver was updated but not fully, so the host needs rebooting, funny innit?).
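A minimal sketch of how --nv could be wired into the Nextflow config, assuming a hypothetical 'use_gpu' label for GPU-enabled processes (a suggestion rather than the agreed final config):

// Sketch only: pass --nv to Singularity so labelled processes can see the host GPUs.
// The 'use_gpu' label is a hypothetical name used for illustration.
process {
    withLabel: 'use_gpu' {
        containerOptions = '--nv'
    }
}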

LSF bsub -gpu: When submitting a job via bsub that requires a GPU, the properties of the GPU resources required by the job need to be explicitly told to LSF. That's done with "-gpu". The bare minimum you need to add looks like: bsub -gpu script.sh. However, it's recommended to also specify the number of GPUs you'll be using and the memory, so it ends up looking something like bsub -gpu "num=1:gmem=8000" (gmem is optional; if not specified, you'll be able to use as much as is free on the GPU). A functional example using the previous singularity command would be something like:

bsub -q gpu-normal \
-n1 \
-M2000 \
-R"select[mem>2000] rusage[mem=2000]" \
-gpu "num=1:gmem=4000" \
-Is singularity run --nv nvidia-cuda-11.3.1-runtime-ubuntu20.04.sif nvidia-smi
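In a Nextflow config, these bsub options would typically be passed through clusterOptions; a hedged sketch, with the gmem value and the 'use_gpu' label chosen purely for illustration:

// Sketch only: forward the LSF GPU request shown above via clusterOptions.
// The gmem value and the 'use_gpu' label are illustrative assumptions.
process {
    withLabel: 'use_gpu' {
        queue          = 'gpu-normal'
        clusterOptions = '-gpu "num=1:gmem=8000"'
    }
}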

Some additional trickery here. Although 80% of the hosts have the same GPU model (all V100s), some of them have 16 GB of GPU RAM and others 32 GB. The list goes like this:

QUEUE                       HOST           GPU
gpu-normal                  farm5-gpu0101  Tesla V100-SXM2-32GB
gpu-normal                  farm5-gpu0103  Tesla V100-SXM2-32GB
gpu-normal & gpu-basement   farm5-gpu0102  Tesla V100-SXM2-32GB
gpu-normal & gpu-basement   farm5-gpu0104  Tesla V100-SXM2-32GB
gpu-normal & gpu-basement   farm5-gpu0105  Tesla V100-SXM2-32GB
gpu-huge                    dgx-c11-01     Tesla V100-SXM2-32GB
gpu-huge                    dgx-c11-02     Tesla V100-SXM2-16GB

I don't think that's something that needs to go in the config, but it's something for you to keep in mind if you start seeing "CUDA couldn't reserve memory" issues.

priyanka-surana commented 1 year ago

Discussion on the --no-home option by Matthieu Muffato, Tree of Life and Martin Prete, Cellular Genetics

The other issue was about the --no-home option clashing with -B /nfs, and that bind allowing user-installed stuff to make its way sneakily into the environment. I thought --no-home was smarter than just "don't auto-mount home into the container". I was wrong, it's not. The workaround I can think of to have both things is this:

singularity.runOptions = '--bind /lustre --bind /nfs --bind /tmp:/nfs/users'

That binds /nfs so we get all the goodies, and then binds /tmp to /nfs/users, effectively making all the home folders unavailable from the bound path while writes to your home folder end up under "/tmp/nfs_x/xxx/".

However, if you want none of that, you can bring back --no-home and use an empty folder like /mnt:

singularity.runOptions = '--bind /lustre --bind /nfs --no-home --bind /mnt:/nfs/users'

That way you'd get a read-only empty folder mounted on /nfs/users; no trace of /nfs_x/xxx anywhere.

Regarding --no-home, the difference is whether $HOME is read+write or read-only in the container. In both cases, we're still making it empty.

I've found that Singularity has a --home option to specify which directory should be considered the home directory. So I think another workaround would be:

singularity.runOptions = '--bind /lustre --bind /nfs --home /tmp'

which changes $HOME, how ~ is substituted, etc. The original home directory in /nfs/users is still visible, but it has no meaning.
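Put into the singularity scope of the config, that last suggestion could look like the sketch below; enabled = true is assumed here, and the bind paths are the ones discussed above:

// Sketch only: the --home workaround expressed as a full singularity scope.
// 'enabled = true' is an assumption; bind paths are those from the discussion.
singularity {
    enabled    = true
    runOptions = '--bind /lustre --bind /nfs --home /tmp'
}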