priyanka-surana opened this issue 1 year ago
Information about GPUs from Martin Prete, Cellular Genetics
Farm5 has the following GPU queues:

| QUEUE | MEMLIMIT | RUNLIMIT |
| --- | --- | --- |
| gpu-normal | 683.5 G | 720.0 min |
| gpu-basement | 683.5 G | 20160.0 min |
| gpu-huge | 683.5 G | 720.0 min |

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility built on top of the NVIDIA Management Library (NVML), and it's a quick way of getting stats from GPUs. Think of it as the ps or top command, but for GPUs. The following examples are based on that command and the Docker image nvidia/cuda:11.3.1-runtime-ubuntu20.04, so you'll first need to pull that (or adapt it to your Nextflow setup for testing):
```shell
singularity pull nvidia-cuda-11.3.1-runtime-ubuntu20.04.sif docker://nvidia/cuda:11.3.1-runtime-ubuntu20.04
```
Singularity containerOptions: Singularity needs the [--nv](https://docs.sylabs.io/guides/3.8/user-guide/gpu.html) option to use GPUs; it sets up the container's environment with the basic CUDA libraries (mostly to talk to the GPU driver). It's easy to miss and then wonder "why am I not using the GPUs?". My guess is that it should be added to containerOptions. As far as I know, if you run Singularity with --nv on a host without GPUs you won't get an error, just the info message "INFO: Could not find any nv files on this host!". If you've used GPUs with Docker containers in the past, think of --nv as Docker's --gpus all.
Example:

```shell
singularity run --nv nvidia-cuda-11.3.1-runtime-ubuntu20.04.sif nvidia-smi
```

If you get "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.", that means the host doesn't have GPUs (or the NVIDIA driver was updated but not fully, so the host needs rebooting, funny innit?).
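If you're scripting around this, the two failure messages above can be told apart by matching on the command's output. A minimal sketch (the gpu_status helper is hypothetical, not part of any tool):

```shell
# Hypothetical helper: classify the output of `nvidia-smi` (or of the
# `singularity run --nv ... nvidia-smi` call above) by the messages
# described in this thread.
gpu_status() {
    case "$1" in
        *"couldn't communicate with the NVIDIA driver"*) echo "no-driver" ;;
        *"Could not find any nv files on this host"*)    echo "no-nv-files" ;;
        *)                                               echo "ok" ;;
    esac
}

gpu_status "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver."
# → no-driver
```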
LSF bsub -gpu: When submitting a job via bsub that requires a GPU, the properties of the GPU resources the job needs must be told to LSF explicitly. That's done with "-gpu". The bare minimum looks like:

```shell
bsub -gpu - script.sh
```

(the `-` requests the default GPU resource requirements).
However, it's recommended to also specify the number of GPUs you'll be using and the GPU memory, so it ends up looking something like:

```shell
bsub -gpu "num=1:gmem=8000"
```

(gmem is optional; if not specified, you'll be able to use as much GPU memory as is free on the GPU.) A functional example using the previous Singularity command would be something like:

```shell
bsub -q gpu-normal \
  -n1 \
  -M2000 \
  -R "select[mem>2000] rusage[mem=2000]" \
  -gpu "num=1:gmem=4000" \
  -Is singularity run --nv nvidia-cuda-11.3.1-runtime-ubuntu20.04.sif nvidia-smi
```
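For the nf-core config itself, the same submission options could be expressed as Nextflow process directives. A sketch only, assuming a hypothetical use_gpu label and reusing the memory value from the example above (none of this is in the current sanger.config):

```nextflow
// Hypothetical config fragment: the 'use_gpu' label and gmem value are
// assumptions for illustration, not an agreed convention.
process {
    withLabel: use_gpu {
        queue            = 'gpu-normal'
        clusterOptions   = '-gpu "num=1:gmem=4000"'
        containerOptions = '--nv'
    }
}
```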
Some additional trickery here. Although 80% of the hosts have the same GPU model (all V100s), some of them have 16 GB of GPU RAM and others 32 GB. The list goes like this:

| QUEUE | HOST | GPU |
| --- | --- | --- |
| gpu-normal | farm5-gpu0101 | Tesla V100-SXM2-32GB |
| gpu-normal | farm5-gpu0103 | Tesla V100-SXM2-32GB |
| gpu-normal & gpu-basement | farm5-gpu0102 | Tesla V100-SXM2-32GB |
| gpu-normal & gpu-basement | farm5-gpu0104 | Tesla V100-SXM2-32GB |
| gpu-normal & gpu-basement | farm5-gpu0105 | Tesla V100-SXM2-32GB |
| gpu-huge | dgx-c11-01 | Tesla V100-SXM2-32GB |
| gpu-huge | dgx-c11-02 | Tesla V100-SXM2-16GB |
I don't think that's something that needs to go in the config, but it's something to keep in mind if you start seeing "CUDA couldn't reserve memory" issues.
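If a wrapper script ever does need to guard against landing on the 16 GB host, a lookup based on the table above could work (a sketch; the function name is made up):

```shell
# GPU RAM per host, per the table above: only dgx-c11-02 has 16 GB of
# GPU RAM; every other listed host has 32 GB.
gpu_ram_gb() {
    case "$1" in
        dgx-c11-02) echo 16 ;;
        *)          echo 32 ;;
    esac
}

gpu_ram_gb dgx-c11-02      # → 16
gpu_ram_gb farm5-gpu0101   # → 32
```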
Discussion on the --no-home option by Matthieu Muffato, Tree of Life, and Martin Prete, Cellular Genetics
The other issue was about the --no-home option clashing with -B /nfs, which allows user-installed stuff to sneak its way into the environment. I thought --no-home was smarter than that, not just "don't auto-mount home into the container". I was wrong; it's not. The workaround I can think of to have both things is this:

```nextflow
singularity.runOptions = '--bind /lustre --bind /nfs --bind /tmp:/nfs/users'
```

That binds /nfs so we get all the goodies, and then binds /tmp over /nfs/users, effectively making all the home folders unavailable from the bound path while writing to your home folder under "/tmp/nfs_x/xxx/". However, if you want none of that, you can bring back --no-home and use an empty folder like /mnt:

```nextflow
singularity.runOptions = '--bind /lustre --bind /nfs --no-home --bind /mnt:/nfs/users'
```

That way you'd get a read-only empty folder mounted on /nfs/users; no trace of /nfs_x/xxx anywhere.
Regarding --no-home, the difference is whether $HOME is read+write or read-only in the container; in both cases, we're still making it empty. I've found that Singularity has a --home option to specify which directory should be considered the home directory. So I think another workaround would be:

```nextflow
singularity.runOptions = '--bind /lustre --bind /nfs --home /tmp'
```

which changes $HOME, how ~ is substituted, etc. The original home directory in /nfs/users is still visible, but it has no meaning.
Currently, there is a limited nf-core config for the Sanger farm – https://github.com/nf-core/configs/blob/master/conf/sanger.config.
This ticket is to update the config to include the different queues and add compatibility for GPUs.
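Given the run limits in the queue table (720 min on gpu-normal and gpu-huge, 20160 min on gpu-basement), queue selection could be made dynamic on the requested walltime. A sketch only; the label name and the exact threshold are assumptions to be agreed on:

```nextflow
// Hypothetical config fragment: route GPU jobs by requested time.
// Jobs over the 720 min (12 h) gpu-normal RUNLIMIT go to gpu-basement.
process {
    withLabel: use_gpu {
        queue = { task.time > 12.h ? 'gpu-basement' : 'gpu-normal' }
    }
}
```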
Information on Queues: