zhijian-liu / torchpack

A neural network training interface based on PyTorch, with a focus on flexibility
https://pypi.org/project/torchpack/
MIT License

torchpack usage on multiple nodes on slurm cluster #17

Closed kavisha725 closed 3 years ago

kavisha725 commented 3 years ago

Thanks for providing this package. I am able to use the torchpack dist-run -np ${_ngpu} command successfully on a SLURM cluster when using only 1 node. Could you please explain how to use this with multiple nodes? I assume it involves setting the --hosts parameter, but I haven't been able to figure out how to identify the allocated nodes from the SLURM script.

kavisha725 commented 3 years ago

I've figured it out. I'll add my SLURM script below and close this issue. I had to manually construct the --hosts parameter in the format required by launchers/drunner.py. Let me know if there's a more elegant way of doing this.

#!/bin/bash
#SBATCH --time=23:00:00
#SBATCH --mem=64gb
#SBATCH --nodes=3
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

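# Resolve the hostnames SLURM allocated to this job and build the host list
# in the hostname:slots format expected by launchers/drunner.py.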
node_1=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 1p)
echo "node_1="$node_1
node_2=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 2p)
echo "node_2="$node_2
node_3=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 3p)
echo "node_3="$node_3
_HOSTS="${node_1}:4,${node_2}:4,${node_3}:4"
echo "hosts="$_HOSTS

source ~/anaconda3/etc/profile.d/conda.sh
conda activate myenv

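# Total number of processes: 3 nodes x 4 GPUs per node.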
_NGPU=12

torchpack dist-run -np ${_NGPU} -H ${_HOSTS} -v python train.py
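
For reference, the host list can also be built in a loop for any number of allocated nodes. This is only a sketch, assuming 4 GPUs per node and the same hostname:slots format that launchers/drunner.py expects:

# Sketch of a generic variant: one hostname:slots entry per allocated node.
_HOSTS=""
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    _HOSTS="${_HOSTS}${node}:4,"
done
_HOSTS="${_HOSTS%,}"                   # drop the trailing comma
_NGPU=$((SLURM_JOB_NUM_NODES * 4))     # total processes = nodes x GPUs per node
torchpack dist-run -np ${_NGPU} -H ${_HOSTS} -v python train.py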

IAMShashankk commented 2 years ago

@kavisha725 @zhijian-liu What if I just want to use one node with multiple GPUs? I have a cluster with multiple nodes. Let's say I want to use one of the clusters, named ABC, which has 50 RTX6000 GPUs. To allocate multiple GPUs I use:

srun -K -N 1 --ntasks=1 --gpus-per-task=6 --cpus-per-gpu=2 -p RTX6000 --mem-per-gpu=30G

and to launch the train/test:

torchpack dist-run -np 6 python tools/test.py configs pretrained.pth --eval bbox

This results in a 'there are not enough slots...' error, whereas if I only use one GPU with -np 1 everything works fine.

kavisha725 commented 2 years ago

Could you try the same using a batch job (i.e., sbatch instead of srun) and let me know?
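For a single node, a batch script along the lines of the multi-node one above might look like this. It is only a sketch, assuming 6 GPUs on one node, the RTX6000 partition from your srun command, and a conda environment named myenv; on a single node no -H host list should be needed:

#!/bin/bash
#SBATCH --time=23:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:6
#SBATCH --partition=RTX6000
#SBATCH --mem-per-gpu=30G

source ~/anaconda3/etc/profile.d/conda.sh
conda activate myenv

# Match -np to the number of GPUs requested above.
torchpack dist-run -np 6 python tools/test.py configs pretrained.pth --eval bbox

The 'there are not enough slots' message usually comes from the underlying MPI launcher, so requesting one task per GPU (rather than --ntasks=1) may be the relevant difference, but that is an assumption about your setup.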

YoushaaMurhij commented 1 year ago

Hi! I am facing the same problem when np > 1. I have tried both srun and sbatch, and neither works. Any suggestions? Thanks!

zhijian-liu commented 1 year ago

@kavisha725 Do you have any idea on how to resolve this problem? Thanks!

kavisha725 commented 1 year ago

Hi @YoushaaMurhij, it's hard for me to answer this without knowing the specifics of your system, but I can point you toward how to debug it. Please double-check what resources SLURM has allocated to you, and that this information is fed into torchpack in the format required by launchers/drunner.py.
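For example, printing the allocation at the top of the batch script makes it easy to compare what SLURM actually granted against what torchpack is told. A minimal sketch using standard SLURM environment variables:

# Show exactly what SLURM granted to this job.
echo "nodelist:  $SLURM_JOB_NODELIST"
echo "num nodes: $SLURM_JOB_NUM_NODES"
echo "gpus:      $CUDA_VISIBLE_DEVICES"   # set by SLURM when GPUs are requested via --gres
scontrol show hostnames "$SLURM_JOB_NODELIST"

The -H argument then has to be hostname:slots pairs built from that output, e.g. nodeA:6 or nodeA:4,nodeB:4, with -np equal to the total number of slots.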

YoushaaMurhij commented 1 year ago

@kavisha725, thanks for your response! Here's my slurm_train script:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --partition=DGX-1v100
#SBATCH --nodes=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=100000
#SBATCH --gres=gpu:6
#SBATCH --job-name=bevnet
#SBATCH --mail-user=
#SBATCH --mail-type=END
#SBATCH --comment="---"
srun train.sh   

and in train.sh:

      set -x
      nvidia-smi;
      free -m;
      cd /home/trainer/BEVFusion ;

      ##### For BEVFusion detection model:
      torchpack dist-run -np 6 python3 tools/train.py \
            configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
            --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
            --load_from pretrained/lidar-only-det.pth 

Both nvidia-smi and free -m show what I allocated in the SBATCH directives.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   28C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:85:00.0 Off |                    0 |
| N/A   29C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   30C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   31C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   29C    P0    41W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
+ free -m
              total        used        free      shared  buff/cache   available
Mem:         515870        9350      291655         197      214864      503079
Swap:          4095           3        4092

Everything works fine when np = 1, so I assume the container is not the problem. Adding -H "$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 1p):6" did not help.

YoushaaMurhij commented 1 year ago

I could not solve this issue. I am now using torch.distributed.launch.

narimanmadani commented 1 year ago

@YoushaaMurhij Did you remove the torchpack library dependency, and did it work?

YoushaaMurhij commented 1 year ago

torch.distributed.launch works. I did not remove torchpack; I just replaced it with torch.distributed.launch. With newer PyTorch versions, use torchrun.

bbzh commented 1 year ago

torch.distributed.launch works. I did not remove torchpack; I just replaced it with torch.distributed.launch. With newer PyTorch versions, use torchrun.

Hi @YoushaaMurhij, would you please share the detailed steps for switching from torchpack to torch.distributed.launch? Thank you!

Estrellama commented 11 months ago

Hi @YoushaaMurhij, I ran into the same issue. Would you please share the detailed steps for switching from torchpack to torch.distributed.launch? Thanks!

YoushaaMurhij commented 11 months ago

Sorry, I no longer have the code at hand, but I followed the PyTorch documentation; step-by-step examples are available there.
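Roughly, the launcher swap is a command-line change like the following; this is only a sketch based on the torchpack command earlier in this thread, not a tested recipe:

# Old torchpack launcher:
#   torchpack dist-run -np 6 python3 tools/train.py <config> <extra args>
# With the (now deprecated) torch.distributed.launch:
#   python3 -m torch.distributed.launch --nproc_per_node=6 tools/train.py <config> <extra args>
# With torchrun on recent PyTorch:
torchrun --nproc_per_node=6 tools/train.py \
      configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
      --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
      --load_from pretrained/lidar-only-det.pth

The training script itself also needs to set up distributed training from the environment variables that torchrun provides (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), e.g. by calling torch.distributed.init_process_group, instead of relying on torchpack's launcher; those changes depend on the codebase, so follow the PyTorch distributed documentation for the details.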