torchpack usage on multiple nodes on slurm cluster

Thanks for providing this package. I can use the torchpack dist-run -np ${_ngpu} command on a slurm cluster when using only 1 node. Could you please explain how to use it with multiple nodes? I assume it involves setting the --hosts parameter, but I can't figure out how to identify the allocated nodes from the slurm script.

I've figured it out. I'll add my slurm script below and close this issue. I had to manually build the --hosts parameter in the format required by launchers/drunner.py. Let me know if there's a more elegant way of doing this.

#SBATCH --time=23:00:00
#SBATCH --mem=64gb
#SBATCH --nodes=3
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

node_1=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 1p)
echo "node_1="$node_1
node_2=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 2p)
echo "node_2="$node_2
node_3=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 3p)
echo "node_3="$node_3
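# _HOSTS and _NGPU are not defined in the script as posted. The two lines below are an
# assumed reconstruction: a host:slots list with 4 GPUs on each of the 3 nodes.
_HOSTS="${node_1}:4,${node_2}:4,${node_3}:4"
_NGPU=12
echo "hosts="$_HOSTS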

source ~/anaconda3/etc/profile.d/conda.sh
conda activate myenv


torchpack dist-run -np ${_NGPU} -H ${_HOSTS} -v python train.py
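
A possibly tidier way to build the host list, without one variable per node: a minimal sketch, assuming the comma-separated host:slots format passed to -H in this thread. build_hosts is a hypothetical helper name (not part of torchpack or Slurm); it reads one hostname per line, as scontrol show hostnames prints them, and takes the number of GPUs per node as its argument.

```shell
#!/bin/bash
# build_hosts: hypothetical helper, not part of torchpack or Slurm.
# Reads one hostname per line on stdin and prints "host:slots,host:slots,...".
build_hosts() {
    local slots="$1"
    local hosts=""
    local node
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r node || [ -n "$node" ]; do
        [ -z "$node" ] && continue                    # skip blank lines
        hosts="${hosts:+${hosts},}${node}:${slots}"   # comma only between entries
    done
    printf '%s\n' "$hosts"
}

# Only query Slurm when running inside an allocation.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    _GPUS_PER_NODE=4                                  # matches --gres=gpu:4 above
    _HOSTS=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | build_hosts "$_GPUS_PER_NODE")
    _NGPU=$(( SLURM_JOB_NUM_NODES * _GPUS_PER_NODE )) # total processes across all nodes
    echo "hosts=${_HOSTS} np=${_NGPU}"
fi
```

With this, changing --nodes in the SBATCH header should not require touching the rest of the script, since the node count comes from SLURM_JOB_NUM_NODES.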

@kavisha725 @zhijian-liu What if I just want to use one node with multiple GPUs? I have a cluster with multiple nodes. Let's say I want to use one of them, named ABC, which has 50 RTX6000 GPUs. To allocate multiple GPUs I use the command:

srun -K -N 1 --ntasks=1 --gpus-per-task=6 --cpus-per-gpu=2 -p RTX6000 --mem-per-gpu=30G

and to launch the train/test:

torchpack dist-run -np 6 python tools/test.py configs pretrained.pth --eval bbox

This results in the 'there are not enough slots....' error, whereas with only one GPU (-np 1) everything works fine.

Could you try the same using a batch job (ie. sbatch instead of srun) and let me know?

Hi! I am facing the same problem when np > 1. I have tried both srun and sbatch, and neither works. Any suggestions? Thanks!

@kavisha725 Do you have any idea on how to resolve this problem? Thanks!

Hi @YoushaaMurhij, it's hard for me to answer this without knowing the specifics of your system, but I can point you towards how to debug it. Please double-check what resources slurm has allocated to you and that this information is passed to torchpack in the format required by launchers/drunner.py.
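
As one way to run that check: the "not enough slots" message is typical of an MPI launcher being asked for more processes than the slots it believes it has. A minimal sketch, assuming the launcher derives its slot count from the number of Slurm tasks (SLURM_NTASKS); that is an assumption about your setup, not something confirmed in this thread. check_slots is a hypothetical helper.

```shell
#!/bin/bash
# check_slots: hypothetical helper. Compares the requested process count
# with the number of tasks Slurm reports for this job.
check_slots() {
    local np="$1"
    local ntasks="${SLURM_NTASKS:-1}"   # assume 1 task when the variable is unset
    if [ "$np" -gt "$ntasks" ]; then
        echo "requested ${np} processes, but Slurm reports ${ntasks} task(s)"
        return 1
    fi
    echo "requested ${np} processes, Slurm reports ${ntasks} task(s)"
}

# Print what Slurm actually allocated, then compare against the -np value.
echo "nodes=${SLURM_JOB_NODELIST:-unset} tasks_per_node=${SLURM_TASKS_PER_NODE:-unset}"
check_slots 6 || echo "np exceeds the Slurm task count"
```

Note that the multi-node script earlier in this thread, which worked, requests --ntasks-per-node=4 alongside --gres=gpu:4, whereas the failing srun command requests --ntasks=1 with 6 GPUs. Whether that difference is the cause here is untested.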

@kavisha725, Thanks for your response! Here's my slurm_train script:

#SBATCH --ntasks=1
#SBATCH --partition=DGX-1v100
#SBATCH --nodes=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=100000
#SBATCH --gres=gpu:6
#SBATCH --job-name=bevnet
#SBATCH --mail-user=
#SBATCH --mail-type=END
#SBATCH --comment="---"
srun train.sh   

and in train.sh:

      set -x
      free -m;
      cd /home/trainer/BEVFusion ;

      ##### For BEVFusion detection model:
      torchpack dist-run -np 6 python3 tools/train.py \
            configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
            --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
            --load_from pretrained/lidar-only-det.pth 

Both nvidia-smi and free -m are showing what I have allocated in SBATCH.

And everything works fine when np = 1. So, I assume that the container should not be a problem. Adding -H "$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 1p):6" did not help.

I could not solve this issue. I am now using torch.distributed.launch.

@YoushaaMurhij Did you remove torchpack lib dependencies and did it work?

torch.distributed.launch works. I did not remove torchpack; I just replaced it with torch.distributed.launch. With newer PyTorch versions, use torchrun.
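
For anyone making the same switch, a minimal single-node sketch of the launch line, not the exact command used here. It assumes the training script reads the RANK, LOCAL_RANK and WORLD_SIZE environment variables that torchrun sets; a script that relies on torchpack's own distributed setup would likely need its initialization adapted, which is not shown. launch_cmd is a hypothetical helper that only prints the command so it can be inspected before running.

```shell
#!/bin/bash
# launch_cmd: hypothetical helper that prints a single-node torchrun command.
#   $1 = number of GPUs (one process per GPU)
#   remaining arguments = training script and its own arguments
launch_cmd() {
    local ngpu="$1"
    shift
    # --standalone: single-node run, no external rendezvous endpoint needed
    # --nproc_per_node: number of worker processes to start on this node
    echo "torchrun --standalone --nproc_per_node=${ngpu} $*"
}

launch_cmd 6 tools/train.py
# On older PyTorch versions the corresponding launcher is:
#   python -m torch.distributed.launch --nproc_per_node=6 tools/train.py
```

One difference to keep in mind: torch.distributed.launch passes a --local_rank argument to the script by default, while torchrun exposes the local rank through the LOCAL_RANK environment variable.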

Hi @YoushaaMurhij, would you please share detailed steps for the change from torchpack to torch.distributed.launch? Thank you!

Hi @YoushaaMurhij, I ran into the same problem. Would you please share the detailed steps for changing from torchpack to torch.distributed.launch? Thanks!

Sorry, I do not have the code in hand anymore. But I followed PyTorch documentation. Step-by-step examples are available.