Closed kavisha725 closed 3 years ago
I've figured it out. I'll add my slurm script below and close this issue. I had to manually build the --hosts parameter in the format required by launchers/drunner.py. Let me know if there's a more elegant way of doing this.
#!/bin/bash
#SBATCH --time=23:00:00
#SBATCH --mem=64gb
#SBATCH --nodes=3
#SBATCH --cpus-per-task=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
node_1=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 1p)
echo "node_1="$node_1
node_2=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 2p)
echo "node_2="$node_2
node_3=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 3p)
echo "node_3="$node_3
_HOSTS="${node_1}:4,${node_2}:4,${node_3}:4"
echo "hosts="$_HOSTS
source ~/anaconda3/etc/profile.d/conda.sh
conda activate myenv
_NGPU=12
torchpack dist-run -np ${_NGPU} -H ${_HOSTS} -v python train.py
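On the question of a more elegant way: the per-node lines above could be replaced by a small helper that turns the expanded node list into the -H string for any node count. This is a sketch; build_hosts is a hypothetical helper name, and the scontrol line at the bottom assumes it runs inside a SLURM allocation.

```shell
# Hypothetical helper: read hostnames (one per line) on stdin and emit
# the torchpack -H string "host1:N,host2:N,..." for a fixed GPU count N.
build_hosts() {
  local gpus_per_node=$1
  # Join lines with commas, then append ":N" to every hostname.
  paste -sd, - | sed "s/,/:${gpus_per_node},/g; s/\$/:${gpus_per_node}/"
}

# Inside a SLURM job this would replace the manual node_1/node_2/node_3 lines:
#   _HOSTS=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | build_hosts 4)
```

This removes the hard-coded assumption of exactly three nodes, so the same script works when --nodes changes.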
@kavisha725 @zhijian-liu What if I just want to use one node with multiple GPUs?
I have a cluster with multiple nodes. Let's say I want to use one of the clusters, named ABC, which has 50 RTX6000 GPUs.
To allocate multiple GPUs I use the command:
srun -K -N 1 --ntasks=1 --gpus-per-task=6 --cpus-per-gpu=2 -p RTX6000 --mem-per-gpu=30G
and to launch train/test:
torchpack dist-run -np 6 python tools/test.py configs pretrained.pth --eval bbox
This results in a 'there are not enough slots...' error, whereas if I use only one GPU with -np 1, everything works fine.
Could you try the same using a batch job (i.e. sbatch instead of srun) and let me know?
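For reference, a minimal single-node batch script of that shape might look like the following sketch. The partition name, GPU count, and test command are just the values from the srun/torchpack lines above; the key point is that -np must match the number of GPUs granted by --gres.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-gpu=2
#SBATCH --gres=gpu:6
#SBATCH --partition=RTX6000
#SBATCH --mem-per-gpu=30G

# -np 6 matches the six GPUs requested via --gres=gpu:6 above.
torchpack dist-run -np 6 python tools/test.py configs pretrained.pth --eval bbox
```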
Hi! I am facing the same problem when np > 1. I have tried both srun and sbatch, and neither works!
Any suggestions?
Thanks!
@kavisha725 Do you have any idea on how to resolve this problem? Thanks!
Hi @YoushaaMurhij, it's hard for me to answer this without knowing the specifics of your system, but I can point you towards how to debug. Please double check what resources you are allocated from slurm and that this information is fed into torchpack in the format required by launchers/drunner.py.
@kavisha725, Thanks for your response! Here's my slurm_train script:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --partition=DGX-1v100
#SBATCH --nodes=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=100000
#SBATCH --gres=gpu:6
#SBATCH --job-name=bevnet
#SBATCH --mail-user=
#SBATCH --mail-type=END
#SBATCH --comment="---"
srun train.sh
and in train.sh:
set -x
nvidia-smi;
free -m;
cd /home/trainer/BEVFusion ;
##### For BEVFusion detection model:
torchpack dist-run -np 6 python3 tools/train.py \
configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml \
--model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
--load_from pretrained/lidar-only-det.pth
Both nvidia-smi and free -m show the resources I allocated via SBATCH.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:0A:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:0B:00.0 Off | 0 |
| N/A 28C P0 43W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:85:00.0 Off | 0 |
| N/A 29C P0 42W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:86:00.0 Off | 0 |
| N/A 30C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 31C P0 42W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 29C P0 41W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
+ free -m
total used free shared buff/cache available
Mem: 515870 9350 291655 197 214864 503079
Swap: 4095 3 4092
And everything works fine when np = 1. So, I assume that the container should not be a problem.
Adding -H "$(scontrol show hostnames "$SLURM_JOB_NODELIST" | sed -n 1p):6" did not help.
I could not solve this issue. I am now using torch.distributed.launch.
@YoushaaMurhij Did you remove torchpack lib dependencies and did it work?
torch.distributed.launch works. I did not remove torchpack; I just replaced it with torch.distributed.launch. With newer PyTorch versions, use torchrun.
Hi @YoushaaMurhij , would you please share detailed steps for the change from torchpack to torch.distributed.launch? Thank you!
Hi @YoushaaMurhij, I ran into the same issue. Would you please share the detailed steps for changing from torchpack to torch.distributed.launch? Thanks!
Sorry, I no longer have the code at hand, but I followed the PyTorch documentation; step-by-step examples are available there.
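For anyone looking for the rough shape of that change: a single-node torchpack command like the one earlier in this thread maps approximately to the torchrun invocation below. This is a sketch, not the exact code used here; it assumes tools/train.py initializes torch.distributed from the environment variables torchrun sets (RANK, LOCAL_RANK, WORLD_SIZE), which may require small changes to the training script.

```shell
# Approximate single-node equivalent of `torchpack dist-run -np 6 python tools/train.py ...`.
# --standalone runs a local rendezvous; --nproc_per_node spawns one process per GPU.
torchrun --standalone --nproc_per_node=6 tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml
```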
Thanks for providing this package. I can successfully use the torchpack dist-run -np ${_ngpu} command on a slurm cluster when using only 1 node. Could you please explain how to use this with multiple nodes? I assume it involves setting the --hosts parameter, but I'm not able to figure out how to identify the allocated nodes from the slurm script.