This problem occurs because, in DDP mode, setting drop_last=True with a per-GPU batch size of 8 can leave one GPU with no data in the last iteration while the others still have a batch. This prevents the processes from synchronizing. You can work around it by setting drop_last=False. We will fix this bug as soon as possible. Also, it is highly recommended to set "--finetune ./Checkpoints/NTU-RGBD-32-DTNV2-TSM/model_best.pth.tar", which will help improve performance.
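For anyone hitting this before the fix lands, here is a minimal sketch of where the workaround applies. This is hypothetical, not the repo's actual loader code: it just shows a standard PyTorch DDP data pipeline and the drop_last flag in question.

```python
# Hypothetical sketch of the workaround, assuming a standard PyTorch DDP
# data pipeline; the repo's real loader construction may differ.
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_loader(dataset, batch_size):
    # DistributedSampler shards the dataset across ranks.
    sampler = DistributedSampler(dataset, shuffle=True)
    # Per the maintainer: if the per-rank shards end up uneven, then with
    # drop_last=True one rank can run one fewer iteration than the others.
    # The ranks then issue a different number of collectives (the gradient
    # all_reduce in DDP's backward pass), and NCCL blocks forever waiting
    # for the missing peer. drop_last=False keeps iteration counts equal.
    return DataLoader(dataset,
                      batch_size=batch_size,
                      sampler=sampler,
                      num_workers=4,
                      pin_memory=True,
                      drop_last=False)
```

As an aside, PyTorch also ships a Join context manager (torch.distributed.algorithms.join.Join) for training with uneven inputs, but setting drop_last=False is the simpler fix here.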
Thanks for your answer; the training process seems to be working now. Thanks a lot!
I'm trying to train an RGB model on the IsoGD dataset with the following script:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 --use_env train.py --config config/IsoGD.yml --data ~/dataset/IsoGD_imgs --splits data/dataset_splits/IsoGD/rgb --save ./train_IsoGD_rgb/ --batch-size 8 --sample-duration 32 --smprob 0.2 --mixup 0.8 --shufflemix 0.3 --epochs 100 --distill 0.2 --type M --intar-fatcer 2
But I got an NCCL error at the end of the first training epoch, with the following log: