Hi, please add -c "checkpoint_pth" to the training command, e.g.:

NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES="0,1,2,3" python -m torch.distributed.launch --nproc_per_node=4 --master_port 29502 train.py -p 29502 -d 0,1,2,3 -n "dataset_name" -c "checkpoint_pth"

Thanks.
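For reference, below is a minimal sketch of how a -c/--checkpoint flag is typically wired into a PyTorch training script to resume from a saved state. The checkpoint keys ("model", "optimizer", "epoch"), the stand-in nn.Linear model, and the flag name --checkpoint are assumptions for illustration only; the actual argument handling and checkpoint layout in train.py may differ.

```python
# Sketch of the common "resume from checkpoint" pattern, assuming the
# checkpoint is a dict saved with torch.save({"model": ..., "optimizer": ...,
# "epoch": ...}). Keys and names here are illustrative, not the repo's API.
import argparse
import torch
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("-c", "--checkpoint", default=None,
                    help="path to a .pth checkpoint to resume from")
args = parser.parse_args()

model = nn.Linear(10, 2)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
start_epoch = 0

if args.checkpoint is not None:
    # Restore weights, optimizer state, and the epoch counter so training
    # continues from where the checkpoint left off.
    ckpt = torch.load(args.checkpoint, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt.get("epoch", 0) + 1

for epoch in range(start_epoch, 100):
    pass  # training loop resumes at start_epoch instead of 0
```

Passing -c "checkpoint_pth" in the command above would take the load branch, so training continues from the saved epoch rather than starting from scratch.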
Thanks, it works! Closing the issue
Hi, thanks for this amazing repo and work! Could you please guide me on how to resume training from a checkpoint?