Closed Elijah-Yi closed 2 years ago
I also found that the hang occurs at the loss backward call.
It seems like something went wrong with NCCL. I have not run into this problem myself.
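For anyone hitting a similar hang, here is a minimal sketch of how one might gather more information before digging into the model code. It is not part of the MAT repository: the helper name `enable_hang_diagnostics` and the 600-second default are assumptions made for illustration. It only sets the `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` environment variables and uses Python's standard `faulthandler` module to dump stack traces periodically, which can help confirm whether the process is really waiting inside the backward pass.

```python
import faulthandler
import os
import sys


def enable_hang_diagnostics(timeout_s: float = 600.0) -> dict:
    """Hypothetical helper: verbose NCCL logging plus periodic stack dumps."""
    settings = {
        # Read by NCCL itself; INFO logs communicator setup and errors.
        "NCCL_DEBUG": "INFO",
        # Read by torch.distributed; enables extra checks on collectives.
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    }
    for key, value in settings.items():
        # Keep any value the launcher or the shell already set.
        os.environ.setdefault(key, value)
    # If the process is still alive after timeout_s seconds, print the Python
    # stack of every thread to stderr, then repeat at the same interval.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=sys.stderr)
    return {key: os.environ[key] for key in settings}
```

The call would need to happen at the very top of the training script, before the distributed process group is created, since the environment variables are read at initialization. If every rank's dump shows the same line in the backward call, a collective operation that some ranks never reach is a likely suspect.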
Thanks for your reply. I have already solved the hang. At line 303 in MAT.py, the gradients of loss_pack2=self.train_batch(Xaug,y,jump_aux=False) are not calculated or applied in an update.
You will notice the commented-out lines around line 303 in MAT.py. They disable Sync_BN in the second forward pass to avoid bugs in some versions of PyTorch.
Hi @yoctta, thanks for your contributions. When I train MAT, the training gets stuck and never finishes. I have checked a lot but could not find any relevant information. The CPU shows the process as running. Do you know what the problem is?