Training stuck for MAT - Githubissues

yoctta / multiple-attention

The code of multi-attention deepfake detection


Training stuck for MAT #21

Closed by Elijah-Yi 2 years ago

Elijah-Yi commented 2 years ago

Hi @yoctta, thanks for your contribution. When I train MAT, the training gets stuck and never finishes. I have checked a lot of things but could not find any relevant information. The CPU still shows the process as running. Do you know what the problem is?

Elijah-Yi commented 2 years ago

I also found that the hang occurs at the loss backward step.
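For readers trying to locate a similar hang, one possible approach (not something described in this thread) is Python's standard `faulthandler` module, which can periodically dump the stack of every thread while the process is stuck. The helper names `watch_for_hang` and `stop_watching` below are hypothetical, and the sketch assumes it is called in each training process before the training loop starts.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s: float = 300.0, file=None) -> None:
    """Dump all thread stacks every `timeout_s` seconds until cancelled.

    `file` must be a real file with a file descriptor; it defaults to stderr.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def stop_watching() -> None:
    """Cancel the periodic dump, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

If the dump keeps showing the same frame inside the backward call, that supports the observation that the process is blocked there rather than still computing.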

yoctta commented 2 years ago

It seems like something went wrong with NCCL. I have not run into this problem myself.
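To check whether NCCL is involved, a minimal sketch is to turn on more verbose logging before the process group is initialized. `NCCL_DEBUG` is an NCCL environment variable; `TORCH_DISTRIBUTED_DEBUG` is only honored by reasonably recent PyTorch releases, so treat it as an assumption about the installed version. The function name is hypothetical and is not part of this repository.

```python
import os
from typing import MutableMapping


def enable_distributed_debug_logging(env: MutableMapping[str, str] = os.environ) -> None:
    """Request verbose NCCL / torch.distributed logs.

    Must run before the process group is created. Values that are already
    set in `env` are left untouched.
    """
    env.setdefault("NCCL_DEBUG", "INFO")
    env.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
```

Using `setdefault` keeps any value already exported in the shell, so the helper does not override a deliberate setting.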

Elijah-Yi commented 2 years ago

> It seems like something went wrong with NCCL. I have not run into this problem myself.

Thanks for your reply. I have already solved the hang. The problem is at line 303 in MAT.py: the gradients of `loss_pack2 = self.train_batch(Xaug, y, jump_aux=False)` are not calculated and updated.

yoctta commented 2 years ago

You will notice the commented-out lines around line 303 in MAT.py. They disable Sync_BN in the second forward pass to avoid bugs in some versions of PyTorch.
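The commented-out lines themselves are not quoted in this thread, so the following is only a sketch of one way such a workaround could look, not the repository's actual code. It assumes the model uses `torch.nn.SyncBatchNorm` layers with running statistics enabled (the default), and temporarily switches them to eval mode so they normalize with running statistics instead of synchronizing batch statistics across processes. The name `sync_bn_in_eval` and the attribute `self.net` in the usage comment are hypothetical.

```python
from contextlib import contextmanager

import torch.nn as nn


@contextmanager
def sync_bn_in_eval(model: nn.Module):
    """Temporarily put every SyncBatchNorm layer of `model` into eval mode."""
    layers = [m for m in model.modules() if isinstance(m, nn.SyncBatchNorm)]
    previous = [m.training for m in layers]
    for m in layers:
        m.eval()
    try:
        yield model
    finally:
        # Restore each layer to the mode it had before entering the block.
        for m, was_training in zip(layers, previous):
            m.train(was_training)


# Hypothetical usage around the second forward pass (self.net is assumed):
# with sync_bn_in_eval(self.net):
#     loss_pack2 = self.train_batch(Xaug, y, jump_aux=False)
```

Note that this changes how the second pass is normalized, so it is a trade-off rather than a drop-in fix.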