chenweifu2008 opened 4 years ago
I met the same problem.
@CBIR-LL how did you solve it, buddy?
@chenweifu2008 I stopped the training process and resumed training.
I don't think it is actually stuck. It should be the validation. Can you check nvidia-smi while it appears stuck to see if the GPU is working?
I set the number of epochs to 10 and there are only 10 images in my validation dataset. In the last round of validation, the figure above appears [the GPU takes up about 6 GB of memory, but the utilization rate is always 0]. Is the program stuck, or has it finished? @zylo117
In that case, that is another issue. The validation should finish instantly, but it is stuck at the dataloader. PyTorch's dataloader will stall for a moment at every epoch, or whenever it runs out of data. I guess it is reloading? But it's not crashing; given time, a few minutes at most, the training will continue.

I am also suffering from the same problem here. For now, try a smaller batch_size and a smaller num_workers; it works.
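The suggested workaround can be sketched like this. Note this is a minimal illustration with a dummy dataset standing in for the project's real one; the dataset, shapes, and exact batch sizes here are assumptions, not the repo's actual training code.

```python
# Sketch: shrink batch_size and num_workers so the per-epoch worker
# respawn (the "stall" between epochs) is cheaper.
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy tensors standing in for real images/annotations (hypothetical shapes)
images = torch.randn(32, 3, 64, 64)
labels = torch.zeros(32, dtype=torch.long)
dataset = TensorDataset(images, labels)

if __name__ == "__main__":
    loader = DataLoader(
        dataset,
        batch_size=4,    # smaller batch reduces memory pressure
        num_workers=2,   # fewer workers -> cheaper respawn at epoch boundaries
        shuffle=True,
    )
    for imgs, targets in loader:
        # the training/validation step would go here
        pass
```

On newer PyTorch versions, `persistent_workers=True` on the `DataLoader` also helps, since it keeps worker processes alive across epochs instead of respawning them.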
Also, I just found out you are running PyTorch on Windows, which is not recommended, especially when num_workers is greater than 0.
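The reason Windows is tricky: DataLoader workers there are started with "spawn", which re-imports the launching script, so a script without an entry-point guard can misbehave or hang. A minimal sketch of the safe pattern (the dataset here is a made-up stand-in):

```python
# On Windows, worker processes re-import this module. The __main__ guard
# prevents the script body from re-running in each worker; num_workers=0
# sidesteps worker processes entirely and is the safest setting there.
import torch
from torch.utils.data import DataLoader, TensorDataset

def build_loader(num_workers: int) -> DataLoader:
    # hypothetical toy data: 16 samples, batch_size 4 -> 4 batches
    data = TensorDataset(torch.randn(16, 3, 32, 32),
                         torch.zeros(16, dtype=torch.long))
    return DataLoader(data, batch_size=4, num_workers=num_workers)

if __name__ == "__main__":
    loader = build_loader(num_workers=0)  # in-process loading, no spawn issues
    n_batches = sum(1 for _ in loader)
    print(n_batches)
```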
Why does my training get stuck on some epochs for so long without moving on to the next epoch?