Why is my training stuck on some epoch?

zylo117 / Yet-Another-EfficientDet-Pytorch

The pytorch re-implement of the official efficientdet with SOTA performance in real time and pretrained weights.

GNU Lesser General Public License v3.0


Why is my training stuck on some epoch? #372

Open chenweifu2008 opened 4 years ago

chenweifu2008 commented 4 years ago

Why does my training stay stuck on one epoch for so long without moving on to the next epoch? [Screenshot from 2020-06-04 14-09-29]

Li-Lai commented 4 years ago

I ran into the same problem.

chenweifu2008 commented 4 years ago

@CBIR-LL how did you solve it, buddy?

Li-Lai commented 4 years ago

@chenweifu2008 I stopped the training process and resumed training.

zylo117 commented 4 years ago

I don't think it is stuck. It is probably running validation. Can you check nvidia-smi while it appears stuck, to see whether the GPUs are working?
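For anyone who wants to log this instead of watching nvidia-smi by hand, here is a minimal sketch. It assumes nvidia-smi is on the PATH; the helper names (`parse_gpu_stats`, `read_gpu_stats`, `looks_idle`) are made up for illustration and are not part of this repo.

```python
import subprocess

# Ask nvidia-smi for GPU utilization (%) and used memory (MiB), one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert one CSV field to int; return None for values such as '[N/A]'."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_gpu_stats(text):
    """Parse nvidia-smi CSV output into a list of (utilization, memory_used) tuples."""
    stats = []
    for line in text.strip().splitlines():
        util, mem = line.split(",")[:2]
        stats.append((_to_int(util), _to_int(mem)))
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


def looks_idle(stats):
    """True when every GPU reports 0% utilization in this single sample."""
    return bool(stats) and all(util == 0 for util, _ in stats)
```

Calling `read_gpu_stats()` every few seconds during the apparent hang shows whether utilization stays at 0 while memory remains allocated. A single idle sample proves nothing, since validation and data loading can leave short gaps.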

Li-Lai commented 4 years ago

I set the number of epochs to 10, and there are only 10 images in my validation dataset. In the last round of validation, the situation in the figure above appears (the GPU holds about 6 GB of memory, but utilization stays at 0). Is the program stuck or has it run out? @zylo117

zylo117 commented 4 years ago

In that case, that is a different issue. The validation should finish almost instantly, but it is stuck at the dataloader. PyTorch's DataLoader stalls for a moment at every epoch, or whenever it runs out of data. I guess it is reloading? But it's not crashing; given time, a few minutes at most, the training will continue. I am suffering from the same problem here. For now, try a smaller batch size and a smaller num_workers; that works.

Also, I just found out you are running PyTorch on Windows, which is not recommended, especially when num_workers is greater than 0.
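The two suggestions above (fewer workers, and no worker processes on Windows) could be sketched roughly like this. The helper `conservative_loader_kwargs` is hypothetical and not part of this repo. Adding `persistent_workers` is also my assumption, not something confirmed in this thread: the option exists in `torch.utils.data.DataLoader` from PyTorch 1.7 onward and keeps worker processes alive between epochs, which may shorten the pause at epoch boundaries.

```python
import sys


def conservative_loader_kwargs(batch_size, num_workers, platform=sys.platform):
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper: it only assembles a dict and does not import torch.
    """
    # On Windows, fall back to loading data in the main process.
    if platform.startswith("win"):
        num_workers = 0

    kwargs = {"batch_size": batch_size, "num_workers": num_workers}

    # persistent_workers is only valid when num_workers > 0;
    # DataLoader raises a ValueError otherwise.
    if num_workers > 0:
        kwargs["persistent_workers"] = True
    return kwargs
```

The result would be passed as `DataLoader(dataset, **conservative_loader_kwargs(4, 2))`, with a batch size and worker count smaller than the ones that triggered the stall. On PyTorch versions older than 1.7, drop the `persistent_workers` key.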