zhen8838 / K210_Yolo_framework

Yolo v3 framework based on TensorFlow; supports multiple models, multiple datasets, any number of output layers, any number of anchors, model pruning, and porting models to the K210.
MIT License

Train hanging at the end of Epoch 1/10 #37

Open henryzhao321 opened 3 years ago

henryzhao321 commented 3 years ago

Following the instructions in the README.md in section Train at Point 1:

make train MODEL=yolo_mobilev1 DEPTHMUL=0.75 MAXEP=10 ILR=0.001 DATASET=voc CLSNUM=20 IAA=False BATCH=16

Training starts Epoch 1/10 and runs for about 2 hours with the ETA approaching 0s, but then hangs with the ETA at 6s. Ctrl-C has no effect, and 'top' shows no processor load. The log directory contains only an args.txt file and a train directory.

Example output (each progress line was cut off by the terminal before the next one overwrote it):

979/982 [============================>.] - ETA: 20s - loss: 39.1206 - l1_loss: 11.0472 - l2_loss: 27.5336 - l1_p: 0.1742 - l1_r: 0.0855 - l2_p: 0.0486 - l2_r: 0
980/982 [============================>.] - ETA: 13s - loss: 39.1038 - l1_loss: 11.0427 - l2_loss: 27.5213 - l1_p: 0.1744 - l1_r: 0.0855 - l2_p: 0.0487 - l2_r: 0
981/982 [============================>.] - ETA: 6s - loss: 39.0847 - l1_loss: 11.0408 - l2_loss: 27.5041 - l1_p: 0.1747 - l1_r: 0.0856 - l2_p: 0.0487 - l2_r: 0.0118

If I set MAXEP=1, training completes after 2 hours and I get yolo_model.h5. I tried "make inference" with this model and it didn't seem to detect anything. I also tried the pre-built yolo_model.h5 in the asset directory, and that works well. The instructions say to use MAXEP=10, so perhaps this is why my model doesn't work? Why does it hang at the end of Epoch 1/10?

rogerkuo1981 commented 2 years ago

I ran into the same problem. Did you manage to solve it?

henryzhao321 commented 2 years ago

Yes, I think so. I was running it in an Ubuntu virtual machine, and setting the number of CPU cores to 2 or more seemed to fix it.
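For anyone who wants to check this before starting a long run, here is a minimal sketch that warns when the process can use fewer than 2 CPU cores. It is not part of the repository; the function names `usable_cpu_count` and `check_cpu_cores` are hypothetical, and the threshold of 2 only reflects the workaround described above, not a confirmed root cause.

```python
import os
import warnings


def usable_cpu_count():
    """Return how many CPU cores this process is allowed to run on."""
    try:
        # Available on Linux; respects CPU affinity limits set on the process.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms without sched_getaffinity.
        return os.cpu_count() or 1


def check_cpu_cores(min_cores=2):
    """Warn if fewer than min_cores are usable; return True when enough."""
    cores = usable_cpu_count()
    if cores < min_cores:
        warnings.warn(
            f"Only {cores} CPU core(s) available; training was reported to "
            f"hang at the end of the first epoch in a VM with fewer than "
            f"{min_cores} cores."
        )
        return False
    return True


if __name__ == "__main__":
    print("usable cores:", usable_cpu_count())
    check_cpu_cores()
```

Running this inside the VM before `make train` shows how many cores the guest actually exposes.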

BackMountainDevil commented 2 years ago

Please paste the full output. There isn't enough here to tell what happened.
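Since the hung process prints nothing and ignores Ctrl-C, one way to get more output is Python's standard `faulthandler` module, which can dump the stack of every thread after a timeout. The sketch below is only a suggestion under assumptions: the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical, and where to call them in the training script is left to whoever is debugging.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, log_file=sys.stderr):
    """Dump all thread stacks to log_file every timeout_s seconds.

    log_file must be a real file object with a file descriptor.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disarm_hang_watchdog():
    """Stop the periodic stack dumps."""
    faulthandler.cancel_dump_traceback_later()


# Example placement (assumption): arm the watchdog just before training
# starts, with a timeout longer than one normal epoch, and disarm it
# once training returns.
#
#   arm_hang_watchdog(3 * 60 * 60)
#   ...start training here...
#   disarm_hang_watchdog()
```

If the run hangs, the dumped traceback shows which call each thread is blocked in, which is the kind of output that would help narrow this down.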