The model cannot converge when training

yuantailing / ctw-baseline

Baseline methods for [CTW dataset](https://ctwdataset.github.io/)

MIT License


The model cannot converge when training #11

Open dodgaga opened 6 years ago

dodgaga commented 6 years ago

Hi,

I followed the instructions to train the SSD model, but the loss does not decrease.

At the beginning, base_lr = 0.001 and the loss was nan. Then I set a lower base_lr = 0.0001; the loss dropped from 40+ to ~10 and then stopped changing. Next, I killed the training, set base_lr = 0.001, and resumed training; the loss was nan again. So maybe 0.001 is too big for the model. I lowered the learning rate to base_lr = 0.0004, but the loss always stays at ~8.

What loss should the SSD model finally reach? And can you give me some advice on training with this data?

yuantailing commented 6 years ago

I found the logs and plotted the loss. Here are the plots of the loss for iterations 0 - 120,000 and for iterations 1,000 - 120,000.

yuantailing commented 6 years ago

Since there are ~320,000 subimages, 1 epoch is about 320,000 / 14 ≈ 23,000 iterations. Don't expect the loss to fall before 1 epoch has finished.
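The epoch arithmetic above can be checked with a few lines of Python. This is only a sketch: the subimage count is the approximate figure quoted in this thread, and the helper names `iterations_per_epoch` and `epochs_seen` are made up for illustration; they are not part of ctw-baseline.

```python
import math

# Approximate figures quoted in this thread (not read from the dataset).
NUM_SUBIMAGES = 320_000
BATCH_SIZE = 14


def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of iterations needed to see every sample once (hypothetical helper)."""
    return math.ceil(num_samples / batch_size)


def epochs_seen(iteration: int, num_samples: int, batch_size: int) -> float:
    """How many epochs have been completed after `iteration` iterations (hypothetical helper)."""
    return iteration * batch_size / num_samples


if __name__ == "__main__":
    print(iterations_per_epoch(NUM_SUBIMAGES, BATCH_SIZE))  # roughly 23,000
    for it in (1_000, 40_000, 120_000):
        print(it, round(epochs_seen(it, NUM_SUBIMAGES, BATCH_SIZE), 2))
```

Under these assumptions, a loss that is still flat after a few thousand iterations has seen only a small fraction of one epoch.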

base_lr = 0.001 is OK for batch_size = 14. If it does not converge, I think you can set base_lr to somewhere in 0.0004 - 0.0008, and there is no need to modify it after that.

dodgaga commented 6 years ago

Thanks! You are right. After 40,000 iterations, the loss tends toward 4. The final loss in your log is ~3. Don't you think that is a bit too high?

yuantailing commented 6 years ago

Maybe it is too high, or maybe the losses are not comparable. The final AP should be close to that of YOLOv2.