xinntao / EDVR

Winning Solution in NTIRE19 Challenges on Video Restoration and Enhancement (CVPR19 Workshops) - Video Restoration with Enhanced Deformable Convolutional Networks. EDVR has been merged into BasicSR and this repo is a mirror of BasicSR.
https://github.com/xinntao/BasicSR

Not converging, why? #53

Closed: robotzheng closed this issue 5 years ago

robotzheng commented 5 years ago

19-07-01 09:44:55.725 - INFO: <epoch:221, iter: 448,100, lr:(2.583e-07,)>l_pix: 8.0168e+04
19-07-01 09:45:46.968 - INFO: <epoch:221, iter: 448,200, lr:(2.421e-07,)>l_pix: 6.3542e+04
19-07-01 09:46:39.644 - INFO: <epoch:222, iter: 448,300, lr:(2.267e-07,)>l_pix: 6.8591e+04
19-07-01 09:47:30.610 - INFO: <epoch:222, iter: 448,400, lr:(2.123e-07,)>l_pix: 6.2289e+04
19-07-01 09:48:21.469 - INFO: <epoch:222, iter: 448,500, lr:(1.987e-07,)>l_pix: 6.7909e+04
19-07-01 09:49:12.473 - INFO: <epoch:222, iter: 448,600, lr:(1.859e-07,)>l_pix: 5.4850e+04
19-07-01 09:50:03.302 - INFO: <epoch:222, iter: 448,700, lr:(1.741e-07,)>l_pix: 7.6995e+04
19-07-01 09:50:54.809 - INFO: <epoch:222, iter: 448,800, lr:(1.631e-07,)>l_pix: 6.6559e+04
19-07-01 09:51:45.599 - INFO: <epoch:222, iter: 448,900, lr:(1.531e-07,)>l_pix: 5.5888e+04
19-07-01 09:52:36.882 - INFO: <epoch:222, iter: 449,000, lr:(1.439e-07,)>l_pix: 5.7516e+04
19-07-01 09:53:27.801 - INFO: <epoch:222, iter: 449,100, lr:(1.355e-07,)>l_pix: 6.3166e+04
19-07-01 09:54:18.680 - INFO: <epoch:222, iter: 449,200, lr:(1.281e-07,)>l_pix: 5.9250e+04
19-07-01 09:55:09.492 - INFO: <epoch:222, iter: 449,300, lr:(1.215e-07,)>l_pix: 7.9755e+04
19-07-01 09:56:00.916 - INFO: <epoch:222, iter: 449,400, lr:(1.158e-07,)>l_pix: 7.5331e+04
19-07-01 09:56:51.746 - INFO: <epoch:222, iter: 449,500, lr:(1.110e-07,)>l_pix: 6.1268e+04
19-07-01 09:57:42.739 - INFO: <epoch:222, iter: 449,600, lr:(1.070e-07,)>l_pix: 5.8020e+04
19-07-01 09:58:34.047 - INFO: <epoch:222, iter: 449,700, lr:(1.039e-07,)>l_pix: 7.0684e+04
19-07-01 09:59:25.079 - INFO: <epoch:222, iter: 449,800, lr:(1.018e-07,)>l_pix: 6.1002e+04
19-07-01 10:00:16.479 - INFO: <epoch:222, iter: 449,900, lr:(1.004e-07,)>l_pix: 6.2216e+04
19-07-01 10:01:07.545 - INFO: <epoch:222, iter: 450,000, lr:(4.000e-04,)>l_pix: 6.2134e+04
19-07-01 10:01:07.546 - INFO: Saving models and training states.
19-07-01 10:01:58.558 - INFO: <epoch:222, iter: 450,100, lr:(4.000e-04,)>l_pix: 6.8700e+04
19-07-01 10:02:49.626 - INFO: <epoch:222, iter: 450,200, lr:(4.000e-04,)>l_pix: 5.7384e+04
19-07-01 10:03:42.350 - INFO: <epoch:223, iter: 450,300, lr:(4.000e-04,)>l_pix: 5.6992e+04
19-07-01 10:04:33.524 - INFO: <epoch:223, iter: 450,400, lr:(4.000e-04,)>l_pix: 5.5664e+04
19-07-01 10:05:24.229 - INFO: <epoch:223, iter: 450,500, lr:(4.000e-04,)>l_pix: 7.6057e+04
19-07-01 10:06:15.174 - INFO: <epoch:223, iter: 450,600, lr:(4.000e-04,)>l_pix: 7.2010e+04
19-07-01 10:07:06.118 - INFO: <epoch:223, iter: 450,700, lr:(4.000e-04,)>l_pix: 6.0641e+04
19-07-01 10:07:56.997 - INFO: <epoch:223, iter: 450,800, lr:(4.000e-04,)>l_pix: 5.9802e+04
19-07-01 10:08:47.783 - INFO: <epoch:223, iter: 450,900, lr:(4.000e-04,)>l_pix: 6.6285e+04
19-07-01 10:09:38.584 - INFO: <epoch:223, iter: 451,000, lr:(4.000e-04,)>l_pix: 6.3312e+04
19-07-01 10:10:29.622 - INFO: <epoch:223, iter: 451,100, lr:(3.999e-04,)>l_pix: 6.5559e+04
19-07-01 10:11:20.408 - INFO: <epoch:223, iter: 451,200, lr:(3.999e-04,)>l_pix: 7.3202e+04
19-07-01 10:12:11.144 - INFO: <epoch:223, iter: 451,300, lr:(3.999e-04,)>l_pix: 7.1605e+04
19-07-01 10:13:02.022 - INFO: <epoch:223, iter: 451,400, lr:(3.999e-04,)>l_pix: 6.4014e+04
19-07-01 10:13:53.078 - INFO: <epoch:223, iter: 451,500, lr:(3.999e-04,)>l_pix: 6.2185e+04
19-07-01 10:14:44.101 - INFO: <epoch:223, iter: 451,600, lr:(3.999e-04,)>l_pix: 5.7227e+04
19-07-01 10:15:34.804 - INFO: <epoch:223, iter: 451,700, lr:(3.999e-04,)>l_pix: 7.6518e+04
19-07-01 10:16:25.789 - INFO: <epoch:223, iter: 451,800, lr:(3.999e-04,)>l_pix: 6.0232e+04
19-07-01 10:17:16.691 - INFO: <epoch:223, iter: 451,900, lr:(3.998e-04,)>l_pix: 7.3397e+04
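For context, the jump of the learning rate from ~1.0e-07 back to 4.000e-04 at iter 450,000 in the log above is a scheduled restart of a cosine-annealing learning rate rather than a bug. Below is a minimal illustrative sketch of such a schedule; the 150k-iteration cycle length is only inferred from this log, and the repo's actual scheduler may differ.

```python
import math

# Illustrative sketch only (not the repo's scheduler): cosine annealing with
# periodic restarts. Every `period` iterations the lr is reset to `base_lr`,
# which matches the log's jump from ~1e-7 back to 4e-4 at iter 450,000.
def cosine_restart_lr(it, base_lr=4e-4, eta_min=1e-7, period=150_000):
    t = it % period  # position inside the current cycle
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / period))

for it in (449_800, 449_900, 450_000, 450_100):
    print(it, f"{cosine_restart_lr(it):.3e}")  # ~1.02e-07, ~1.00e-07, 4.000e-04, ~4.000e-04
```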

Xingyb14 commented 5 years ago

I met the same problem when training with the default 'train_EDVR_woTSA_M.yml'.

robotzheng commented 5 years ago

Quoting the training steps from the README:

1. Train with the config train_EDVR_woTSA_M.yml. [Example of training log], [Pre-trained model]
2. Train with the config train_EDVR_M.yml, whose initialization is from the model of Step 1. [Example of training log], [Pre-trained model]

Those example training logs do not converge either!

Marshall-yao commented 5 years ago

Hi, Xingyb14 and robotzheng. I'd like to ask you a question since I see you have run this code. I ran out of memory when running create_lmdb_mp.py on REDS. Could you share train_sharp_wval.lmdb?

Thanks a lot.

xinntao commented 5 years ago

@robotzheng @Xingyb14 The loss fluctuates and seems not to converge, but if you evaluate the checkpoints you will find that the performance (PSNR) actually increases.
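For anyone who wants to check this quickly, here is a minimal, generic PSNR sketch (not the repo's evaluation script; loading and aligning the restored and ground-truth frames is up to you):

```python
import numpy as np

def psnr(restored, gt, max_val=255.0):
    """PSNR in dB between a restored frame and its ground truth (same shape)."""
    mse = np.mean((restored.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Evaluating a few checkpoints this way should show PSNR rising even while l_pix looks flat.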

robotzheng commented 5 years ago

@xinntao Thanks a lot, I will check it, but mathematically it does not look good.

xinntao commented 5 years ago

What did you mean by saying "in math, it is not good"?

robotzheng commented 5 years ago

The logs do not show the loss function being optimized. Thanks a lot again!

Xingyb14 commented 5 years ago

@yaolugithub It's too large to share... sorry for that. You can wait for the low-memory-consumption version: https://github.com/xinntao/EDVR/issues/39#issuecomment-507649599
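Until that lands, the general idea behind a low-memory LMDB build is to write images one at a time and commit in chunks instead of holding the whole dataset in RAM. A minimal sketch follows; the paths, keys, and map_size are placeholders and this is not the repo's create_lmdb script:

```python
import glob

import cv2
import lmdb

def build_lmdb(img_dir, lmdb_path, map_size=1 << 40, commit_every=500):
    # Open the target LMDB with a generous map_size (placeholder value).
    env = lmdb.open(lmdb_path, map_size=map_size)
    txn = env.begin(write=True)
    for i, path in enumerate(sorted(glob.glob(f"{img_dir}/*.png"))):
        img = cv2.imread(path, cv2.IMREAD_UNCHANGED)
        _, buf = cv2.imencode(".png", img)          # compress before storing
        txn.put(path.encode("utf-8"), buf.tobytes())
        if (i + 1) % commit_every == 0:             # commit periodically to bound memory
            txn.commit()
            txn = env.begin(write=True)
    txn.commit()
    env.close()
```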

Xingyb14 commented 5 years ago

@xinntao Thanks for the reply.

mikeseven commented 5 years ago

I did a few runs with the original values and also tested different learning rates and batch sizes on 6 V100 GPUs. I'm retesting with your latest changes for creating the LMDBs. You can see my modifications in my fork.

The behavior is rather similar, i.e. it doesn't seem to converge and oscillates heavily. Something doesn't smell right, as if the model weren't learning much. It also doesn't seem to make a difference whether training runs for 600k or 100k iterations; even fewer might suffice.

siyuhsu commented 5 years ago

> I did a few runs with the original values and also tested different learning rates and batch sizes on 6 V100 GPUs. I'm retesting with your latest changes for creating the LMDBs. You can see my modifications in my fork.
>
> The behavior is rather similar, i.e. it doesn't seem to converge and oscillates heavily. Something doesn't smell right, as if the model weren't learning much. It also doesn't seem to make a difference whether training runs for 600k or 100k iterations; even fewer might suffice.

I also have this situation, like this: [image]

mikeseven commented 5 years ago

Yes same here. I think the medium size model is overfitting and what we see is just noise.

xinntao commented 5 years ago

@robotzheng @Xingyb14 @mikeseven @siyux Regarding the problem "The training loss curve of EDVR does not converge", I have given some explanations in the FAQs.
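Since the per-iteration l_pix values clearly jump around from one logged step to the next (see the log pasted above), one simple way to check for a slow downward trend is to smooth them. A small, illustrative parser/smoother; the log path is a placeholder and the regex just matches lines in the format shown in this thread:

```python
import re

def smoothed_l_pix(log_path, window=500):
    """Parse l_pix values from a training log and return their running mean."""
    values = []
    with open(log_path) as f:
        for line in f:
            m = re.search(r"l_pix:\s*([0-9.]+e[+-]?[0-9]+)", line)
            if m:
                values.append(float(m.group(1)))
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)          # trailing window of `window` entries
        out.append(sum(values[lo:i + 1]) / (i + 1 - lo))
    return out
```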

mikeseven commented 5 years ago

Thanks for your response and explanation. It makes sense, and I have been looking at l_pix across all GPUs. Still, each GPU shows a similarly highly fluctuating loss, and a reduce-sum over them doesn't change the picture much. Interestingly, yes, PSNR improves. Anyway, something odd is going on; I'm not sure what yet, but it would be helpful to report the usual metrics during training, even if it slows training down a bit.