whatcanisay-k / Center-and-Scale-Prediction-CSP-Pytorch

Pytorch implementation of CSP
22 stars, 7 forks

Exploding Loss after 70 epochs #3

Closed: Alx-Wo closed this issue 4 years ago

Alx-Wo commented 5 years ago

Hi, I trained the network on the CityPersons dataset as described in the repo. I changed only two things: the batch size from 4 to 2, because I'm training on a single GPU, and the learning rate to 2e-4, as in the original paper. Everything looked fine and the loss decreased each epoch. However, at epoch 74 of 150, the loss started to increase rapidly until it became NaN...

From time to time, NumPy also warned that the log function was dividing by zero, but without crashing...

Did anyone experience similar errors?
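A minimal sketch of one common guard against the log warning described above, assuming the zero comes from a value that is expected to be strictly positive. NumPy returns `-inf` with a warning for `log(0)`, which can later turn a loss into NaN. The name `safe_log` and the floor `EPS` are hypothetical and not taken from this repo.

```python
import math

EPS = 1e-7  # assumed floor; choose a value suited to the data type in use


def safe_log(x, eps=EPS):
    """Return log(x), clamping x to at least eps so the result stays finite."""
    return math.log(max(x, eps))


# NumPy equivalent for arrays: np.log(np.maximum(x, eps))
print(safe_log(0.0))  # finite, roughly -16.1
print(safe_log(1.0))  # 0.0
```

Clamping hides the symptom rather than the cause, so it may still be worth finding which sample produces the zero.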

whatcanisay-k commented 4 years ago

The training data is randomly cropped from the training images, and a crop can occasionally produce invalid data. You can stop the training process and resume training from the last saved model.
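As an alternative to restarting by hand, a training loop can skip any batch whose loss is not finite, so that one bad crop does not corrupt the weights. This is only a sketch under that assumption. `train_steps`, `compute_loss`, `apply_update` and `toy_loss` are hypothetical names, not functions from this repo.

```python
import math


def train_steps(batches, compute_loss, apply_update):
    """Run one pass over batches, skipping any batch with a non-finite loss.

    Returns the indices of the skipped batches so they can be inspected.
    """
    skipped = []
    for i, batch in enumerate(batches):
        loss = compute_loss(batch)
        if not math.isfinite(loss):
            skipped.append(i)  # no weight update from a bad batch
            continue
        apply_update(loss)
    return skipped


def toy_loss(batch):
    # Hypothetical loss: a "bad crop" (value 0.0) yields an infinite loss.
    return -math.log(batch) if batch > 0 else float("inf")


applied = []
skipped = train_steps([1.0, 0.0, 2.0], toy_loss, applied.append)
print(skipped)       # [1]
print(len(applied))  # 2
```

In a PyTorch loop the same check could use `torch.isfinite(loss)` before calling `loss.backward()` and `optimizer.step()`.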

WangWenhao0716 commented 4 years ago

@Alx-Wo I have solved this problem. The ACSP paper (https://arxiv.org/abs/2002.09053) and code (https://github.com/WangWenhao0716/Adapted-Center-and-Scale-Prediction) will help.

WangWenhao0716 commented 4 years ago

Hi! I have received your comments, thank you for your kindness. However, I am very busy at the moment and cannot explain now. I will write a detailed reply as soon as possible. Could you give me your email address? That way we can discuss this question more conveniently. Best regards, Wenhao Wang, Beihang University.

------------------ Original message ------------------ From: "Alexander Wolpert"<notifications@github.com>; Sent: Friday, March 6, 2020, 3:25 PM; To: "zhangminwen/Center-and-Scale-Prediction-CSP-Pytorch"<Center-and-Scale-Prediction-CSP-Pytorch@noreply.github.com>; Cc: "王文昊"<2609667791@qq.com>;"Mention"<mention@noreply.github.com>; Subject: Re: [zhangminwen/Center-and-Scale-Prediction-CSP-Pytorch] Exploding Loss after 70 epochs (#3)

Hi @WangWenhao0716 I already had a quick glance at your paper a few days ago, nice to hear from you :) Some points I'd like to mention:

You are clearly trying to directly improve an existing implementation in your research. So why did you not work directly with the original Keras implementation? If you try to publish your paper, I think you will have a very hard time arguing that your PyTorch implementation does not contain any other bugs, and showing whether, and why, the changes you introduced really work on the original implementation! (For example, I don't think you are freezing your backbone, whereas the Keras implementation does freeze the first 2 ResNet-50 layers...) At the very least, you should include a comparison of your PyTorch base model and the original Keras base model, and show that on average they produce the same results on the test sets you want to evaluate on!

You say:

First, CSP [24] enlarges a picture with 3 channels into 64 channels through a 7x7 Conv layer. Certainly, BN layer, ReLU layer and Maxpool layer follow the Conv layer. In this way, a (3, 1024, 2048)(The bracket (, , ) denotes (#channels, height, width)) picture will be turned into a (64, 256, 512) one. Second, CSP [24] take 4 layers from ResNet-50

However, all those layers come directly from ResNet-50, whereas your description makes it sound as if the first 4 layers (conv7x7, bn, relu, maxpool) are not from ResNet-50?

With the original Keras implementation, I also only reached an MR of ~11.3% on average over multiple experiments on CityPersons, just as you reported in your paper for the PyTorch implementation.
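The shape change quoted above can be checked with the standard convolution output-size formula. The sketch below assumes the usual ResNet-50 stem settings (7x7 convolution with stride 2 and padding 3, then 3x3 max pooling with stride 2 and padding 1). The helper names `conv_out` and `resnet_stem_shape` are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_stem_shape(height, width, out_channels=64):
    """Shape after the assumed ResNet-50 stem: conv7x7/2 followed by maxpool3x3/2.

    BN and ReLU are omitted because they do not change the shape.
    """
    h = conv_out(height, kernel=7, stride=2, padding=3)
    w = conv_out(width, kernel=7, stride=2, padding=3)
    h = conv_out(h, kernel=3, stride=2, padding=1)
    w = conv_out(w, kernel=3, stride=2, padding=1)
    return (out_channels, h, w)


print(resnet_stem_shape(1024, 2048))  # (64, 256, 512)
```

The result agrees with the (64, 256, 512) figure in the quoted passage, which is consistent with these four layers being the ordinary ResNet-50 stem.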

The results from manipulating the aspect ratio are very interesting, and I did not think about it! Nice idea!

Switching the L1 Norm is also a nice little twist, which I did not expect to make such a difference.

In the end, I switched to this implementation of CSP: https://github.com/lw396285v/CSP-pedestrian-detection-in-pytorch, as it behaved very similarly to the original Keras implementation (numerically mostly stable).

After fixing some bugs in the teacher/student implementation and in the L1 norm calculation, I also achieved a 10.8% MR on CityPersons, without changing the L1 loss, BNs or aspect ratios. Admittedly, this could have been a statistical fluke, as I only ran 2 experiments, both of which achieved an MR slightly below 11.0%, with a best result of 10.8%. However, as this is not what I'm trying to work with in the end, I did not pursue those experiments any further. Sadly, I'm not allowed to share any of my code as of now; maybe in 4-6 months I can make a public repo.
