weiliu89 / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Execute "examples/ssd/ssd_pascal.py" but got loss=nan at the begining [SOLVED] #543

Open henrychen1620 opened 7 years ago

henrychen1620 commented 7 years ago

Issue summary

Hi everyone, I have a question about training SSD. I tried to follow the instructions on the SSD homepage and only changed the training batch size to 25 (due to a lack of GPU memory), but at the beginning of training I got the message "Train net output #0: mbox_loss = nan". I'm wondering if anyone has run into a similar problem, since I didn't find any related post on GitHub or other websites. Thanks a lot!
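For context, the batch-size edit amounts to something like the sketch below for a script modeled on examples/ssd/ssd_pascal.py; the variable names are assumptions based on the stock script and may differ. The point is that shrinking the per-iteration batch without touching iter_size also shrinks the effective batch that the default learning rate was tuned for.

```python
# Hedged sketch of the relevant knobs in a script like examples/ssd/ssd_pascal.py.
# Variable names are assumptions and may not match the stock script exactly.
batch_size = 25        # per-iteration batch, lowered from 32 to fit GPU memory
accum_batch_size = 32  # effective batch size the default hyperparameters assume

# Caffe's solver-level iter_size accumulates gradients over this many
# forward/backward passes before each weight update. With batch_size = 25,
# 32 // 25 == 1, so the effective batch drops to 25 while base_lr stays at the
# value tuned for 32 -- one plausible reason the loss can diverge to nan.
iter_size = max(1, accum_batch_size // batch_size)
```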

weiliu89 commented 7 years ago

You could lower the initial learning rate and train with it for a few iterations, then kill the training job and resume training with the original learning rate.
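A minimal pycaffe sketch of this warm-up-then-resume procedure, assuming a copy of the solver prototxt with a lowered base_lr; the file names, pretrained weights, and iteration counts below are illustrative, not taken from the repo:

```python
import caffe

caffe.set_mode_gpu()

# 1) Warm up: train a few hundred iterations with a reduced base_lr (e.g. 0.0001)
#    defined in a copy of the solver prototxt (assumed file name).
warmup = caffe.SGDSolver('solver_warmup.prototxt')
warmup.net.copy_from('VGG_ILSVRC_16_layers_fc_reduced.caffemodel')  # pretrained base
warmup.step(500)
warmup.snapshot()  # writes a .caffemodel and a .solverstate

# 2) Resume: restore that snapshot under the original solver definition
#    (original base_lr) and continue the normal schedule.
solver = caffe.SGDSolver('solver.prototxt')
solver.restore('ssd_warmup_iter_500.solverstate')  # actual name depends on snapshot_prefix
solver.step(120000 - 500)
```

The same effect can be had from the command line by killing the first run and relaunching caffe train with the --snapshot flag pointed at the saved .solverstate.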

henrychen1620 commented 7 years ago

Thanks, that helps. I changed the learning rate to 0.0001 at the beginning.

YangBain commented 7 years ago

@henrychen1620 Hello, did you solve this problem? I now have the same problem when training SSD on my own dataset. If you solved it, could you give me some suggestions?

henrychen1620 commented 7 years ago

Hi, after lowering the learning rate to about 0.0001 the loss is no longer NaN, but the results are not acceptable. So at the end I used the pre-trained model and didn't use my own dataset TAT

YangBain commented 7 years ago

@henrychen1620 Thank you very much.

abhisheksgumadi commented 7 years ago

@henrychen1620 You mentioned that the loss is no longer NaN when you use a lower learning rate. What do you mean when you say "So at the end I used the pre-trained model and didn't use my own dataset TAT", please? Do you mean you used the pre-trained model and increased the learning rate?

Coldmooon commented 6 years ago

Inappropriate settings of variance in prior_box_param can also lead to NaN loss.
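For reference, the prior-box variances commonly used by the SSD example configs are shown below as an illustrative Python constant (as such a value might appear in a script like ssd_pascal.py); this is a hedged sketch, not a quote from the repo.

```python
# Commonly used SSD prior-box variances, in the order
# [x_center, y_center, width, height]: 0.1 for the center offsets and 0.2 for
# the width/height terms. The box encoder divides the regression targets by
# these values, so mismatched or very small variances inflate the targets and
# can push mbox_loss to nan.
prior_variance = [0.1, 0.1, 0.2, 0.2]
```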