henrychen1620 opened this issue 7 years ago (status: Open)
You could lower the initial learning rate and train with it for a few iterations. Then kill the training job and resume the training with the original learning rate.
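In concrete terms, this warm-up amounts to two short runs; a minimal sketch below, assuming you train with the caffe binary and a hand-edited solver.prototxt (the learning-rate values, paths, and iteration counts are illustrative placeholders, not taken from this thread):

```
# Step 1: temporarily lower base_lr and train for a few hundred iterations.
base_lr: 0.0001        # warm-up value; restore the original value afterwards
lr_policy: "multistep"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
snapshot: 500          # snapshot early so the warm-up run leaves a .solverstate to resume from
snapshot_prefix: "models/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300"

# Step 2: put base_lr back to the original value and resume from the warm-up snapshot:
#   ./build/tools/caffe train \
#     --solver=models/VGGNet/VOC0712/SSD_300x300/solver.prototxt \
#     --snapshot=models/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300_iter_500.solverstate
```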
Thanks, it helps. I changed the learning rate to 0.0001 at the beginning.
@henrychen1620 Hello, did you solve this problem? I now have the same problem when training SSD on my own dataset. If you solved it, could you give me some suggestions?
Hi, after tuning the learning rate to about 0.0001, the loss is no longer NaN, but the results are not acceptable. So in the end I just used the pre-trained model and didn't use my own dataset TAT
@henrychen1620 Thank you very much.
@henrychen1620 You mentioned that the loss is no longer NaN at the start of training when you use a lower lr. What do you mean by "So at the end I used the pre-trained model and didn't use my own dataset TAT"? Do you mean you used the pre-trained model and increased the lr?
Inappropriate settings of `variance` in `prior_box_param` could lead to NaN loss.
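For reference, this is the parameter being referred to. Below is a sketch of a PriorBox layer in the style of the generated SSD300 prototxt with the commonly used variance values (the layer name, sizes, and step here are illustrative):

```
layer {
  name: "conv4_3_norm_mbox_priorbox"
  type: "PriorBox"
  bottom: "conv4_3_norm"
  bottom: "data"
  top: "conv4_3_norm_mbox_priorbox"
  prior_box_param {
    min_size: 30.0
    aspect_ratio: 2
    flip: true
    clip: false
    # Standard SSD values; the variances scale the encoded box regression
    # targets, so values that are too small can blow up mbox_loss.
    variance: 0.1
    variance: 0.1
    variance: 0.2
    variance: 0.2
    step: 8
    offset: 0.5
  }
}
```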
Issue summary
Hi everyone, I have a question about training SSD. I tried to follow the instructions on the SSD homepage and only changed the training batch size to 25 (due to a lack of GPU memory). But at the beginning of training I got the message "Train net output #0: mbox_loss = nan". I'm wondering if anyone has had a similar problem, since I didn't find any related post on GitHub or other websites. Thanks a lot!
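For context on where that batch size lives: it is set in the generated train prototxt (or via the batch_size variable in examples/ssd/ssd_pascal.py if you use the provided script). A rough sketch below, with the LMDB path as a placeholder; note that iter_size in the solver multiplies the per-iteration batch, so batch_size × iter_size is the effective batch per weight update.

```
# train.prototxt (as generated by examples/ssd/ssd_pascal.py) -- per-iteration batch size:
layer {
  name: "data"
  type: "AnnotatedData"
  top: "data"
  top: "label"
  data_param {
    source: "examples/VOC0712/VOC0712_trainval_lmdb"   # placeholder LMDB path
    batch_size: 25     # lowered from the stock 32 to fit GPU memory
    backend: LMDB
  }
}

# solver.prototxt -- gradients are accumulated over iter_size forward/backward passes,
# so the effective batch per update is batch_size * iter_size.
iter_size: 1
```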