Weird error during training

kl2005ad commented 8 years ago

The project compiles fine, but I hit a weird error when I start the Pascal training using ssd_pascal.py without any changes (except the data path and GPUs). Context: 1 GPU, CUDA 7.0, cuDNN v4. Below are the error messages:

... ...
  multibox_loss_param {
    loc_loss_type: SMOOTH_L1
    conf_loss_type: SOFTMAX
    loc_weight: 1
    num_classes: 21
    share_location: true
    match_type: PER_PREDICTION
    overlap_threshold: 0.5
    use_prior_for_matching: true
    background_label_id: 0
    use_difficult_gt: true
    do_neg_mining: true
    neg_pos_ratio: 3
    neg_overlap: 0.5
    code_type: CENTER_SIZE
  }
}
I0713 20:19:47.714035 23178 layer_factory.hpp:77] Creating layer data
I0713 20:19:47.714771 23178 net.cpp:91] Creating Layer data
I0713 20:19:47.714797 23178 net.cpp:399] data -> data
I0713 20:19:47.714933 23178 net.cpp:399] data -> label
I0713 20:19:47.716934 23266 db_lmdb.cpp:35] Opened lmdb examples/VOC0712/VOC0712_trainval_lmdb
I0713 20:19:47.742223 23178 annotated_data_layer.cpp:52] output data size: 32,3,300,300
I0713 20:19:47.838522 23178 net.cpp:141] Setting up data
I0713 20:19:47.838660 23178 net.cpp:148] Top shape: 32 3 300 300 (8640000)
I0713 20:19:47.838678 23178 net.cpp:148] Top shape: 1 1 1 8 (8)
I0713 20:19:47.838685 23178 net.cpp:156] Memory required for data: 34560032
I0713 20:19:47.838709 23178 layer_factory.hpp:77] Creating layer data_data_0_split
I0713 20:19:47.838824 23178 net.cpp:91] Creating Layer data_data_0_split
I0713 20:19:47.838837 23178 net.cpp:425] data_data_0_split <- data
I0713 20:19:47.838865 23178 net.cpp:399] data_data_0_split -> data_data_0_split_0
I0713 20:19:47.838891 23178 net.cpp:399] data_data_0_split -> data_data_0_split_1
I0713 20:19:47.838907 23178 net.cpp:399] data_data_0_split -> data_data_0_split_2
I0713 20:19:47.838927 23178 net.cpp:399] data_data_0_split -> data_data_0_split_3
I0713 20:19:47.838939 23178 net.cpp:399] data_data_0_split -> data_data_0_split_4
I0713 20:19:47.838951 23178 net.cpp:399] data_data_0_split -> data_data_0_split_5
I0713 20:19:47.838963 23178 net.cpp:399] data_data_0_split -> data_data_0_split_6
I0713 20:19:47.839156 23178 net.cpp:141] Setting up data_data_0_split
I0713 20:19:47.839172 23178 net.cpp:148] Top shape: 32 3 300 300 (8640000)
I0713 20:19:47.839181 23178 net.cpp:148] Top shape: 32 3 300 300 (8640000)
I0713 20:19:47.839208 23178 net.cpp:148] Top shape: 32 3 300 300 (8640000)
I0713 20:19:47.839220 23178 net.cpp:148] Top shape: 32 3 300 300 (8640000)
I0713 20:19:47.839227 23178 net.cpp:148] Top shape: 32 3 300 300 (8640000)
I0713 20:19:47.839236 23178 net.cpp:148] Top shape: 32 3 300 300 (8640000)
I0713 20:19:47.839244 23178 net.cpp:148] Top shape: 32 3 300 300 (8640000)
I0713 20:19:47.839251 23178 net.cpp:156] Memory required for data: 276480032
I0713 20:19:47.839257 23178 layer_factory.hpp:77] Creating layer conv1_1
I0713 20:19:47.839328 23178 net.cpp:91] Creating Layer conv1_1
I0713 20:19:47.839339 23178 net.cpp:425] conv1_1 <- data_data_0_split_0
I0713 20:19:47.839354 23178 net.cpp:399] conv1_1 -> conv1_1
F0713 20:19:47.872011 23267 math_functions.cpp:250] Check failed: a <= b (1 vs. 1)
*** Check failure stack trace: ***
    @ 0x7f1ec1a0f4dd google::LogMessage::Fail()
    @ 0x7f1ec1a115ef google::LogMessage::SendToLog()
    @ 0x7f1ec1a0f0cc google::LogMessage::Flush()
    @ 0x7f1ec1a11e8d google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f1ec2290c1f caffe::caffe_rng_uniform<>()
    @ 0x7f1ec22911b4 caffe::SampleBBox()
    @ 0x7f1ec229165d caffe::GenerateSamples()
    @ 0x7f1ec2291861 caffe::GenerateBatchSamples()
    @ 0x7f1ec211033d caffe::AnnotatedDataLayer<>::load_batch()
    @ 0x7f1ec219d4b1 caffe::BasePrefetchingDataLayer<>::InternalThreadEntry()
    @ 0x7f1ec210d176 caffe::InternalThread::entry()
    @ 0x7f1eb9331ce9 (unknown)
    @ 0x7f1eb4ea1e9a start_thread
    @ 0x7f1ec100d36d (unknown)
Aborted

I checked that line 250 of math_functions.cpp is CHECK_LE(a, b); shouldn't that pass when a = 1 and b = 1?

weiliu89 commented 8 years ago

What if you try running it again? It might be due to a float precision issue.
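For illustration, here is one way a message can read "1 vs. 1" while a <= b is still false. It assumes the check message prints its operands with about six significant digits, as a C++ stream does by default; I have not verified that for this build. The sketch is in Python and emulates a C++ float with struct; to_f32 and next_up_f32 are illustrative helpers, not Caffe functions.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def next_up_f32(x):
    """Return the next float32 above a positive float32 value x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]

b = to_f32(1.0)
a = next_up_f32(b)  # smallest float32 strictly greater than 1

# "%g" keeps six significant digits, which is the assumed log format.
print("a <= b:", a <= b)            # the comparison itself
print("(%g vs. %g)" % (a, b))       # six digits: both operands look the same
print("(%.9g vs. %.9g)" % (a, b))   # nine digits: the difference shows up
```

So identical printed operands do not prove that the two values were equal; printing them with more digits would settle it.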

kl2005ad commented 8 years ago

@weiliu89 I tried a couple of times. I only observe this issue on this machine; I tried SSD on AWS previously and it worked fine. So might it just be a problem with the linear algebra libs?

weiliu89 commented 8 years ago

I am not sure. A quick test is to change float to double in these lines and see if it is due to a precision issue.
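A rough sketch of why that swap is a useful test, using hypothetical values that are not taken from the SSD sampling code: when one bound passes through float32 and the other stays in double, rounding can put them in the wrong order, while the same comparison in double passes. to_f32 and bounds_ok are illustrative names only.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def bounds_ok(lo, hi):
    """Stand-in for a CHECK_LE(lo, hi)-style precondition."""
    return lo <= hi

lo = 0.7                 # hypothetical lower bound, kept in double
hi_float = to_f32(0.7)   # same nominal bound after a trip through float32
hi_double = 0.7          # same nominal bound kept in double

print("mixed float/double:", bounds_ok(lo, hi_float))
print("double only:       ", bounds_ok(lo, hi_double))
print("(%g vs. %g)" % (lo, hi_float))  # both print the same at six digits
```

If the failure goes away after switching to double, rounding is a likely cause; if it persists, the problem is probably elsewhere.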

fnzhan commented 7 years ago

Hi, how did you solve this problem? I have the same error.

weiliu89 / caffe

Weird error during training #41