Kongsea opened this issue 6 years ago
Please adjust the parameters, such as learning rate.
Thank you. Now it works well.
It raised this error again even after I set the LR to 0.00005. Please help to check the reason. Thank you.
Besides, during training, fast_rcnn_loc_loss stays constant at 0.0. @yangxue0827
Yes, the second-stage loss is often zero at the beginning. However, once the first stage has trained long enough to provide high-quality proposals, the second-stage loss will start to change.
If your loss is always zero, it is because the first stage is not well trained. I suggest you adjust the scales and ratios of the anchors.
I changed ANCHOR_SCALES from [1.] to [2**-2, 2**-1, 1.], and now it works.
However, fast_rcnn_loc_loss is still always 0.0 after 10k iterations of training. Does that make sense?
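For reference, a minimal sketch of the two config changes discussed in this thread, assuming a cfgs.py-style config like this repository's (the exact file and variable names may differ in your version):

```
# cfgs.py (sketch): the two settings discussed above -- names assumed, check your own config.

# A smaller learning rate helps keep the loss from blowing up to inf/nan.
LR = 0.00005

# Multiple anchor scales instead of a single [1.] so the RPN can match
# both small and large ground-truth boxes; with too few matched proposals
# the second-stage fast_rcnn_loc_loss can stay at 0.0.
ANCHOR_SCALES = [2 ** -2, 2 ** -1, 1.]
ANCHOR_RATIOS = [0.5, 1., 2.]  # example ratios; tune to your dataset
```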
I have finally trained the model successfully. The losses have dropped to a normal level and the detection results using the final model are also good. However, the total_loss is still very large, and I don't know whether that makes sense. The terminal output is as follows:
```
step38181 image_name:200439.jpg
rpn_loc_loss:0.0025 | rpn_cla_loss:0.0001 | rpn_total_loss:0.0026
fast_rcnn_loc_loss:0.0006 | fast_rcnn_cla_loss:0.0019 | fast_rcnn_total_loss:0.0025
total_loss:0.6207 | pre_cost_time:0.5411s
```
Could you please help check whether this is correct? Thank you. @yangxue0827
This is normal; the total loss includes not only the losses shown above but also the loss from weight decay.
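To illustrate where the gap comes from, here is a minimal TF 1.x sketch (placeholder constants stand in for the real loss tensors; `tf.losses.get_regularization_losses` is the standard API slim uses to collect weight-decay terms):

```
import tensorflow as tf

# Illustrative placeholders for the printed RPN / Fast R-CNN totals.
rpn_total_loss = tf.constant(0.0026)
fast_rcnn_total_loss = tf.constant(0.0025)
model_loss = rpn_total_loss + fast_rcnn_total_loss

# Weight decay is recorded as per-variable regularization losses; slim's
# create_train_op adds them on top of the model loss, which is why the
# printed total_loss (~0.62) is much larger than the per-head terms.
reg_losses = tf.losses.get_regularization_losses()
weight_decay_loss = tf.add_n(reg_losses) if reg_losses else tf.constant(0.)

total_loss = model_loss + weight_decay_loss
```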
Thank you for your rapid answer.
@Kongsea Which parameters did you change in your final training run? I met the same problem: the loss tensor is NaN. I have already set the LR to 1e-5.
As I mentioned above, I changed ANCHOR_SCALES; you can give that a try. Besides, you should carefully check your data. The error may be caused by your annotations, for example if xmin or ymin is less than 0, or xmax or ymax is larger than the image width or height.
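As a quick way to run that data check, here is a small sketch that scans Pascal-VOC-style XML annotations for out-of-bounds or degenerate boxes (the annotation directory path is an assumption; adjust it and the tag names to your dataset):

```
import os
import xml.etree.ElementTree as ET

ANNOTATION_DIR = "VOCdevkit/VOC2007/Annotations"  # assumed path; adjust to your dataset

for name in os.listdir(ANNOTATION_DIR):
    if not name.endswith(".xml"):
        continue
    root = ET.parse(os.path.join(ANNOTATION_DIR, name)).getroot()
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # Flag boxes outside the image or with non-positive size;
        # such boxes can drive the localization loss to inf/nan.
        if (xmin < 0 or ymin < 0 or xmax > width or ymax > height
                or xmax <= xmin or ymax <= ymin):
            print("Bad box in %s: (%s, %s, %s, %s), image %sx%s"
                  % (name, xmin, ymin, xmax, ymax, width, height))
```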
Hello, I ran into this problem too. I changed the anchor scales and the LR, but neither helped. I also checked the dataset and there are no out-of-bounds boxes, yet the problem is still not solved. Do you have any other suggestions? Thank you. @Kongsea @yangxue0827
I recommend the improved code at https://github.com/DetectionTeamUCAS/FPN_Tensorflow. @Kongsea @FantasticEthan @TVXQ20031226
Thank you.
```
INFO:tensorflow:global step 124368: loss = 0.1140 (0.138 sec/step)
INFO:tensorflow:global step 124369: loss = 0.0992 (0.130 sec/step)
INFO:tensorflow:global step 124370: loss = 358068508261867004967067454537728.0000 (0.131 sec/step)
INFO:tensorflow:global step 124371: loss = 0.0627 (0.133 sec/step)
INFO:tensorflow:global step 124372: loss = 0.2287 (0.133 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values
  [[node CheckNumerics (defined at /home/john/Desktop/models/research/object_detection/legacy/trainer.py:322) = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

Caused by op 'CheckNumerics', defined at:
  File "legacy/train.py", line 185, in <module>

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
  [[node CheckNumerics (defined at /home/john/Desktop/models/research/object_detection/legacy/trainer.py:322) = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

Traceback (most recent call last):
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values
  [[{{node CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "legacy/train.py", line 185, in <module>

Caused by op 'CheckNumerics', defined at:
  File "legacy/train.py", line 185, in <module>

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values
  [[node CheckNumerics (defined at /home/john/Desktop/models/research/object_detection/legacy/trainer.py:322) = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]
```
Hi, thank you in advance. Could you please help me resolve the above error? Thanks once again.
Model: faster_rcnn_inception_v2_pets.config
Graphics card: Titan XP
Nvidia driver version: 384.130
TensorFlow version: 1.12.0
CUDA: 9.0
cuDNN: 7.3.0
After running some steps, it raised the following error. Please help check it. Thank you.
```
Caused by op u'train_op/CheckNumerics', defined at:
  File "train.py", line 228, in <module>
    train()
  File "train.py", line 138, in train
    train_op = slim.learning.create_train_op(total_loss, optimizer, global_step)  # rpn_total_loss,
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 439, in create_train_op
    check_numerics=check_numerics)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/training.py", line 464, in create_train_op
    'LossTensor is inf or nan')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 569, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values
  [[Node: train_op/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]
```
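If it is hard to tell which term goes non-finite first, one debugging sketch (standard `tf.check_numerics` from TF 1.x, applied here to placeholder tensors rather than this repository's actual loss variables) is to wrap each individual loss before summing, so the error message names the offending term instead of only the combined LossTensor:

```
import tensorflow as tf

def checked(loss, label):
    """Attach a CheckNumerics op so a NaN/inf in this term is reported by name."""
    return tf.check_numerics(loss, "loss term went inf/nan: " + label)

# Illustrative placeholders; in practice wrap the real per-head losses
# before they are summed into total_loss.
rpn_loc_loss = checked(tf.constant(0.0025), "rpn_loc_loss")
rpn_cla_loss = checked(tf.constant(0.0001), "rpn_cla_loss")
fast_rcnn_loc_loss = checked(tf.constant(0.0006), "fast_rcnn_loc_loss")
fast_rcnn_cla_loss = checked(tf.constant(0.0019), "fast_rcnn_cla_loss")

total_loss = rpn_loc_loss + rpn_cla_loss + fast_rcnn_loc_loss + fast_rcnn_cla_loss
```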