yangxue0827 / FPN_Tensorflow

A Tensorflow implementation of FPN detection framework.

InvalidArgumentError (see above for traceback): LossTensor is inf or nan #2

Open Kongsea opened 6 years ago

Kongsea commented 6 years ago

After running for some steps, training raised the following error. Could you please take a look? Thank you.

Caused by op u'train_op/CheckNumerics', defined at:
  File "train.py", line 228, in <module>
    train()
  File "train.py", line 138, in train
    train_op = slim.learning.create_train_op(total_loss, optimizer, global_step)  # rpn_total_loss,
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 439, in create_train_op
    check_numerics=check_numerics)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/training/python/training/training.py", line 464, in create_train_op
    'LossTensor is inf or nan')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 569, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values [[Node: train_op/CheckNumerics = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

yangxue0827 commented 6 years ago

Please adjust the hyperparameters, such as the learning rate.

Kongsea commented 6 years ago

Thank you. Now it works well.

Kongsea commented 6 years ago

The error occurred again even after I set the LR to 0.00005. Could you please help find the cause? Thank you.

Kongsea commented 6 years ago

Besides, during training, the fast_rcnn_loc_loss is a constant: 0.0 @yangxue0827

yangxue0827 commented 6 years ago

Yes, the second-stage loss is often zero at the beginning. Once the first stage has trained enough to provide high-quality proposals, the second-stage loss will start to change.

yangxue0827 commented 6 years ago

If your loss is always zero, it is because the first phase is not well trained. I suggest you adjust the scale and ratio of anchors.

Kongsea commented 6 years ago

I changed ANCHOR_SCALES from `[1.]` to `[2**-2, 2**-1, 1.]`, and now it works.

However, the fast_rcnn_loc_loss is still always 0.0 after 10k iterations of training. Does that make sense?
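
To make the ANCHOR_SCALES change concrete, here is a minimal sketch of how adding smaller scale multipliers changes the anchor side lengths. It assumes each pyramid level has a base anchor size that the scales multiply. The base sizes below (32 to 512) are an illustrative assumption rather than values read from this repository's config, and `anchor_sizes` is a hypothetical helper.

```python
# Hypothetical base anchor sizes per pyramid level (assumed, not taken from the repo config).
BASE_ANCHOR_SIZES = [32, 64, 128, 256, 512]


def anchor_sizes(base_sizes, scales):
    """Return, for each pyramid level, the anchor side lengths base * scale."""
    return [[base * s for s in scales] for base in base_sizes]


old = anchor_sizes(BASE_ANCHOR_SIZES, [1.])
new = anchor_sizes(BASE_ANCHOR_SIZES, [2 ** -2, 2 ** -1, 1.])

# Print the side lengths before and after the change for each level.
for base, before, after in zip(BASE_ANCHOR_SIZES, old, new):
    print(base, before, after)
```

Under these assumptions the smaller scales add anchors down to a quarter of each base size, which may help the first stage match small ground-truth boxes.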

Kongsea commented 6 years ago

I have finally trained the model successfully. The losses of the model are finally reduced to a normal level and the detection results using the final model are also good. However, the total_loss is still very big. I don't know if it makes sense. The output of the terminal is as follows:

```
step38181 image_name:200439.jpg
rpn_loc_loss:0.0025 | rpn_cla_loss:0.0001 | rpn_total_loss:0.0026
fast_rcnn_loc_loss:0.0006 | fast_rcnn_cla_loss:0.0019 | fast_rcnn_total_loss:0.0025
total_loss:0.6207 | pre_cost_time:0.5411s
```

Could you please help to check if it's correct. Thank you. @yangxue0827

yangxue0827 commented 6 years ago

This is normal, and the total loss includes not only the losses shown above, but also loss from weight decay.
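
As a rough check of this explanation, the arithmetic on the step 38181 log can be sketched as follows. The only assumption is that whatever the two printed detection losses do not account for is regularization (weight decay). The variable names are illustrative.

```python
# Values copied from the step 38181 log line in this thread.
rpn_total_loss = 0.0026
fast_rcnn_total_loss = 0.0025
total_loss = 0.6207

detection_loss = rpn_total_loss + fast_rcnn_total_loss

# Assumption: the remainder of total_loss is the weight decay (regularization) term.
implied_regularization = total_loss - detection_loss

print("detection loss: %.4f" % detection_loss)
print("implied regularization: %.4f" % implied_regularization)
```

The detection losses sum to about 0.0051, so roughly 0.6156 of the reported total would come from the regularization term under this assumption.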

Kongsea commented 6 years ago

Thank you for your rapid answer.

FantasticEthan commented 6 years ago

@Kongsea Which parameters did you change in your final training run? I have the same problem: the LossTensor is NaN. I have already set the LR to 1e-5.

Kongsea commented 6 years ago

As I mentioned above, I changed ANCHOR_SCALES. You can give that a try. Besides, you should carefully check your data, since the error may be caused by it. For example, check whether xmin or ymin is less than 0, or whether xmax or ymax exceeds the image width or height.
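
A minimal sketch of such a data check is shown below. It assumes boxes are `(xmin, ymin, xmax, ymax)` tuples in pixel coordinates. The helper name `find_invalid_boxes` is hypothetical, and the extra test for zero-area or inverted boxes is an added assumption beyond the out-of-bounds cases mentioned above.

```python
def find_invalid_boxes(boxes, width, height):
    """Return indices of boxes that leave the image or have no positive area.

    boxes: list of (xmin, ymin, xmax, ymax) in pixel coordinates.
    """
    bad = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        out_of_bounds = xmin < 0 or ymin < 0 or xmax > width or ymax > height
        # Assumption: zero-width, zero-height, or inverted boxes are also treated as invalid.
        degenerate = xmax <= xmin or ymax <= ymin
        if out_of_bounds or degenerate:
            bad.append(i)
    return bad


boxes = [
    (10, 20, 100, 200),  # valid
    (-5, 0, 50, 50),     # xmin < 0
    (30, 30, 700, 90),   # xmax > width
    (40, 40, 40, 80),    # zero width
]
print(find_invalid_boxes(boxes, width=640, height=480))  # [1, 2, 3]
```

Running a check like this over every annotation before training makes it easier to rule the dataset in or out as the source of the NaN.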

TVXQ20031226 commented 6 years ago

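> I changed ANCHOR_SCALES from `[1.]` to `[2**-2, 2**-1, 1.]`, and now it works.
>
> However, the fast_rcnn_loc_loss is still always 0.0 after 10k iterations of training. Does that make sense?

Hello, I ran into this problem too. Changing the anchor scale and the LR did not help. I also checked the dataset and found no out-of-bounds boxes, but the problem is still not solved. Do you have any suggestions? Thanks. @Kongsea @yangxue0827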

yangxue0827 commented 5 years ago

I recommend the improved code: https://github.com/DetectionTeamUCAS/FPN_Tensorflow @Kongsea @FantasticEthan @TVXQ20031226

Kongsea commented 5 years ago

Thank you.

Rajamohanreddyai commented 5 years ago

INFO:tensorflow:global step 124368: loss = 0.1140 (0.138 sec/step)
INFO:tensorflow:global step 124368: loss = 0.1140 (0.138 sec/step)
INFO:tensorflow:global step 124369: loss = 0.0992 (0.130 sec/step)
INFO:tensorflow:global step 124369: loss = 0.0992 (0.130 sec/step)
INFO:tensorflow:global step 124370: loss = 358068508261867004967067454537728.0000 (0.131 sec/step)
INFO:tensorflow:global step 124370: loss = 358068508261867004967067454537728.0000 (0.131 sec/step)
INFO:tensorflow:global step 124371: loss = 0.0627 (0.133 sec/step)
INFO:tensorflow:global step 124371: loss = 0.0627 (0.133 sec/step)
INFO:tensorflow:global step 124372: loss = 0.2287 (0.133 sec/step)
INFO:tensorflow:global step 124372: loss = 0.2287 (0.133 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, LossTensor is inf or nan. : Tensor had NaN values [[node CheckNumerics (defined at /home/john/Desktop/models/research/object_detection/legacy/trainer.py:322) = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'CheckNumerics', defined at:
  File "legacy/train.py", line 185, in <module>
    tf.app.run()
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "legacy/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/home/john/Desktop/models/research/object_detection/legacy/trainer.py", line 322, in train
    total_loss = tf.check_numerics(total_loss, 'LossTensor is inf or nan.')
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): LossTensor is inf or nan. : Tensor had NaN values [[node CheckNumerics (defined at /home/john/Desktop/models/research/object_detection/legacy/trainer.py:322) = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Traceback (most recent call last):
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values [[{{node CheckNumerics}} = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "legacy/train.py", line 185, in <module>
    tf.app.run()
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "legacy/train.py", line 180, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/home/john/Desktop/models/research/object_detection/legacy/trainer.py", line 416, in train
    saver=saver)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 770, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/john/anaconda3/envs/tfp35-gpu/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: LossTensor is inf or nan. : Tensor had NaN values [[node CheckNumerics (defined at /home/john/Desktop/models/research/object_detection/legacy/trainer.py:322) = CheckNumericsT=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Rajamohanreddyai commented 5 years ago

Hi, thank you in advance. Could you please help me resolve the above error? Thanks once again.

Rajamohanreddyai commented 5 years ago

Model: faster_rcnn_inception_v2_pets.config
Graphics card: Titan XP
NVIDIA driver version: 384.130
TensorFlow version: 1.12.0
CUDA: 9.0
cuDNN: 7.3.0
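
In the log posted in this thread, the loss jumps to roughly 3.58e32 at step 124370, two steps before the NaN is reported. A small sketch for scanning a training log for such spikes follows. It assumes log lines of the form `global step N: loss = X`, as in the excerpt. The helper name `find_loss_spikes` and the threshold of 1e3 are arbitrary illustrative choices.

```python
import math
import re

# Matches "global step <N>: loss = <value>" as printed in the log excerpt.
STEP_RE = re.compile(r"global step (\d+): loss = ([0-9.eE+-]+|nan|inf)")


def find_loss_spikes(log_text, threshold=1e3):
    """Return sorted (step, loss) pairs whose loss is non-finite or above threshold."""
    spikes = {}
    for step, loss in STEP_RE.findall(log_text):
        value = float(loss)
        if not math.isfinite(value) or value > threshold:
            spikes[int(step)] = value  # keyed by step, so duplicated log lines collapse
    return sorted(spikes.items())


log = (
    "INFO:tensorflow:global step 124369: loss = 0.0992 (0.130 sec/step) "
    "INFO:tensorflow:global step 124370: loss = 358068508261867004967067454537728.0000 (0.131 sec/step) "
    "INFO:tensorflow:global step 124370: loss = 358068508261867004967067454537728.0000 (0.131 sec/step) "
    "INFO:tensorflow:global step 124371: loss = 0.0627 (0.133 sec/step)"
)
print(find_loss_spikes(log))
```

Knowing the step at which the loss first blows up narrows down which input batch to inspect, which fits the earlier advice in this thread to check the data.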