Crash while training - Githubissues

shouyinz commented 5 years ago

Hi,

I'm trying to train my own model using your implementation, but training fails with a gradient error: the gradients contain NaN or Inf values.

Below is the log from the point where training crashes:

2018-10-31 02:12:13.561786: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7fa89e206200 = {1, 0} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "TrainingModel.py", line 112, in <module>
    model.label_holder: label_flip})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
     [[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]

Caused by op u'VerifyFinite/CheckNumerics', defined at:
  File "TrainingModel.py", line 42, in <module>
    grads, _ = tf.clip_by_global_norm(tf.gradients(model.Loss_Mean, tvars), max_grad_norm)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/clip_ops.py", line 259, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/numerics.py", line 45, in verify_tensor_all_finite
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
     [[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]

Could you comment on this? I would also like to know the TensorFlow version, and any other details of the training environment, used for the pre-built model.

Thanks.
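
For readers following the traceback: the error is raised by the finite check inside `tf.clip_by_global_norm`, which runs on the global norm of all gradients. Because that norm is the square root of the sum of squared gradient entries, a single NaN or Inf in any one gradient makes the whole norm non-finite. The sketch below illustrates this in plain Python 3 (no TensorFlow) and shows one way to find which gradient is responsible. The variable names and gradient values are made up for illustration and are not taken from the NLDF code.

```python
import math


def global_norm(grads):
    """Global L2 norm over a list of gradients (each a flat list of floats)."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def non_finite_gradients(named_grads):
    """Return the names of gradients that contain at least one NaN or Inf."""
    return [name for name, grad in named_grads
            if any(not math.isfinite(g) for g in grad)]


# Hypothetical gradients: one entry of one gradient is NaN.
named_grads = [
    ("conv1/weights", [0.1, -0.2, 0.05]),
    ("conv2/weights", [0.3, float("nan")]),
    ("fc/bias", [0.01]),
]

norm = global_norm([grad for _, grad in named_grads])
bad = non_finite_gradients(named_grads)

print("global norm is finite:", math.isfinite(norm))
print("gradients with NaN/Inf:", bad)
```

In a TensorFlow 1.x graph, a comparable check could be done by wrapping each gradient in `tf.check_numerics` with a message naming the variable before the gradients are passed to the clipping op. This only narrows down where the NaN first appears; it does not by itself explain the cause.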

zhimingluo commented 5 years ago

I need to check what causes this NaN issue and will reply later.

james12695 commented 5 years ago

Also, are there any constraints on the training dataset when training our own model? For example, must the target mask area be larger than a certain proportion of the label image?
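
The thread does not say whether such a constraint exists. For anyone who wants to measure this on their own dataset, here is a minimal sketch in plain Python 3 that computes the foreground proportion of each mask and flags the very small ones. The function names, the 0.5 binarization threshold, and the 1% cutoff are all assumptions chosen for illustration, not values from the NLDF code.

```python
def foreground_fraction(mask, threshold=0.5):
    """Fraction of pixels above `threshold` in a 2D mask (list of rows, values in [0, 1])."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask has no pixels")
    foreground = sum(1 for row in mask for value in row if value > threshold)
    return foreground / total


def flag_small_masks(masks, min_fraction=0.01):
    """Return the names of masks whose foreground fraction is below `min_fraction`."""
    return [name for name, mask in masks.items()
            if foreground_fraction(mask) < min_fraction]


# Hypothetical masks: one with a quarter of its pixels set, one entirely empty.
masks = {
    "img_001": [[0, 1, 1, 0],
                [0, 0, 0, 0]],
    "img_002": [[0, 0, 0, 0],
                [0, 0, 0, 0]],
}

print(foreground_fraction(masks["img_001"]))
print(flag_small_masks(masks))
```

Listing the flagged images makes it possible to test whether removing them changes the training behavior, which is an experiment rather than a known fix.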

asgq123 commented 5 years ago

I have run into this problem too. I am training the model on the MSRA10K dataset, but the loss becomes NaN in the third epoch. Have you solved this problem?

zhimingluo / NLDF

Crash while training #3