INFO:tensorflow:-------------------------------------------
INFO:tensorflow:seconds for training step 137: 0.715314865112
INFO:tensorflow:pgen_loss: 1.26063912376e-06
INFO:tensorflow:-------------------------------------------
INFO:tensorflow:Saving checkpoint to path /home/intradecoder-temporalattention-withpretraining/train/model.ckpt
INFO:tensorflow:global_step/sec: 1.08334
2019-01-02 02:23:08.042729: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x10223e38000 = {1, 0} Found Inf or NaN global norm.
Traceback (most recent call last):
File "src/run_summarization.py", line 795, in <module>
tf.app.run()
File "/home//.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "src/run_summarization.py", line 792, in main
seq2seq.main(unused_argv)
File "src/run_summarization.py", line 745, in main
self.setup_training()
File "src/run_summarization.py", line 342, in setup_training
self.run_training() # this is an infinite loop until interrupted
File "src/run_summarization.py", line 431, in run_training
results = self.model.run_train_steps(self.sess, batch, self.train_step)
File "/home/RL/RLSeq2Seq/src/model.py", line 723, in run_train_steps
return sess.run(to_return, feed_dict)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: shared_loss/summaries_pgen_loss/histogram
[[node shared_loss/summaries_pgen_loss/histogram (defined at /home/RL/RLSeq2Seq/src/model.py:66) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](shared_loss/summaries_pgen_loss/histogram/tag, shared_loss/Mean/_3277)]]
Caused by op u'shared_loss/summaries_pgen_loss/histogram', defined at:
File "src/run_summarization.py", line 795, in <module>
tf.app.run()
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "src/run_summarization.py", line 792, in main
seq2seq.main(unused_argv)
File "src/run_summarization.py", line 745, in main
self.setup_training()
File "src/run_summarization.py", line 273, in setup_training
self.model.build_graph() # build the graph
File "/home/RL/RLSeq2Seq/src/model.py", line 468, in build_graph
self._add_shared_loss_op()
File "/home/RL/RLSeq2Seq/src/model.py", line 316, in _add_shared_loss_op
self.variable_summaries('pgen_loss', self._pgen_loss)
File "/home/RL/RLSeq2Seq/src/model.py", line 66, in variable_summaries
tf.summary.histogram('histogram', var)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/summary/summary.py", line 187, in histogram
tag=tag, values=values, name=scope)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 284, in histogram_summary
"HistogramSummary", tag=tag, values=values, name=name)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Nan in summary histogram for: shared_loss/summaries_pgen_loss/histogram
[[node shared_loss/summaries_pgen_loss/histogram (defined at /home/RL/RLSeq2Seq/src/model.py:66) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](shared_loss/summaries_pgen_loss/histogram/tag, shared_loss/Mean/_3277)]]
I am running NMT, and whenever I run this command:
CUDA_VISIBLE_DEVICES=0 python src/run_summarization.py --mode=train --data_path=~/finished_files/chunked/train_* --vocab_path=~/vocab --log_root=/home/ --exp_name=intradecoder-temporalattention-withpretraining --batch_size=80 --max_iter=20000 --use_temporal_attention=True --intradecoder=True --rl_training=False
I get the error shown above.
Any idea how to solve this? I've already tried tweaking the learning rate and the batch size. Is there any other variable I should play with?
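In case it helps narrow this down, here is a minimal sketch of the kind of check that could be used to find which value goes non-finite first, plus a clamped log. A log(0) inside the loss is one common source of NaN, but it is not confirmed to be the cause here. The helper names `first_non_finite` and `safe_log`, the `eps` value, and the sample numbers are all hypothetical and are not part of RLSeq2Seq or TensorFlow.

```python
import math


def safe_log(p, eps=1e-10):
    """Log of a probability, clamped to [eps, 1.0] so log(0) cannot occur.

    eps is an assumed value chosen for illustration only.
    """
    return math.log(min(max(p, eps), 1.0))


def first_non_finite(named_values):
    """Return the name of the first NaN/Inf value, or None if all are finite."""
    for name, value in named_values:
        if math.isnan(value) or math.isinf(value):
            return name
    return None


# Hypothetical per-step values, NOT taken from the log above.
step_values = [
    ("coverage_loss", 0.25),
    ("pgen_loss", float("nan")),
    ("global_norm", float("inf")),
]

print(first_non_finite(step_values))  # name of the first offending value
print(safe_log(0.0))                  # finite, instead of raising on log(0)
```

Running a check like this on the fetched values before the summaries are written would at least show whether the loss itself or the gradient norm becomes NaN/Inf first.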