yaserkl / RLSeq2Seq

Deep Reinforcement Learning For Sequence to Sequence Models
https://arxiv.org/abs/1805.09461
MIT License

Facing an issue while training for NMT? #24

Closed yashkumaratri closed 5 years ago

yashkumaratri commented 5 years ago

I am training an NMT model, and whenever I run this command:

CUDA_VISIBLE_DEVICES=0 python src/run_summarization.py --mode=train --data_path=~/finished_files/chunked/train_* --vocab_path=~/vocab --log_root=/home/ --exp_name=intradecoder-temporalattention-withpretraining --batch_size=80 --max_iter=20000 --use_temporal_attention=True --intradecoder=True --rl_training=False

I get

INFO:tensorflow:-------------------------------------------
INFO:tensorflow:seconds for training step 137: 0.715314865112
INFO:tensorflow:pgen_loss: 1.26063912376e-06
INFO:tensorflow:-------------------------------------------
INFO:tensorflow:Saving checkpoint to path /home/intradecoder-temporalattention-withpretraining/train/model.ckpt
INFO:tensorflow:global_step/sec: 1.08334
2019-01-02 02:23:08.042729: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x10223e38000 = {1, 0} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "src/run_summarization.py", line 795, in <module>
    tf.app.run()
  File "/home//.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "src/run_summarization.py", line 792, in main
    seq2seq.main(unused_argv)
  File "src/run_summarization.py", line 745, in main
    self.setup_training()
  File "src/run_summarization.py", line 342, in setup_training
    self.run_training() # this is an infinite loop until interrupted
  File "src/run_summarization.py", line 431, in run_training
    results = self.model.run_train_steps(self.sess, batch, self.train_step)
  File "/home/RL/RLSeq2Seq/src/model.py", line 723, in run_train_steps
    return sess.run(to_return, feed_dict)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: shared_loss/summaries_pgen_loss/histogram
         [[node shared_loss/summaries_pgen_loss/histogram (defined at /hom/RL/RLSeq2Seq/src/model.py:66)  = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](shared_loss/summaries_pgen_loss/histogram/tag, shared_loss/Mean/_3277)]]

Caused by op u'shared_loss/summaries_pgen_loss/histogram', defined at:
  File "src/run_summarization.py", line 795, in <module>
    tf.app.run()
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "src/run_summarization.py", line 792, in main
    seq2seq.main(unused_argv)
  File "src/run_summarization.py", line 745, in main
    self.setup_training()
  File "src/run_summarization.py", line 273, in setup_training
    self.model.build_graph() # build the graph
  File "/home/RL/RLSeq2Seq/src/model.py", line 468, in build_graph
    self._add_shared_loss_op()
  File "/home/RL/RLSeq2Seq/src/model.py", line 316, in _add_shared_loss_op
    self.variable_summaries('pgen_loss', self._pgen_loss)
  File "/home/RL/RLSeq2Seq/src/model.py", line 66, in variable_summaries
    tf.summary.histogram('histogram', var)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/summary/summary.py", line 187, in histogram
    tag=tag, values=values, name=scope)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 284, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: shared_loss/summaries_pgen_loss/histogram
         [[node shared_loss/summaries_pgen_loss/histogram (defined at /home/RL/RLSeq2Seq/src/model.py:66)  = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](shared_loss/summaries_pgen_loss/histogram/tag, shared_loss/Mean/_3277)]]

Any idea how to solve this? I've already tweaked the learning rate and the batch size. Are there any other variables I should play with?
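To illustrate what the "Found Inf or NaN global norm" line in the log is reporting, here is a minimal sketch in plain Python 3 (not TensorFlow, and not the code in this repository). It assumes the global norm is the square root of the sum of squared gradient entries; `global_norm` and `check_finite` are hypothetical helper names made up for this example. It only shows how a single non-finite gradient entry makes the whole norm non-finite. It does not diagnose why the loss became NaN in this run.

```python
import math


def global_norm(gradients):
    """Square root of the sum of squares over every gradient entry."""
    return math.sqrt(sum(g * g for grad in gradients for g in grad))


def check_finite(name, value):
    """Raise early, with a readable message, if value is NaN or Inf."""
    if not math.isfinite(value):
        raise ValueError("non-finite value in %s: %r" % (name, value))
    return value


# Hypothetical gradients: one list of floats per variable.
healthy = [[0.1, -0.2], [0.3]]
broken = [[0.1, float("nan")], [0.3]]

# A finite norm passes the check unchanged.
print(check_finite("global_norm", global_norm(healthy)))

# One NaN entry is enough to turn the whole norm into NaN.
print(math.isnan(global_norm(broken)))
```

A check like this, placed on the loss before it is written to a summary, would fail at the first bad step with the name of the offending value, which may be easier to trace than the histogram error above.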