zhiguowang / BiMPM

BiMPM: Bilateral Multi-Perspective Matching for Natural Language Sentences
Apache License 2.0

Error while training with CuDNN arg set as True #57

Open TLfERLS opened 5 years ago

TLfERLS commented 5 years ago

I tried to train the model using the default configuration file for Quora, which sets use_cudnn=true. Running SentenceMatchTrainer.py fails with an unexpected error. The full output is as follows:

(tensorflowGPU) D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src>python SentenceMatchTrainer.py --config_path "../configs/quora.sample.config"
Loading the configuration from ../configs/quora.sample.config

{'train_path': '../data/quora/train.tsv', 'dev_path': '../data/quora/dev.tsv', 
'word_vec_path': '../data/quora/wordvec.txt', 'model_dir': 'quora_model', 'suffix': 'quora', 'fix_word_vec': True, 'isLower': True, 'max_sent_length': 50, 'max_char_per_word': 10, 
'with_char': True, 'char_emb_dim': 20, 'char_lstm_dim': 40, 'batch_size': 60, 'max_epochs': 20, 'dropout_rate': 0.1, 'learning_rate': 0.0005, 'optimize_type': 'adam', 'lambda_l2': 0.0,
 'grad_clipper': 10.0, 'context_layer_num': 1, 'context_lstm_dim': 100,
 'aggregation_layer_num': 1, 'aggregation_lstm_dim': 100, 'with_full_match': True, 'with_maxpool_match': False, 'with_max_attentive_match': False, 'with_attentive_match': True, 
'with_cosine': True, 'with_mp_cosine': True, 'cosine_MP_dim': 5, 'att_dim': 50, 'att_type': 'symmetric', 'highway_layer_num': 1, 
'with_highway': True, 'with_match_highway': True, 
'with_aggregation_highway': True, 'use_cudnn': True, 'with_moving_average': False}

Collecting words, chars and labels ...
Number of words: 104891
Number of chars: 1198
word_vocab shape is (106686, 300)
Number of labels: 2
Build SentenceMatchDataStream ...
Number of instances in trainDataStream: 384348
Number of batches in trainDataStream: 6406
Number of instances in devDataStream: 10000
Number of batches in devDataStream: 167
2019-05-30 00:41:22.120164: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-05-30 00:41:23.282409: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.48
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-05-30 00:41:23.289931: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2019-05-30 00:41:25.325066: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-30 00:41:25.329970: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929]      0
2019-05-30 00:41:25.332505: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0:   N
2019-05-30 00:41:25.337204: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4740 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
    return fn(*args)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1305, in _run_fn
    self._extend_graph()
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:

Colocation members and user-requested devices:
  Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
  Model/global_norm/L2Loss_38 (L2Loss)

         [[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "SentenceMatchTrainer.py", line 257, in <module>
    main(FLAGS)
  File "SentenceMatchTrainer.py", line 191, in main
    sess.run(initializer)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
    run_metadata_ptr)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
    run_metadata)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:

Colocation members and user-requested devices:
  Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
  Model/global_norm/L2Loss_38 (L2Loss)

         [[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]

Caused by op 'Model/global_norm/L2Loss_38', defined at:
  File "SentenceMatchTrainer.py", line 257, in <module>
    main(FLAGS)
  File "SentenceMatchTrainer.py", line 175, in main
    is_training=True, options=FLAGS, global_step=global_step)
  File "D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src\SentenceMatchModelGraph.py", line 10, in __init__
    self.create_model_graph(num_classes, word_vocab, char_vocab, is_training, global_step=global_step)
  File "D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src\SentenceMatchModelGraph.py", line 175, in create_model_graph
    grads, _ = tf.clip_by_global_norm(grads, self.options.grad_clipper)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 240, in clip_by_global_norm
    use_norm = global_norm(t_list, name)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 179, in global_norm
    half_squared_norms.append(gen_nn_ops.l2_loss(v))
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4679, in l2_loss
    "L2Loss", t=t, name=name)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
    op_def=op_def)
  File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:

Colocation members and user-requested devices:
  Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
  Model/global_norm/L2Loss_38 (L2Loss)

         [[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]

When I set use_cudnn to false, training starts without any problems, and it still uses the GPU. I understand from the code that use_cudnn=true enables CudnnLSTM, so the issue may be caused by the OS or the TensorFlow version. The details of the environment are:

OS: Windows 10
Python: 3.6.8
tensorflow-gpu version: 1.8
GPU: GTX 1060 6 GB
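
For anyone who wants to reproduce the workaround of turning use_cudnn off, here is a minimal sketch. It assumes the config file is plain JSON with a top-level use_cudnn key, which I have not verified for every config in the repo. The helper name disable_cudnn is hypothetical and not part of BiMPM.

```python
import json


def disable_cudnn(config_text):
    """Return a copy of a JSON config string with use_cudnn set to False.

    Hypothetical helper for illustration; assumes the config is plain JSON.
    """
    options = json.loads(config_text)
    options["use_cudnn"] = False
    return json.dumps(options, indent=4)


# A cut-down stand-in for the real config, using keys from the log above.
sample = '{"use_cudnn": true, "batch_size": 60, "context_lstm_dim": 100}'
patched = disable_cudnn(sample)
print(patched)
```

The patched text can be saved to a new file and passed via --config_path, so the original sample config stays unchanged. This only avoids the CudnnLSTM path; it does not explain the colocation error itself.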

Can you tell where the problem lies? In the meantime, I'll try running the program with the default configs on an Ubuntu machine and see the results. Thanks!