I tried to start training the model using the default configuration file for Quora, which has use_cudnn=true. Running SentenceMatchTrainer.py fails with an unexpected error. The full output is as follows:
(tensorflowGPU) D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src>python SentenceMatchTrainer.py --config_path "../configs/quora.sample.config"
Loading the configuration from ../configs/quora.sample.config
{'train_path': '../data/quora/train.tsv', 'dev_path': '../data/quora/dev.tsv',
'word_vec_path': '../data/quora/wordvec.txt', 'model_dir': 'quora_model', 'suffix': 'quora', 'fix_word_vec': True, 'isLower': True, 'max_sent_length': 50, 'max_char_per_word': 10,
'with_char': True, 'char_emb_dim': 20, 'char_lstm_dim': 40, 'batch_size': 60, 'max_epochs': 20, 'dropout_rate': 0.1, 'learning_rate': 0.0005, 'optimize_type': 'adam', 'lambda_l2': 0.0,
'grad_clipper': 10.0, 'context_layer_num': 1, 'context_lstm_dim': 100,
'aggregation_layer_num': 1, 'aggregation_lstm_dim': 100, 'with_full_match': True, 'with_maxpool_match': False, 'with_max_attentive_match': False, 'with_attentive_match': True,
'with_cosine': True, 'with_mp_cosine': True, 'cosine_MP_dim': 5, 'att_dim': 50, 'att_type': 'symmetric', 'highway_layer_num': 1,
'with_highway': True, 'with_match_highway': True,
'with_aggregation_highway': True, 'use_cudnn': True, 'with_moving_average': False}
Collecting words, chars and labels ...
Number of words: 104891
Number of chars: 1198
word_vocab shape is (106686, 300)
Number of labels: 2
Build SentenceMatchDataStream ...
Number of instances in trainDataStream: 384348
Number of batches in trainDataStream: 6406
Number of instances in devDataStream: 10000
Number of batches in devDataStream: 167
2019-05-30 00:41:22.120164: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-05-30 00:41:23.282409: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.48
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-05-30 00:41:23.289931: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2019-05-30 00:41:25.325066: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-30 00:41:25.329970: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929] 0
2019-05-30 00:41:25.332505: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0: N
2019-05-30 00:41:25.337204: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4740 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
return fn(*args)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1305, in _run_fn
self._extend_graph()
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1340, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:
Colocation members and user-requested devices:
Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
Model/global_norm/L2Loss_38 (L2Loss)
[[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "SentenceMatchTrainer.py", line 257, in <module>
main(FLAGS)
File "SentenceMatchTrainer.py", line 191, in main
sess.run(initializer)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
run_metadata_ptr)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
run_metadata)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:
Colocation members and user-requested devices:
Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
Model/global_norm/L2Loss_38 (L2Loss)
[[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]
Caused by op 'Model/global_norm/L2Loss_38', defined at:
File "SentenceMatchTrainer.py", line 257, in <module>
main(FLAGS)
File "SentenceMatchTrainer.py", line 175, in main
is_training=True, options=FLAGS, global_step=global_step)
File "D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src\SentenceMatchModelGraph.py", line 10, in __init__
self.create_model_graph(num_classes, word_vocab, char_vocab, is_training, global_step=global_step)
File "D:\Back Up\Desktop\Setiment Analysis\synonym_paraphrase\BiMPM\src\SentenceMatchModelGraph.py", line 175, in create_model_graph
grads, _ = tf.clip_by_global_norm(grads, self.options.grad_clipper)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 240, in clip_by_global_norm
use_norm = global_norm(t_list, name)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\clip_ops.py", line 179, in global_norm
half_squared_norms.append(gen_nn_ops.l2_loss(v))
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 4679, in l2_loss
"L2Loss", t=t, name=name)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
op_def=op_def)
File "C:\Users\derp\AppData\Local\conda\conda\envs\tensorflowGPU\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Cannot colocate nodes 'Model/global_norm/L2Loss_38' and 'Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop' because no device type supports both of those nodes and the other nodes colocated with them.
Colocation Debug Info:
Colocation group had the following types and devices:
CudnnRNNBackprop: GPU
L2Loss:
Colocation members and user-requested devices:
Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop (CudnnRNNBackprop)
Model/global_norm/L2Loss_38 (L2Loss)
[[Node: Model/global_norm/L2Loss_38 = L2Loss[T=DT_FLOAT, _class=["loc:@Model...NNBackprop"]](Model/gradients/Model/aggregation_layer/right_layer-0/right_layer-0_cudnn_bi_lstm/CudnnRNN_grad/CudnnRNNBackprop:3)]]
When I set use_cudnn=false, training starts without any problems, and it still uses the GPU. I understand from the code that use_cudnn=true enables the CudnnLSTM, so maybe the issue is caused by the OS or the TensorFlow version. The details of the environment are:
OS: Windows 10
Python: 3.6.8
TensorFlow (GPU) version: 1.8
GPU: GTX 1060 6 GB
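For anyone who wants to reproduce the workaround mentioned above (turning off use_cudnn), here is a minimal sketch. It assumes the config file is plain JSON, which matches the dictionary printed in the log but is not confirmed here. The helper name disable_cudnn and the output file name quora.nocudnn.config are hypothetical, and the demo runs on a throwaway file rather than the real config.

```python
import json
import os
import tempfile


def disable_cudnn(src_path, dst_path):
    """Write a copy of a JSON config with use_cudnn forced to False.

    Hypothetical helper, not part of the BiMPM code base.
    """
    with open(src_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    config["use_cudnn"] = False  # skip the CudnnLSTM path
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return config


# Demo on a throwaway file so the sketch runs without the BiMPM repo.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "quora.sample.config")
    dst = os.path.join(tmp, "quora.nocudnn.config")  # hypothetical name
    with open(src, "w", encoding="utf-8") as f:
        json.dump({"batch_size": 60, "use_cudnn": True}, f)
    updated = disable_cudnn(src, dst)
    with open(dst, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
    print(reloaded["use_cudnn"])
```

The copied config can then be passed to SentenceMatchTrainer.py through --config_path in the same way as the original file. This only avoids the CudnnLSTM code path; it does not explain why the colocation error occurs.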
Can you tell where the problem lies? In the meantime, I'll try running the program with the default config on an Ubuntu machine and see the results. Thanks!