Open ys7yoo opened 3 years ago
Cannot train bidaf
2021-04-24 16:41:17,064 (experiment.py:327): [INFO] - use_gpu: True num_gpu: 1, distributed training: False, 16-bits training: False
2021-04-24 16:41:17,090 (trainer.py:356): [INFO] - # Train Mode.
2021-04-24 16:41:17,519 (trainer.py:389): [INFO] - Start - Batch Loss: 10.99170
/home/yyoo/torch/lib/python3.6/site-packages/torch/nn/modules/rnn.py:665: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:915.)
self.num_layers, self.dropout, self.training, self.bidirectional)
2021-04-24 16:42:19,708 (trainer.py:398): [INFO] - Step: 100 Batch Loss: 9.49553 62.61820 sec
2021-04-24 16:43:18,987 (trainer.py:398): [INFO] - Step: 200 Batch Loss: 9.00124 59.27854 sec
2021-04-24 16:44:18,107 (trainer.py:398): [INFO] - Step: 300 Batch Loss: 8.03781 59.11870 sec
2021-04-24 16:45:20,132 (trainer.py:398): [INFO] - Step: 400 Batch Loss: 8.11965 62.02481 sec
2021-04-24 16:46:20,668 (trainer.py:398): [INFO] - Step: 500 Batch Loss: 7.88085 60.53543 sec
2021-04-24 16:47:21,651 (trainer.py:398): [INFO] - Step: 600 Batch Loss: 7.74323 60.98295 sec
2021-04-24 16:48:22,440 (trainer.py:398): [INFO] - Step: 700 Batch Loss: 7.67279 60.78864 sec
2021-04-24 16:49:23,494 (trainer.py:398): [INFO] - Step: 800 Batch Loss: 7.32795 61.05298 sec
2021-04-24 16:50:26,077 (trainer.py:398): [INFO] - Step: 900 Batch Loss: 7.36445 62.58257 sec
Traceback (most recent call last):
  File "train.py", line 10, in <module>
    experiment()
  File "/home/yyoo/src/claf/claf/learn/experiment.py", line 142, in __call__
    self.trainer.train_and_evaluate(train_loader, valid_loader, optimizer)
  File "/home/yyoo/src/claf/claf/learn/trainer.py", line 137, in train_and_evaluate
    eval_and_save_step_count=self.eval_and_save_step_count,
  File "/home/yyoo/src/claf/claf/learn/trainer.py", line 376, in _run_epoch
    output_dict = self.model(**inputs)
  File "/home/yyoo/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yyoo/src/claf/claf/model/reading_comprehension/bidaf.py", line 183, in forward
    context_encoded, context_mask, query_encoded, query_mask
  File "/home/yyoo/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yyoo/src/claf/claf/modules/attention/bi_attention.py", line 29, in forward
    S = self._make_similiarity_matrix(c, q) # (B, C_L, Q_L)
  File "/home/yyoo/src/claf/claf/modules/attention/bi_attention.py", line 50, in _make_similiarity_matrix
    concated_vector = torch.cat((c_aug, q_aug, c_q), dim=3) # [h; u; h◦u]
RuntimeError: CUDA out of memory. Tried to allocate 4.54 GiB (GPU 0; 11.78 GiB total capacity; 2.81 GiB already allocated; 4.24 GiB free; 6.12 GiB reserved in total by PyTorch)
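
The failing allocation is the `torch.cat` of `[h; u; h◦u]`. Going by the shape comments in the traceback, this materializes a dense tensor of roughly shape `(B, C_L, Q_L, 3*D)` before it is reduced to the `(B, C_L, Q_L)` similarity matrix. A rough back-of-the-envelope sketch of how large that intermediate gets is below. The values of `B`, `C_L`, `Q_L`, and `D` are hypothetical placeholders rather than values from this run, float32 is assumed, and the helper names `tensor_bytes` and `to_gib` are made up for illustration.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def to_gib(num_bytes):
    """Convert a byte count to GiB, the unit used in the CUDA error message."""
    return num_bytes / 2**30


# Hypothetical sizes (assumptions, NOT read from the log above):
# B = batch size, C_L = context length, Q_L = query length, D = encoded hidden size.
B, C_L, Q_L, D = 32, 800, 60, 200

# Intermediate built by torch.cat((c_aug, q_aug, c_q), dim=3): (B, C_L, Q_L, 3*D)
concat_bytes = tensor_bytes((B, C_L, Q_L, 3 * D))

# Final similarity matrix S: (B, C_L, Q_L)
similarity_bytes = tensor_bytes((B, C_L, Q_L))

print(f"concatenated [h; u; h◦u]: {to_gib(concat_bytes):.2f} GiB")
print(f"similarity matrix S:      {to_gib(similarity_bytes):.4f} GiB")
```

The intermediate is `3*D` times larger than `S` itself and grows linearly with batch size and context length, so a single batch containing a long context may be enough to trigger the error even after 900 successful steps. Under these assumptions, lowering the batch size or capping the context length should shrink the allocation proportionally. Another option might be to compute the similarity without building the concatenated tensor at all.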