UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters()

ys7yoo commented 3 years ago

python train.py --base_config korquad/bidaf

2021-04-24 16:41:17,090 (trainer.py:356): [INFO] - # Train Mode.
2021-04-24 16:41:17,519 (trainer.py:389): [INFO] -   Start - Batch Loss: 10.99170
/home/yyoo/torch/lib/python3.6/site-packages/torch/nn/modules/rnn.py:665: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at  /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:915.)
  self.num_layers, self.dropout, self.training, self.bidirectional)

ys7yoo commented 3 years ago

https://github.com/naver/claf/issues/26

ys7yoo commented 3 years ago

Cannot train bidaf. Training runs for 900 steps and then fails with a CUDA out-of-memory error:

2021-04-24 16:41:17,064 (experiment.py:327): [INFO] - use_gpu: True num_gpu: 1, distributed training: False, 16-bits training: False
2021-04-24 16:41:17,090 (trainer.py:356): [INFO] - # Train Mode.
2021-04-24 16:41:17,519 (trainer.py:389): [INFO] -   Start - Batch Loss: 10.99170
/home/yyoo/torch/lib/python3.6/site-packages/torch/nn/modules/rnn.py:665: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at  /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:915.)
  self.num_layers, self.dropout, self.training, self.bidirectional)
2021-04-24 16:42:19,708 (trainer.py:398): [INFO] -   Step: 100 Batch Loss: 9.49553  62.61820 sec
2021-04-24 16:43:18,987 (trainer.py:398): [INFO] -   Step: 200 Batch Loss: 9.00124  59.27854 sec
2021-04-24 16:44:18,107 (trainer.py:398): [INFO] -   Step: 300 Batch Loss: 8.03781  59.11870 sec
2021-04-24 16:45:20,132 (trainer.py:398): [INFO] -   Step: 400 Batch Loss: 8.11965  62.02481 sec
2021-04-24 16:46:20,668 (trainer.py:398): [INFO] -   Step: 500 Batch Loss: 7.88085  60.53543 sec
2021-04-24 16:47:21,651 (trainer.py:398): [INFO] -   Step: 600 Batch Loss: 7.74323  60.98295 sec
2021-04-24 16:48:22,440 (trainer.py:398): [INFO] -   Step: 700 Batch Loss: 7.67279  60.78864 sec
2021-04-24 16:49:23,494 (trainer.py:398): [INFO] -   Step: 800 Batch Loss: 7.32795  61.05298 sec
2021-04-24 16:50:26,077 (trainer.py:398): [INFO] -   Step: 900 Batch Loss: 7.36445  62.58257 sec
Traceback (most recent call last):
  File "train.py", line 10, in <module>
    experiment()
  File "/home/yyoo/src/claf/claf/learn/experiment.py", line 142, in __call__
    self.trainer.train_and_evaluate(train_loader, valid_loader, optimizer)
  File "/home/yyoo/src/claf/claf/learn/trainer.py", line 137, in train_and_evaluate
    eval_and_save_step_count=self.eval_and_save_step_count,
  File "/home/yyoo/src/claf/claf/learn/trainer.py", line 376, in _run_epoch
    output_dict = self.model(**inputs)
  File "/home/yyoo/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yyoo/src/claf/claf/model/reading_comprehension/bidaf.py", line 183, in forward
    context_encoded, context_mask, query_encoded, query_mask
  File "/home/yyoo/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/yyoo/src/claf/claf/modules/attention/bi_attention.py", line 29, in forward
    S = self._make_similiarity_matrix(c, q)  # (B, C_L, Q_L)
  File "/home/yyoo/src/claf/claf/modules/attention/bi_attention.py", line 50, in _make_similiarity_matrix
    concated_vector = torch.cat((c_aug, q_aug, c_q), dim=3)  # [h; u; h◦u]
RuntimeError: CUDA out of memory. Tried to allocate 4.54 GiB (GPU 0; 11.78 GiB total capacity; 2.81 GiB already allocated; 4.24 GiB free; 6.12 GiB reserved in total by PyTorch)
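
A rough back-of-the-envelope check of why the `torch.cat` in the traceback can ask for several GiB at once. This is only a sketch: it assumes the concatenated tensor has shape (B, C_L, Q_L, 3d) in float32, which is what the `[h; u; h◦u]` comment suggests but is not confirmed here. The function names and the shapes below are hypothetical, not values taken from the korquad/bidaf config.

```python
def concat_bytes(batch, c_len, q_len, dim, bytes_per_elem=4):
    """Bytes needed for a dense (batch, c_len, q_len, 3 * dim) tensor.

    Assumes the three concatenated parts each have last dimension `dim`
    and that elements are float32 (4 bytes) unless told otherwise.
    """
    return batch * c_len * q_len * 3 * dim * bytes_per_elem


def to_gib(num_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical shapes, chosen only to show how the size scales.
small = concat_bytes(batch=32, c_len=400, q_len=30, dim=200)
large = concat_bytes(batch=32, c_len=800, q_len=30, dim=200)

print(f"C_L=400: {to_gib(small):.2f} GiB")
print(f"C_L=800: {to_gib(large):.2f} GiB")
```

Under these assumptions the allocation grows linearly with batch size, context length, and query length, so a single batch with unusually long contexts could be enough to trigger the failure even after hundreds of successful steps.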

ys7yoo commented 3 years ago

https://discuss.pytorch.org/t/rnn-module-weights-are-not-part-of-single-contiguous-chunk-of-memory/6011/14
https://discuss.pytorch.org/t/why-and-how-to-flatten-lstm-parameters/53799
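
A minimal sketch of what the warning itself suggests, namely calling `flatten_parameters()` on the RNN before it is used in `forward`. The `Encoder` class and its sizes are made up for illustration and are not the claf code. Whether this has any effect on the out-of-memory error above is an open question; it should only silence the warning.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Hypothetical bidirectional LSTM encoder, for illustration only."""

    def __init__(self, input_size=8, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size, hidden_size, batch_first=True, bidirectional=True
        )

    def forward(self, x):
        # Ask PyTorch to re-pack the LSTM weights into one contiguous
        # buffer before the call. On CPU this returns without doing anything.
        self.lstm.flatten_parameters()
        output, _ = self.lstm(x)
        return output


encoder = Encoder()
x = torch.randn(4, 10, 8)  # (batch, seq_len, input_size)
out = encoder(x)
print(out.shape)
```

Calling it inside `forward` is a simple choice that also covers cases where the module has been moved or copied after construction.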
