xssstory / SeqCo

Code for "Sequence Level Contrastive Learning for Text Summarization"

RuntimeError: CUDA out of memory #4

Closed. LieuMai closed this issue 2 years ago.

LieuMai commented 2 years ago

Describe the bug

When trying to train the model, I got this error. I have searched related issues but could not find the expected help.

Thank you in advance for any insights you can give.

Reproduction

  1. Command: `sh seqco_scripts/train_cnndm.sh`
  2. Error:
    
    (cross_attention): MultiheadAttention(
    (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
    )
    )
    | model backsum_transformer_bart_large, criterion LabelSmoothedCrossEntropyCriterion
    | num. model params: 630956032 (num. trained: 418883584)
    | training on 1 GPUs
    | max tokens per GPU = None and max sentences per GPU = 2
    | no existing checkpoint found /data/msra/sum_data//cnndm_bart/bart.large/model.pt
    | loading train data for epoch 0
    | loaded 15000 examples from: data/cnn_dm-bin/train.article-summary.article
    | loaded 15000 examples from: data/cnn_dm-bin/train.article-summary.summary
    | parallel-data/cnn_dm-bin train 15000 examples
    | loaded 15000 examples from: data/cnn_dm-bin/train.article-summary.article
    | backtranslate-article: data/cnn_dm-bin train 15000 examples
    | WARNING: your device does NOT support faster training with --fp16, please switch to FP32 which is likely to be faster
    Traceback (most recent call last):
      File "train.py", line 344, in <module>
        cli_main()
      File "train.py", line 340, in cli_main
        main(args)
      File "train.py", line 77, in main
        extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
      File "/home/lieumai/TextSum/SeqCo/fairseq/checkpoint_utils.py", line 143, in load_checkpoint
        trainer.lr_step(epoch_itr.epoch)
      File "/home/lieumai/TextSum/SeqCo/fairseq/trainer.py", line 600, in lr_step
        self.lr_scheduler.step(epoch, val_loss)
      File "/home/lieumai/TextSum/SeqCo/fairseq/trainer.py", line 121, in lr_scheduler
        self._build_optimizer()  # this will initialize self._lr_scheduler
      File "/home/lieumai/TextSum/SeqCo/fairseq/trainer.py", line 143, in _build_optimizer
        self._optimizer = optim.FP16Optimizer.build_optimizer(self.args, params)
      File "/home/lieumai/TextSum/SeqCo/fairseq/optim/fp16_optimizer.py", line 207, in build_optimizer
        fp32_params = cls.build_fp32_params(params)
      File "/home/lieumai/TextSum/SeqCo/fairseq/optim/fp16_optimizer.py", line 67, in build_fp32_params
        fp32_params = params[0].new(0).float().new(total_param_size)
    RuntimeError: CUDA out of memory. Tried to allocate 1.56 GiB (GPU 0; 1.96 GiB total capacity; 1.18 GiB already allocated; 298.94 MiB free; 1.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  3. Results of `nvidia-smi` for my GPU:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
    | N/A   46C    P8    N/A /  N/A |      4MiB /  2048MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1259      G   /usr/lib/xorg/Xorg                  4MiB |
    +-----------------------------------------------------------------------------+


  4. Results from torch (version 1.12.1+cu102):

    Is CUDA available? True
    How many GPUs? - 1
    What is the device name? - NVIDIA GeForce MX230
    Memory Usage:
    Allocated: 0.0 GB
    Cached:    0.0 GB
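
For reference, these diagnostics can be reproduced with standard `torch.cuda` calls; a minimal sketch, assuming device 0:

```python
import torch

# Basic CUDA visibility checks, matching the diagnostics above.
print("Is CUDA available?", torch.cuda.is_available())
print("How many GPUs? -", torch.cuda.device_count())
print("What is the device name? -", torch.cuda.get_device_name(0))

# Memory currently held by tensors vs. reserved by PyTorch's caching allocator.
print("Memory Usage:")
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1024**3:.1f} GB")
print(f"Cached:    {torch.cuda.memory_reserved(0) / 1024**3:.1f} GB")
```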

Python 3.8.10
![Screenshot from 2022-10-16 20-36-58](https://user-images.githubusercontent.com/56626332/196038484-3d0f1904-7c94-459b-94d0-595d425bfeee.png)
xssstory commented 2 years ago

Thanks for your interest in our project!

The reason for "CUDA out of memory" is that 2 GB of GPU memory is too small to train the model.

Our model is trained on 8 V100 (32 GB) GPUs. You could try a GPU with more memory (e.g., 16 GB or larger) and set `--max-sentences=1` to save memory.
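
A minimal sketch of how that might look; the actual entry point and remaining flags come from `seqco_scripts/train_cnndm.sh`, so only the memory-related pieces here are the point:

```sh
# Optional: the fragmentation hint from the error message. PYTORCH_CUDA_ALLOC_CONF
# is a standard PyTorch allocator setting, but it cannot help when the GPU itself
# is too small for the model.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Lower the per-GPU batch size; --max-sentences corresponds to the
# "max sentences per GPU" value shown in the training log above.
python train.py data/cnn_dm-bin --max-sentences 1   # plus the script's other flags
```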