ypeleg / llama

User-friendly LLaMA: Train or Run the model using PyTorch. Nothing else.

OOM with 80GB-A100 #8

Open kriskrisliu opened 1 year ago

kriskrisliu commented 1 year ago

Training runs out of memory (OOM) even on an 80 GB GPU. Could you please give some advice?

***** Running training *****
  Num examples = 1799
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 1799
  Number of trainable parameters = 6738423808
  0%|          | 0/1799 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "training_example.py", line 45, in <module>
    Trainer(model = model,
  File "/data/anaconda3/envs/llama/lib/python3.8/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/data/anaconda3/envs/llama/lib/python3.8/site-packages/transformers/trainer.py", line 1858, in _inner_training_loop
    self.optimizer.step()
  File "/data/anaconda3/envs/llama/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/data/anaconda3/envs/llama/lib/python3.8/site-packages/torch/optim/optimizer.py", line 113, in wrapper
    return func(*args, **kwargs)
  File "/data/anaconda3/envs/llama/lib/python3.8/site-packages/transformers/optimization.py", line 362, in step
    denom = exp_avg_sq.sqrt().add_(group["eps"])
RuntimeError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 79.18 GiB total capacity; 76.21 GiB already allocated; 162.38 MiB free; 77.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|          | 0/1799 [00:01<?, ?it/s]
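
As a rough sanity check (not part of the original report): full fine-tuning with a plain fp32 AdamW optimizer needs on the order of 16 bytes per parameter (weights + gradients + two optimizer moments) before any activations, which for the ~6.7B trainable parameters shown in the log already exceeds 80 GB. The snippet below is only a back-of-the-envelope estimate, not code from this repository:

```python
# Rough memory estimate for full fine-tuning with fp32 AdamW.
# Ignores activations, CUDA context, and allocator fragmentation.

n_params = 6_738_423_808            # "Number of trainable parameters" from the log above

bytes_weights   = n_params * 4      # fp32 weights
bytes_grads     = n_params * 4      # fp32 gradients
bytes_optimizer = n_params * 8      # AdamW exp_avg + exp_avg_sq, fp32 each

total_gib = (bytes_weights + bytes_grads + bytes_optimizer) / 2**30
print(f"~{total_gib:.0f} GiB before activations")   # roughly 100 GiB, i.e. more than 80 GB
```

This matches the traceback above, which fails inside `optimizer.step()` while allocating the AdamW second-moment buffers. Typical workarounds would be gradient checkpointing, an 8-bit optimizer, mixed precision, or parameter-efficient fine-tuning such as LoRA, but whether `training_example.py` exposes such options is not shown in this issue.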