I have tried with both an A100 40GB and an A100 80GB; the results are the same.
Log:
[INFO|trainer.py:1516] 2023-07-19 15:14:59,214 >> Running training
[INFO|trainer.py:1517] 2023-07-19 15:14:59,214 >> Num examples = 1416768
[INFO|trainer.py:1518] 2023-07-19 15:14:59,214 >> Num Epochs = 10
[INFO|trainer.py:1519] 2023-07-19 15:14:59,214 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1520] 2023-07-19 15:14:59,214 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1521] 2023-07-19 15:14:59,214 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1522] 2023-07-19 15:14:59,214 >> Total optimization steps = 221370
File "/opt/conda/envs/instructor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/instructor/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py", line 534, in forward
attn_weights, p=self.dropout, training=self.training
File "/opt/conda/envs/instructor/lib/python3.7/site-packages/torch/nn/functional.py", line 1252, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 79.17 GiB total capacity; 77.91 GiB already allocated; 59.81 MiB free; 77.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
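If I read the hint at the end of the error correctly, the allocator setting would be passed as an environment variable before launching, something like the following (the 128 MiB split size is just a guess on my part, not a value I have verified):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128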
Train command: python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir output --cache_dir medi-data --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.01 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir --per_device_train_batch_size 16
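Another option I am considering is halving the per-device batch size and compensating with gradient accumulation, assuming train.py exposes the standard Hugging Face --gradient_accumulation_steps argument (the "Gradient Accumulation steps" line in the log above suggests it does). Roughly:

python train.py --model_name_or_path sentence-transformers/gtr-t5-large --output_dir output --cache_dir medi-data --max_source_length 512 --num_train_epochs 10 --save_steps 500 --cl_temperature 0.01 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir --per_device_train_batch_size 8 --gradient_accumulation_steps 2

I have not verified that this avoids the OOM; I am listing it only as what I plan to try next.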