I set gradient_accumulation_steps to 1 and the batch size to 1, and I use LoRA to reduce the number of trainable parameters to 3,276,800. However, with two V100s (32GB each), I still can't run the experiment because of CUDA out-of-memory errors.
What other methods can reduce the GPU memory requirement?
Hi, I use 8 A100 GPUs with 80GB of memory each to fine-tune the model. For your case, I suggest enabling FP16 training and further reducing the number of LoRA trainable parameters.
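A minimal sketch of those two changes, assuming a Hugging Face `transformers` + `peft` training script; the model name, `target_modules`, and the exact LoRA rank below are placeholders to adapt to your own setup:

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder

# Shrink the LoRA adapter: a smaller rank (r) and fewer target modules
# directly reduce the trainable parameters and their optimizer-state memory.
lora_config = LoraConfig(
    r=4,                                   # lower than the rank that gave 3,276,800 params
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # example modules; adjust for your model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Enable FP16 mixed-precision training to roughly halve activation memory.
training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    fp16=True,
)
```

If your script does not use `TrainingArguments`, the equivalent is to run the forward/backward pass under `torch.cuda.amp.autocast()` with a `GradScaler`.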