ybracke opened this issue 1 year ago
- `gradient_accumulation_steps`: "If we wanted to train with a batch size of 64 we should not use per_device_train_batch_size=1 and gradient_accumulation_steps=64 but instead per_device_train_batch_size=4 and gradient_accumulation_steps=16 which has the same effective batch size while making better use of the available GPU resources. [...] If the desired batch size fits into memory then there is no reason to apply gradient accumulation which will only slow down training." (See the arithmetic sketch after this list.)
- `gradient_checkpointing`: only a subset of the activations computed during the forward pass is kept for the backward pass; the rest are recomputed when needed -> less memory usage. It slows down training by roughly 20%.
- `optim="adafactor"`: can lead to massive memory savings (3x in the HF example). "One downside of Adafactor is that in some instances convergence can be slower than Adam’s so some experimentation is advised here."
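For reference, a minimal sketch of the effective-batch-size arithmetic quoted above; the variable names and the single-GPU assumption are illustrative, not part of the `transformers` API:

```python
# Effective batch size = per-device batch size * accumulation steps * number of devices.
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_devices = 1  # single GPU assumed here; adjust for multi-GPU setups

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 64 -- same effective batch size as 1 * 64, with better GPU utilization
```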
See this post
Benchmarks for GPU A100
Update the following arguments to `Seq2SeqTrainingArguments()` for memory- or speed-efficient training:
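A minimal sketch of what that could look like, assuming a generic seq2seq fine-tuning setup; the output directory, batch size, and the `predict_with_generate` flag are illustrative placeholders, not values taken from this project:

```python
from transformers import Seq2SeqTrainingArguments

# Only the memory-related settings discussed above are the point here;
# everything else is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",        # placeholder output directory
    per_device_train_batch_size=4,     # largest per-device batch that fits in memory (assumed)
    gradient_accumulation_steps=16,    # 4 * 16 = effective batch size of 64
    gradient_checkpointing=True,       # recompute most activations: ~20% slower, less memory
    optim="adafactor",                 # memory-saving optimizer; convergence may be slower than Adam
    predict_with_generate=True,        # typical for seq2seq evaluation (assumption)
)
```

The resulting `training_args` can then be passed to `Seq2SeqTrainer` as usual.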