ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Efficient training #65

Open ybracke opened 1 year ago

ybracke commented 1 year ago

See this post

Benchmarks for GPU A100

Set the following arguments of Seq2SeqTrainingArguments() for memory- or speed-efficient training:

ybracke commented 1 year ago

gradient_accumulation_steps: "If we wanted to train with a batch size of 64 we should not use per_device_train_batch_size=1 and gradient_accumulation_steps=64 but instead per_device_train_batch_size=4 and gradient_accumulation_steps=16 which has the same effective batch size while making better use of the available GPU resources. [...] If the desired batch size fits into memory then there is no reason to apply gradient accumulation which will only slow down training."

memory
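
A minimal sketch of how this could look in Seq2SeqTrainingArguments; the output_dir and batch-size values are illustrative assumptions, not the project's actual settings:

```python
from transformers import Seq2SeqTrainingArguments

# Effective batch size of 64, split so the per-device batch still fits in GPU
# memory: 4 (per device) * 16 (accumulation steps) = 64.
training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",        # placeholder path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
)
```

If a per-device batch size of 64 fits into memory directly, gradient accumulation can simply be left at its default of 1.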

ybracke commented 1 year ago

gradient_checkpointing: only strategically selected activations from the forward pass are saved for the backward pass; the rest are recomputed on the fly -> less memory usage. It slows down training by about 20%.

memory
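
A hedged sketch of enabling this via the training arguments; output_dir is a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",    # placeholder path
    gradient_checkpointing=True,   # recompute most activations in the backward pass to save memory
)
```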

ybracke commented 1 year ago

optim="adafactor": Can lead to massive memory savings (3x in hf example). "One downside of Adafactor is that in some instances convergence can be slower than Adam’s so some experimentation is advised here."

memory
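
A minimal sketch of switching the optimizer via the training arguments; output_dir is again just a placeholder:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",   # placeholder path
    optim="adafactor",            # use Adafactor instead of the default AdamW to cut optimizer memory
)
```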

ybracke commented 1 year ago