Hi,

I am unsure I understand the logic behind accumulate_gradient_steps.
I have these 3 configurations:
batch_size=1, accumulate_gradient_steps=1 -> blue
batch_size=2, accumulate_gradient_steps=1 -> red
batch_size=2, accumulate_gradient_steps=2 -> green
My initial understanding is that when doing gradient accumulation, accumulate_gradient_steps forward + backward passes are performed and then the optimizer takes a single step.
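To make that expectation concrete, here is a minimal JAX/optax sketch of the pattern I have in mind (purely illustrative; params, loss_fn, and the toy data are hypothetical names, not taken from llama_train.py):

```python
import jax
import jax.numpy as jnp
import optax

# Hypothetical illustration of my expectation, NOT the actual llama_train.py code:
# run `accumulate_gradient_steps` forward+backward passes, sum the gradients,
# then take a single optimizer step on the averaged gradient.
accumulate_gradient_steps = 2
batch_size = 2

params = {"w": jnp.zeros((4,))}          # toy linear model
optimizer = optax.adamw(1e-3)
opt_state = optimizer.init(params)

def loss_fn(params, batch):
    pred = batch["x"] @ params["w"]
    return jnp.mean((pred - batch["y"]) ** 2)

grad_fn = jax.grad(loss_fn)

# one "outer" step = accumulate_gradient_steps micro-steps + 1 optimizer update
grads_acc = jax.tree_util.tree_map(jnp.zeros_like, params)
for _ in range(accumulate_gradient_steps):
    batch = {"x": jnp.ones((batch_size, 4)), "y": jnp.zeros((batch_size,))}  # dummy micro-batch
    grads = grad_fn(params, batch)                                           # forward + backward
    grads_acc = jax.tree_util.tree_map(jnp.add, grads_acc, grads)

# average the accumulated gradients and apply a single update
grads_acc = jax.tree_util.tree_map(lambda g: g / accumulate_gradient_steps, grads_acc)
updates, opt_state = optimizer.update(grads_acc, opt_state, params)
params = optax.apply_updates(params, updates)
```

(One possibility I haven't verified: if the optimizer is wrapped in something like optax.MultiSteps, the accumulation counter would live inside the optimizer state rather than in train_step, which could explain why I don't see an explicit counter there.)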
1/ I don't see where that logic is handled in llama_train.py. It looks like in the train_step method there is no counter for accumulate_gradient_steps, and an optimizer step is taken after each forward pass?
2/ The logging is confusing: I would have expected the red and blue lines to overlap, not the blue and green.
Is it possible that step counts forward+backward operations rather than (forward+backward) x grad_acc + optimizer_step?
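As a quick back-of-the-envelope check of what each interpretation implies (illustrative only; this assumes nothing about how llama_train.py actually counts):

```python
# Samples seen per logged "step" under the two possible interpretations.
configs = {
    "blue":  {"batch_size": 1, "accumulate_gradient_steps": 1},
    "red":   {"batch_size": 2, "accumulate_gradient_steps": 1},
    "green": {"batch_size": 2, "accumulate_gradient_steps": 2},
}
for name, c in configs.items():
    per_micro = c["batch_size"]                                    # step == one forward+backward
    per_optim = c["batch_size"] * c["accumulate_gradient_steps"]   # step == one optimizer update
    print(f"{name}: {per_micro} samples/step (micro) vs {per_optim} samples/step (optimizer)")
```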