Hi,

I am unsure I understand the logic behind accumulate_gradient_steps.
I have these 3 configurations:
batch_size=1, accumulate_gradient_steps=1 -> blue
batch_size=2, accumulate_gradient_steps=1 -> red
batch_size=2, accumulate_gradient_steps=2 -> green
My initial understanding is that when doing gradient accumulation, accumulate_gradient_steps forward + backward passes are performed and then the optimizer takes a single step.
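To make that expectation concrete, here is a minimal JAX/optax sketch of the pattern I have in mind (purely illustrative; params, loss_fn, and the toy data are hypothetical names, not taken from llama_train.py):

```python
import jax
import jax.numpy as jnp
import optax

# Hypothetical illustration of my expectation, NOT the actual llama_train.py code:
# run `accumulate_gradient_steps` forward+backward passes, sum the gradients,
# then take a single optimizer step on the averaged gradient.
accumulate_gradient_steps = 2
batch_size = 2

params = {"w": jnp.zeros((4,))}          # toy linear model
optimizer = optax.adamw(1e-3)
opt_state = optimizer.init(params)

def loss_fn(params, batch):
    pred = batch["x"] @ params["w"]
    return jnp.mean((pred - batch["y"]) ** 2)

grad_fn = jax.grad(loss_fn)

# one "outer" step = accumulate_gradient_steps micro-steps + 1 optimizer update
grads_acc = jax.tree_util.tree_map(jnp.zeros_like, params)
for _ in range(accumulate_gradient_steps):
    batch = {"x": jnp.ones((batch_size, 4)), "y": jnp.zeros((batch_size,))}  # dummy micro-batch
    grads = grad_fn(params, batch)                                           # forward + backward
    grads_acc = jax.tree_util.tree_map(jnp.add, grads_acc, grads)

# average the accumulated gradients and apply a single update
grads_acc = jax.tree_util.tree_map(lambda g: g / accumulate_gradient_steps, grads_acc)
updates, opt_state = optimizer.update(grads_acc, opt_state, params)
params = optax.apply_updates(params, updates)
```

(One possibility I haven't verified: if the optimizer is wrapped in something like optax.MultiSteps, the accumulation counter would live inside the optimizer state rather than in train_step, which could explain why I don't see an explicit counter there.)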
1/ I don't see where that logic is handled in llama_train.py. It looks like in the train_step method there is no counter for accumulate_gradient_steps, and an optimizer step is taken after each forward pass?
2/ The logging is confusing: I would have expected the red and blue lines to overlap, not the blue and green.
Is it possible that step counts forward+backward operations rather than (forward+backward) x grad_acc + optimizer_step?
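As a quick back-of-the-envelope check of what each interpretation implies (illustrative only; this assumes nothing about how llama_train.py actually counts):

```python
# Samples seen per logged "step" under the two possible interpretations.
configs = {
    "blue":  {"batch_size": 1, "accumulate_gradient_steps": 1},
    "red":   {"batch_size": 2, "accumulate_gradient_steps": 1},
    "green": {"batch_size": 2, "accumulate_gradient_steps": 2},
}
for name, c in configs.items():
    per_micro = c["batch_size"]                                    # step == one forward+backward
    per_optim = c["batch_size"] * c["accumulate_gradient_steps"]   # step == one optimizer update
    print(f"{name}: {per_micro} samples/step (micro) vs {per_optim} samples/step (optimizer)")
```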