Closed vishaal27 closed 4 months ago
I didn't face such an issue in my runs — the gradient_accumulation_steps should be mentioned here. Perhaps a newer version of accelerate now takes the value from the accelerate config instead. Try feeding gradient_accumulation_steps directly to accelerate here, similarly to how it is done in this tutorial. Good luck :)
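The suggestion above (feeding gradient_accumulation_steps directly to accelerate) controls how often the optimizer actually steps. A minimal plain-Python sketch of those semantics, with no accelerate dependency — the names `FakeOptimizer` and `run_epoch` are illustrative, not from this repo:

```python
class FakeOptimizer:
    """Stand-in optimizer that only counts how many real updates happen."""
    def __init__(self):
        self.steps = 0

    def step(self):
        self.steps += 1

    def zero_grad(self):
        pass


def run_epoch(num_micro_batches, gradient_accumulation_steps, optimizer):
    for i in range(num_micro_batches):
        # loss.backward() would accumulate gradients into the params here
        if (i + 1) % gradient_accumulation_steps == 0:
            optimizer.step()       # one real update per N micro-batches
            optimizer.zero_grad()


opt = FakeOptimizer()
run_epoch(num_micro_batches=64, gradient_accumulation_steps=16, optimizer=opt)
print(opt.steps)  # 64 micro-batches / 16 accumulation steps = 4 updates
```

So when the value silently falls back to 1, every micro-batch triggers an optimizer step and the effective batch size shrinks accordingly, which is why the reported setting matters.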
Hey, thanks for your response. It seems the gradient accumulation steps in the script are read not from the deepspeed part of the config but from the head config. So I simply added this to the main config file:
```yaml
accelerator:
  gradient_accumulation_steps: 16
```
It works now! Closing the issue.
Hey, thanks for the great work and releasing the training code and the model publicly.
I am trying a sample training run on two GPUs using the command you provided:
However, when I try to train using this, I see that the gradient accumulation steps automatically get set to 1. I can see this in the `train.log`:

I did not modify anything in the training script; it somehow gets updated automatically. Did you face this in any of your earlier runs too?
This is my deepspeed config:
The `ds_config.json` has this:

Based on these, I am not sure why this happens. Do you have any ideas on how to debug this and get gradient accumulation working?
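For comparison (the actual `ds_config.json` contents are elided above), DeepSpeed configs typically declare the setting under a top-level key, and Hugging Face integrations additionally accept the special value `"auto"` so the launcher fills it in from its own config. This is an illustrative fragment, not the poster's file:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": 16
}
```

A mismatch between this key and the value accelerate resolves from its own config is one plausible way the effective setting could end up overridden to 1.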