yuvalkirstain / PickScore

MIT License

Issue with gradient accumulation steps in training #18

Closed — vishaal27 closed this issue 4 months ago

vishaal27 commented 4 months ago

Hey, thanks for the great work and releasing the training code and the model publicly.

I am trying a sample training run on two GPUs using the command you provided:

accelerate launch --dynamo_backend no --gpu_ids all --num_processes 2 --num_machines 1 --use_deepspeed trainer/scripts/train.py +experiment=clip_h output_dir=output

However, when I train with this command, the gradient accumulation steps automatically get reset to 1. I see this in train.log:

[2024-02-13 17:03:44,255][accelerate.accelerator][INFO] - Since you passed both train and evaluation dataloader, `is_train_batch_min` (here True will decide the `train_batch_size` (16).
[2024-02-13 17:03:44,256][accelerate.accelerator][INFO] - Updating DeepSpeed's gradient accumulation steps to 1 from 16.

I did not modify anything in the training script; the value somehow gets updated automatically. Did you face this in any of your earlier runs too?

This is my accelerate config (which points DeepSpeed at ds_config.json):

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The ds_config.json has this:

{
    "gradient_accumulation_steps": 16
}

Based on these, I am not sure why this happens. Do you have any ideas on how to debug this and get gradient accumulation working?
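
(Side note: if the deepspeed_config_file line is dropped, accelerate can also take the value inline in the YAML. A minimal sketch of just the relevant section, in case that helps isolate the problem:)

deepspeed_config:
  gradient_accumulation_steps: 16
  zero3_init_flag: true
distributed_type: DEEPSPEED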

yuvalkirstain commented 4 months ago

I didn't face such an issue in my runs - the gradient_accumulation_steps should be mentioned here. Perhaps a newer version of accelerate takes the value from the accelerate config instead. Try feeding gradient_accumulation_steps directly to accelerate here, similarly to how they do it in this tutorial. Good luck :)
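
For reference, the pattern from that tutorial looks roughly like this (a minimal sketch with placeholder model/data, not the actual PickScore trainer code):

import torch
from accelerate import Accelerator

# Placeholder model, optimizer, and data to keep the sketch self-contained.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)

# Passing the value here, rather than only in ds_config.json, gives accelerate
# the accumulation steps directly.
accelerator = Accelerator(gradient_accumulation_steps=16)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for (x,) in dataloader:
    # accumulate() only syncs gradients and steps the optimizer once
    # every gradient_accumulation_steps micro-batches.
    with accelerator.accumulate(model):
        loss = model(x).sum()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()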

vishaal27 commented 4 months ago

Hey, thanks for your response. It turns out the gradient accumulation steps are not read from the DeepSpeed part of the config but from the top-level config. So I simply added this to the main config file:

accelerator:
  gradient_accumulation_steps: 16
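
For anyone hitting the same thing, here is a rough sketch of how a value under accelerator: in the main config ends up in accelerate (illustrative only; the actual trainer wiring in the repo may differ):

from accelerate import Accelerator
from omegaconf import OmegaConf

# Hypothetical stand-in for the Hydra/OmegaConf config loaded by the trainer.
cfg = OmegaConf.create({"accelerator": {"gradient_accumulation_steps": 16}})

# The value is forwarded to the Accelerator constructor rather than read
# from ds_config.json.
accelerator = Accelerator(
    gradient_accumulation_steps=cfg.accelerator.gradient_accumulation_steps
)
print(accelerator.gradient_accumulation_steps)  # 16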

It works now! Closing the issue.