yuvalkirstain / PickScore

Failed to train the model with provided command #2

Closed · liming-ai closed this issue 1 year ago

liming-ai commented 1 year ago

Hi @yuvalkirstain

Thanks for your nice contribution!

I tried to train the model with the official command:

accelerate launch --dynamo_backend no --gpu_ids all --num_processes 8  --num_machines 1 --use_deepspeed trainer/scripts/train.py +experiment=clip_h output_dir=output

but there are some errors:

ValueError: When using `deepspeed_config_file`, the following accelerate config variables will be ignored: ['gradient_accumulation_steps', 'gradient_clipping',
'zero_stage', 'offload_optimizer_device', 'offload_param_device', 'zero3_save_16bit_model', 'mixed_precision'].
Please specify them appropriately in the DeepSpeed config file.
If you are using an accelerate config file, remove others config variables mentioned in the above specified list.
The easiest method is to create a new config following the questionnaire via `accelerate config`.
It will only ask for the necessary config variables when using `deepspeed_config_file`.
[01:26:13] ERROR    failed (exitcode: 1) local_rank: 0 (pid: 1527604) of binary: /usr/bin/python3
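
For context, the ValueError means that once deepspeed_config_file is set, the listed settings must live inside the DeepSpeed JSON itself rather than in the accelerate config. A minimal ds_config.json along those lines might look like the sketch below (the values are illustrative assumptions, not taken from this repo; "auto" lets accelerate fill them in from the training script):

    {
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "gradient_clipping": 1.0,
      "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": true
      },
      "bf16": { "enabled": true }
    }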
yuvalkirstain commented 1 year ago

Thank you so much for opening this issue! This seems like an accelerate issue that I am not able to reproduce, so let us try to solve it together :)

I suggest that you first remove your <cache_dir>/huggingface/accelerate if it exists and try again. If this does not work, then try the following (taken from the accelerate/deepspeed guide):

  1. Run accelerate config:
    $ accelerate config
    -------------------------------------------------------------------------------------------------------------------------------
    In which compute environment are you running?
    This machine                                                                                                                   
    -------------------------------------------------------------------------------------------------------------------------------
    Which type of machine are you using?                                                                                           
    multi-GPU                                                                                                                      
    How many different machines will you use (use more than 1 for multi-node training)? [1]:                                       
    Do you wish to optimize your script with torch dynamo?[yes/NO]: NO                                                               
    Do you want to use DeepSpeed? [yes/NO]: yes                                                                                    
    Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes                                                        
    Please enter the path to the json DeepSpeed config file: ds_config.json                                                        
    Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
    How many GPU(s) should be used for distributed training? [1]:4
    accelerate configuration saved at ds_config_sample.yaml
  2. Content of the accelerate config:
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      deepspeed_config_file: ds_config.json
      zero3_init_flag: true
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    machine_rank: 0
    main_training_function: main
    megatron_lm_config: {}
    num_machines: 1
    num_processes: 4
    rdzv_backend: static
    same_network: true
    use_cpu: false

    And then try to run the script again.
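
    If accelerate does not pick up the saved config on its own, passing it explicitly should also work. A sketch using accelerate's standard --config_file flag, the filename saved above, and the original training arguments:

    accelerate launch --config_file ds_config_sample.yaml trainer/scripts/train.py +experiment=clip_h output_dir=output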

Please update me on how it goes!

liming-ai commented 1 year ago

The issue was fixed once I removed /huggingface/accelerate. Thanks for your help!
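
For anyone hitting this later: assuming the default Hugging Face cache location (an assumption; yours may differ), the removal amounts to:

    rm -rf ~/.cache/huggingface/accelerate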

vishaal27 commented 9 months ago

Hi @liming-ai @yuvalkirstain, I am facing the same issue, but I have one main question: in the config you have deepspeed_config_file: ds_config.json. Do we have to get this ds_config.json from somewhere (for example: https://github.com/lucidrains/routing-transformer/blob/master/examples/enwik8_deepspeed/ds_config.json), or will the script still work if I don't have it?

yuvalkirstain commented 9 months ago

@vishaal27 did you try running it and hit an exception? The repo is verified, so following the instructions should allow you to run it.

vishaal27 commented 9 months ago

Yes, thanks, it worked, but I now have an issue with the gradient accumulation steps, so I opened another issue for it.