Closed · liming-ai closed this issue 1 year ago
Thank you so much for opening this issue! This seems like an accelerate
issue that I am not able to reproduce, so let us try to solve it together :)
I suggest that you first remove your <cache_dir>/huggingface/accelerate directory
if it exists and try again. If that does not work, then try this (taken from the accelerate/DeepSpeed guide):
$ accelerate config
-------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]: NO
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file: ds_config.json
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]:4
accelerate configuration saved at ds_config_sample.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false
And then try to run the script again.
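Before relaunching, it can help to fail fast on a malformed DeepSpeed JSON rather than mid-startup. The sketch below (an illustrative sanity check, not part of the repo; it writes a minimal sample `ds_config.json` so it is self-contained — in practice you would validate your own file) parses the config and checks that the ZeRO stage matches the `zero3_init_flag: true` setting in the accelerate YAML:

```python
import json

# Write a minimal sample DeepSpeed config so this snippet is
# self-contained; in practice, skip this and validate your own file.
sample = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 3},
}
with open("ds_config.json", "w") as f:
    json.dump(sample, f, indent=2)

# Parse the config; a JSON typo raises here instead of during launch.
with open("ds_config.json") as f:
    cfg = json.load(f)

# zero3_init_flag: true in the accelerate YAML only makes sense
# when the DeepSpeed config actually requests ZeRO Stage-3.
stage = cfg.get("zero_optimization", {}).get("stage")
assert stage == 3, f"expected ZeRO stage 3, got {stage}"
print(f"ds_config.json OK, ZeRO stage {stage}")
```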
Please update me on how it goes!
This issue is fixed when I remove the
Hi @liming-ai @yuvalkirstain, I am facing the same issue, but I have one main question: in the config you have deepspeed_config_file: ds_config.json. Do we have to get this ds_config.json from somewhere (for example: https://github.com/lucidrains/routing-transformer/blob/master/examples/enwik8_deepspeed/ds_config.json), or will the script still work if I don't have it?
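For reference, a minimal ZeRO Stage-3 ds_config.json could look like the sketch below. This is only an illustration of the standard DeepSpeed keys, not necessarily the file this repo expects; the "auto" values are placeholders that accelerate fills in from the training script.

```json
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "overlap_comm": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```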
@vishaal27 did you try running it and hit an exception? The repo is verified, so following the instructions should allow you to run it.
Yes, thanks, it worked, but I now have an issue with the gradient accumulation steps, so I opened another issue for it.
Hi @yuvalkirstain
Thanks for your nice contribution!
I tried to train the model with the official command:
but there are some errors: