yuvalkirstain / PickScore

Failed to train the model with provided command #2

Closed · liming-ai closed this issue 1 year ago

liming-ai commented 1 year ago

Hi @yuvalkirstain

Thanks for your nice contribution!

I tried to train the model with the official command:

accelerate launch --dynamo_backend no --gpu_ids all --num_processes 8  --num_machines 1 --use_deepspeed trainer/scripts/train.py +experiment=clip_h output_dir=output

but there are some errors:

ValueError: When using `deepspeed_config_file`, the following accelerate config variables will be ignored: ['gradient_accumulation_steps', 'gradient_clipping',
'zero_stage', 'offload_optimizer_device', 'offload_param_device', 'zero3_save_16bit_model', 'mixed_precision'].
Please specify them appropriately in the DeepSpeed config file.
If you are using an accelerate config file, remove others config variables mentioned in the above specified list.
The easiest method is to create a new config following the questionnaire via `accelerate config`.
It will only ask for the necessary config variables when using `deepspeed_config_file`.
[01:26:13] ERROR    failed (exitcode: 1) local_rank: 0 (pid: 1527604) of binary: /usr/bin/python3
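
For context, the ValueError means that once deepspeed_config_file is set, the listed settings must live inside the DeepSpeed JSON itself rather than in the accelerate config. A minimal ds_config.json along those lines might look like the sketch below (the values are illustrative assumptions, not taken from this repo; "auto" lets accelerate fill them in from the training script):

    {
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "gradient_clipping": 1.0,
      "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": true
      },
      "bf16": { "enabled": true }
    }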
yuvalkirstain commented 1 year ago

Thank you so much for opening this issue! This seems like an accelerate issue that I am not able to reproduce, so let us try to solve it together :)

I suggest that you first remove your <cache_dir>/huggingface/accelerate if it exists and try again. If this does not work, then try the following (taken from the accelerate/deepspeed guide):

  1. Run accelerate config:
    $ accelerate config
    -------------------------------------------------------------------------------------------------------------------------------
    In which compute environment are you running?
    This machine                                                                                                                   
    -------------------------------------------------------------------------------------------------------------------------------
    Which type of machine are you using?                                                                                           
    multi-GPU                                                                                                                      
    How many different machines will you use (use more than 1 for multi-node training)? [1]:                                       
    Do you wish to optimize your script with torch dynamo?[yes/NO]: NO                                                               
    Do you want to use DeepSpeed? [yes/NO]: yes                                                                                    
    Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes                                                        
    Please enter the path to the json DeepSpeed config file: ds_config.json                                                        
    Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
    How many GPU(s) should be used for distributed training? [1]:4
    accelerate configuration saved at ds_config_sample.yaml
  2. Content of the accelerate config:
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      deepspeed_config_file: ds_config.json
      zero3_init_flag: true
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    machine_rank: 0
    main_training_function: main
    megatron_lm_config: {}
    num_machines: 1
    num_processes: 4
    rdzv_backend: static
    same_network: true
    use_cpu: false

    And then try to run the script again.
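
    If accelerate does not pick up the saved config on its own, passing it explicitly should also work. A sketch using accelerate's standard --config_file flag, the filename saved above, and the original training arguments:

    accelerate launch --config_file ds_config_sample.yaml trainer/scripts/train.py +experiment=clip_h output_dir=output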

Please update me on how it goes!

liming-ai commented 1 year ago

The issue was fixed once I removed /huggingface/accelerate. Thanks for your help!
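
For anyone hitting this later: assuming the default Hugging Face cache location (an assumption; yours may differ), the removal amounts to:

    rm -rf ~/.cache/huggingface/accelerate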

vishaal27 commented 9 months ago

Hi @liming-ai @yuvalkirstain, I am facing the same issue, but I have one main question: in the config you have deepspeed_config_file: ds_config.json. Do we have to get this ds_config.json from somewhere (for example: https://github.com/lucidrains/routing-transformer/blob/master/examples/enwik8_deepspeed/ds_config.json), or will the script still work if I don't have it?

yuvalkirstain commented 9 months ago

@vishaal27 did you try running it and hit an exception? The repo is verified, so following the instructions should allow you to run it.

vishaal27 commented 9 months ago

Yes, thanks, it worked, but I now have an issue with the gradient accumulation steps, so I opened another issue for it.