xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

DDP training default #25

Closed eyuansu62 closed 2 years ago

eyuansu62 commented 2 years ago

Does train.py use DDP training by default, even when n_gpus = 1?

ChenWu98 commented 2 years ago

Yes, if you use our provided commands. To disable DDP, you may try something like

python train.py --seed 2 --cfg Salesforce/T5_base_finetune_wikitq.cfg --run_name T5_base_finetune_wikitq \
    --logging_strategy steps --logging_first_step true --logging_steps 4 \
    --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true \
    --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end \
    --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 \
    --do_train --do_eval --do_predict --predict_with_generate \
    --output_dir output/T5_base_finetune_wikitq --overwrite_output_dir \
    --per_device_train_batch_size 4 --per_device_eval_batch_size 16 \
    --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024

But note that our code has not been tested in this setting.
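For context, what matters is how the script is launched: the HuggingFace Trainer only initializes DDP when a distributed launcher has assigned the process a local rank. A minimal sketch of the two launch modes is below; the exact launcher and master port used in the repo's provided commands are assumptions.

# Plain single-process run: no distributed launcher, so the Trainer trains on a
# single GPU without DDP.
python train.py --cfg Salesforce/T5_base_finetune_wikitq.cfg <other flags as above>

# Distributed run: the launcher spawns one process per GPU and passes each a local
# rank, so the Trainer wraps the model in DDP even when --nproc_per_node is 1.
python -m torch.distributed.launch --nproc_per_node 1 --master_port 1234 \
    train.py --cfg Salesforce/T5_base_finetune_wikitq.cfg <other flags as above>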

eyuansu62 commented 2 years ago

BTW, have you ever tried UnifiedSKG + PICARD?

Timothyxxx commented 2 years ago

Nope. It would be an interesting direction to design a unified grammar and build PICARD-style constrained decoding on top of it.

eyuansu62 commented 2 years ago

Sorry, one more small question :) If I train t5-3b on Spider with batch size 2 and DDP but without DeepSpeed, I get an OOM error. Is that normal?

ChenWu98 commented 2 years ago

I remember that t5-3b requires DeepSpeed even on a 32G V100.
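Rough arithmetic for why: t5-3b has about 3B parameters, so fp32 weights alone take roughly 11-12 GB and gradients a similar amount, and optimizer state plus activations leave little or no headroom on a 32 GB card without sharding or offload. A minimal sketch of a DeepSpeed launch is below; the config path and the cfg filename are illustrative assumptions, not the repo's exact files.

# Sketch: launch via the DeepSpeed launcher instead of plain python / DDP.
# ZeRO-2 shards optimizer state and gradients across GPUs and can offload to CPU.
deepspeed train.py --deepspeed deepspeed/ds_config_zero2.json \
    --cfg Salesforce/T5_3b_finetune_spider.cfg <other flags as in the command above> \
    --per_device_train_batch_size 2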

eyuansu62 commented 2 years ago

How about a 40G A100?

Timothyxxx commented 2 years ago

Sure, that should be fine.

Timothyxxx commented 2 years ago

Feel free to contact us if you have further questions!