xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

GPU and batch size setting #11

Closed: cdhx closed this issue 2 years ago

cdhx commented 2 years ago

I am using the training command from the README:

    python -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_wikitq.cfg --run_name T5_base_finetune_wikitq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 8 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_wikitq --overwrite_output_dir --per_device_train_batch_size 4 --per_device_eval_batch_size 16 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true

My question is how to set the GPUs and the batch size. The README says this command uses 4 GPUs and a batch size of 128, but I don't see either setting in the command or in the code. Thanks.

cdhx commented 2 years ago

I have an idea about the GPU setting. In modeling_t5.py:

self.device_map = (
            get_device_map(len(self.block), range(torch.cuda.device_count())) if device_map is None else device_map
        )

Should I change range(torch.cuda.device_count()) to the GPUs I want to use, e.g. range(2,3) to use only GPU 2?

Timothyxxx commented 2 years ago

Hi,

1) About which GPU to run a job on: I don't know how your machine organizes its GPUs, but I think this part of the code comes from the original Hugging Face transformers at that version, and we did not change it.

Allow me to guess and make a recommendation: did you try setting CUDA_VISIBLE_DEVICES=2? That seems like the solution.

2) About the batch size: the effective batch size is the number of GPUs (processes) * per_device_train_batch_size * gradient_accumulation_steps (taking the training batch size as an example). If you want it to be 128 on 4 GPUs, per_device_train_batch_size * gradient_accumulation_steps should be 32; how you split that product depends on the GPU memory.
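The arithmetic in point 2 can be written out as a minimal sketch. The function name effective_batch_size is hypothetical and is not part of UnifiedSKG; the three values come from the README command quoted at the top of this thread.

```python
def effective_batch_size(num_gpus, per_device_train_batch_size,
                         gradient_accumulation_steps):
    """Number of training examples that contribute to one optimizer step."""
    return num_gpus * per_device_train_batch_size * gradient_accumulation_steps


# Values taken from the README command in this thread:
# --nproc_per_node 4, --per_device_train_batch_size 4,
# --gradient_accumulation_steps 8
readme_batch = effective_batch_size(4, 4, 8)
print(readme_batch)  # 128
```

With 4 GPUs the per-GPU product is 4 * 8 = 32, which matches the figure given above.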

Hope the information above is helpful! Thanks.
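For point 1, here is a minimal sketch of how CUDA_VISIBLE_DEVICES can be passed to a child process. The helper names are hypothetical, and the child process here only echoes the variable; it does not start training. CUDA renumbers the visible devices from 0, so a process started with CUDA_VISIBLE_DEVICES=2 sees a single device with index 0.

```python
import os
import subprocess
import sys


def env_with_visible_gpus(gpu_ids, base_env=None):
    """Copy an environment and restrict the GPUs visible to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def visible_devices_seen_by_child(gpu_ids):
    """Start a child Python process and report the variable it receives."""
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        env=env_with_visible_gpus(gpu_ids),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

On the command line the equivalent is to prefix the training command with CUDA_VISIBLE_DEVICES=2.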

cdhx commented 2 years ago

Thanks for your reply.

I had not set it; that seems to be the solution. Do you mean that "t5-based: 4 GPU" and "t5-3b: 8 GPU" are the original settings in the Hugging Face code? I had misunderstood and thought the number of GPUs was set by you in the command or the code.

As for the batch size, I am confused: in the first command, per_device_train_batch_size is 4 and the number of GPUs is 4, so why is the effective batch size 128?

Timothyxxx commented 2 years ago

Oh, I see the problem. There is another argument, gradient_accumulation_steps, which must also be multiplied in to get the effective batch size. In that command it is 8, so 4 * 4 * 8 = 128.

cdhx commented 2 years ago

I understand now, thanks for your explanation. I still have two questions about training:

  1. Where is the program entry point? I am not familiar with torch.distributed.launch. Is it train.py?

  2. If I want to run it in an IDE like PyCharm, should I change the code in train.py by replacing these values with my own project and entity name?

        wandb.init(
            project=os.getenv("WANDB_PROJECT", "uni-frame-for-knowledge-tabular-tasks"),
            name=training_args.run_name,
            entity=os.getenv("WANDB_ENTITY", 'sgtnew'),
            **init_args,
        )

Although I have configured the environment variables, when I run it in PyCharm it cannot log in to wandb. If I replace the values with my own project and entity name, it works fine. It also works fine from the command line.

wandb: ERROR Error while calling W&B API: permission denied (<Response [401]>)

Timothyxxx commented 2 years ago

Hi,

1) Take python -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py ....... as an example. The torch.distributed.launch --nproc_per_node 4 --master_port 1234 part configures distributed training (the number of processes per node and the master port). You can think of it as an addition on top of python train.py, so train.py is the entry point and the usage is otherwise almost the same as usual.

2) Sorry for the confusion. Please check the wandb setup instructions and set the environment variables to your own project and entity name.
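The wandb.init call quoted earlier in this thread reads WANDB_PROJECT and WANDB_ENTITY with os.getenv and falls back to the defaults in train.py when they are unset. A minimal sketch of that lookup follows; the function resolve_wandb_target is hypothetical, and os.environ.get behaves the same as os.getenv for this purpose. One plausible reading of the 401 error, offered as an assumption rather than a confirmed diagnosis, is that the IDE run configuration did not inherit the shell's variables, so the run fell back to the default entity.

```python
import os

# Defaults copied from the wandb.init call quoted in this thread.
DEFAULT_PROJECT = "uni-frame-for-knowledge-tabular-tasks"
DEFAULT_ENTITY = "sgtnew"


def resolve_wandb_target(environ=None):
    """Return the project and entity wandb.init would receive."""
    environ = os.environ if environ is None else environ
    return {
        "project": environ.get("WANDB_PROJECT", DEFAULT_PROJECT),
        "entity": environ.get("WANDB_ENTITY", DEFAULT_ENTITY),
    }


# Printing this at the top of a run shows which values the process sees.
print(resolve_wandb_target())
```

If the printed entity is the default rather than your own, the variables are not reaching the process, and they can be added to the IDE's run configuration instead of editing train.py.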

cdhx commented 2 years ago

Got it, thank you for your detailed reply!

cdhx commented 2 years ago

If I use fewer than 4 GPUs, it does not run. If I choose 4 GPUs, it works fine, and if I do not choose any GPUs (I have 5), it also uses only 4 GPUs and works fine. You also said in the README that this command uses 4 GPUs, so it seems that simply setting CUDA_VISIBLE_DEVICES=2 does not work. Is there any way to train on one or two GPUs? Thanks.

Timothyxxx commented 2 years ago

Hi,

I think it may be because our command contains --nproc_per_node 4, which should correspond to the number of GPUs?

Hope this information is helpful! Thanks.
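A minimal sketch of adapting the launch flags to fewer GPUs. Matching --nproc_per_node to the GPU count is what this thread suggests; raising gradient_accumulation_steps to keep the effective batch size at 128 is an assumption added here, not something the maintainers prescribe. The function names are hypothetical, and only the flags that change are built; the remaining flags from the README command would be appended unchanged.

```python
def accumulation_steps_for(num_gpus, per_device_train_batch_size=4,
                           target_batch_size=128):
    """Gradient accumulation steps that keep the effective batch size fixed."""
    per_step = num_gpus * per_device_train_batch_size
    if target_batch_size % per_step != 0:
        raise ValueError("target batch size is not divisible by "
                         "num_gpus * per_device_train_batch_size")
    return target_batch_size // per_step


def launch_args(num_gpus, per_device_train_batch_size=4,
                target_batch_size=128):
    """Build the GPU-dependent part of the training command."""
    steps = accumulation_steps_for(num_gpus, per_device_train_batch_size,
                                   target_batch_size)
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),  # one process per GPU
        "--master_port", "1234",
        "train.py",
        "--per_device_train_batch_size", str(per_device_train_batch_size),
        "--gradient_accumulation_steps", str(steps),
    ]


for n in (1, 2, 4):
    print(" ".join(launch_args(n)))
```

For example, with 2 GPUs this gives --nproc_per_node 2 and --gradient_accumulation_steps 16, since 2 * 4 * 16 = 128.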

cdhx commented 2 years ago

It works. Thank you for your replies these past few days.

Timothyxxx commented 2 years ago

You are welcome. Thanks for your attention to our work!