yizhongw / Tk-Instruct

Tk-Instruct is a Transformer model that is tuned to solve many NLP tasks by following instructions.
https://arxiv.org/abs/2204.07705
MIT License

finetune 11b model #22

Closed zhilizju closed 1 year ago

zhilizju commented 1 year ago

Hi, nice work! In the paper, for the 11B model, "These experiments are run on Google V3-256 TPUs", while for T5 models smaller than 11B, "These experiments are conducted with 8 A100 GPUs with 48GB GPU memory". Have you tried fine-tuning the 11B model on 8 A100 GPUs with 48 GB memory? I am trying to fine-tune T5 11B with DeepSpeed on 8 RTX6000 GPUs (48 GB each). I use your script and only change google/t5-xl-lm-adapt to google/t5-xxl-lm-adapt. When I use 'ds_configs/stage2.config', I hit this error:

[INFO|modeling_utils.py:1770] 2023-03-04 18:11:21,751 >> loading weights file /home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt/pytorch_model.bin
[2023-03-04 18:21:36,276] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3689226
[2023-03-04 18:21:36,305] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3689227
[2023-03-04 18:21:37,442] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3689228
[2023-03-04 18:21:38,699] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3689229
[2023-03-04 18:21:40,033] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3689230
[2023-03-04 18:21:41,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3689231
[2023-03-04 18:21:42,864] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3689232
[2023-03-04 18:21:44,274] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3689233
[2023-03-04 18:21:45,687] [ERROR] [launch.py:324:sigkill_handler] ['/home/lizhi/anaconda3/envs/tk-instruct/bin/python', '-u', 'src/run_s2s.py', '--local_rank=7', '--do_train', '--do_predict', '--predict_with_generate', '--model_name_or_path', '/home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt', '--max_source_length', '1024', '--max_target_length', '128', '--generation_max_length', '128', '--max_num_instances_per_task', '100', '--max_num_instances_per_eval_task', '100', '--add_task_name', 'False', '--add_task_definition', 'True', '--num_pos_examples', '2', '--num_neg_examples', '0', '--add_explanation', 'False', '--tk_instruct', 'False', '--data_dir', 'data/splits/default', '--task_dir', 'data/tasks', '--output_dir', 'output/', '--overwrite_output_dir', '--cache_dir', './cache/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '5e-05', '--num_train_epochs', '1', '--lr_scheduler_type', 'constant', '--warmup_steps', '0', '--logging_strategy', 'steps', '--logging_steps', '500', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2500', '--deepspeed', 'ds_configs/stage2.config', '--bf16', '--run_name', 't5-experiment'] exits with return code = -9

I guess the memory is not enough, but I have noticed that other projects fine-tune models of this size on similar GPUs (a rough memory estimate follows the log below). So I switched from 'ds_configs/stage2.config' to 'ds_configs/stage3.config'. The server hangs and the error is similar to the one above:

[2023-03-05 03:26:26,366] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3691645
[2023-03-05 03:26:26,366] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3691646
[2023-03-05 03:26:27,663] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3691647
[2023-03-05 03:26:28,920] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3691648
[2023-03-05 03:26:30,178] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3691649
[2023-03-05 03:26:31,553] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3691650
[2023-03-05 03:26:32,846] [ERROR] [launch.py:324:sigkill_handler] ['/home/lizhi/anaconda3/envs/tk-instruct/bin/python', '-u', 'src/run_s2s.py', '--local_rank=7', '--do_train', '--do_predict', '--predict_with_generate', '--model_name_or_path', '/home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt', '--max_source_length', '1024', '--max_target_length', '128', '--generation_max_length', '128', '--max_num_instances_per_task', '1', '--max_num_instances_per_eval_task', '1', '--add_task_name', 'False', '--add_task_definition', 'True', '--num_pos_examples', '2', '--num_neg_examples', '0', '--add_explanation', 'False', '--tk_instruct', 'False', '--data_dir', 'data/splits/default', '--task_dir', 'data/tasks', '--output_dir', 'output/', '--overwrite_output_dir', '--cache_dir', './cache/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '5e-05', '--num_train_epochs', '1', '--lr_scheduler_type', 'constant', '--warmup_steps', '0', '--logging_strategy', 'steps', '--logging_steps', '500', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2500', '--deepspeed', 'ds_configs/stage3.config'] exits with return code = -9
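For what it's worth, my rough reading of the failure: return code -9 means the launcher's subprocesses were killed with SIGKILL, which on Linux usually points to the kernel OOM killer and host RAM, not GPU memory. Here is a back-of-the-envelope estimate (my own reasoning, not anything from the repo or paper) of why just loading the checkpoint can already exhaust host RAM when every DeepSpeed rank calls from_pretrained on the full fp32 weights:

# Rough estimate only (my own numbers, not from the repo):
# exit code -9 is SIGKILL, on Linux typically the kernel OOM killer.
params = 11e9        # ~11B parameters in t5-xxl-lm-adapt
bytes_fp32 = 4       # pytorch_model.bin stores fp32 weights
ranks = 8            # one process per GPU under the deepspeed launcher

ckpt_gb = params * bytes_fp32 / 1024**3
peak_host_gb = ckpt_gb * ranks   # every rank loads the full checkpoint into CPU RAM
print(f"checkpoint ~{ckpt_gb:.0f} GB, naive peak host RAM ~{peak_host_gb:.0f} GB")

That is roughly 41 GB per fp32 copy and about 330 GB if all eight ranks hold a copy at the same time, which would explain a SIGKILL on most single machines regardless of the ZeRO stage.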

Can you give me some advice?

zhilizju commented 1 year ago

I further modified the DeepSpeed config: I added a new config (named 11b_stage3_offload.config) under the ds_configs folder. Its content is:

{ "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": false }, "offload_param": { "device": "cpu", "pin_memory": false }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": false }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 2000, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": false }

It still doesn't work; the run is killed in the same way:

[2023-03-05 06:40:25,173] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698707
[2023-03-05 06:40:27,249] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698708
[2023-03-05 06:40:27,250] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698709
[2023-03-05 06:40:28,626] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698710
[2023-03-05 06:40:30,045] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698711
[2023-03-05 06:40:31,498] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698712
[2023-03-05 06:40:32,912] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698713
[2023-03-05 06:40:34,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698714
[2023-03-05 06:40:35,705] [ERROR] [launch.py:324:sigkill_handler] ['/home/lizhi/anaconda3/envs/tk-instruct/bin/python', '-u', 'src/run_s2s.py', '--local_rank=7', '--do_train', '--do_predict', '--predict_with_generate', '--model_name_or_path', '/home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt', '--max_source_length', '1024', '--max_target_length', '128', '--generation_max_length', '128', '--max_num_instances_per_task', '1', '--max_num_instances_per_eval_task', '1', '--add_task_name', 'False', '--add_task_definition', 'True', '--num_pos_examples', '2', '--num_neg_examples', '0', '--add_explanation', 'False', '--tk_instruct', 'False', '--data_dir', 'data/splits/default', '--task_dir', 'data/tasks', '--output_dir', 'output/', '--overwrite_output_dir', '--cache_dir', './cache/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '5e-05', '--num_train_epochs', '1', '--lr_scheduler_type', 'constant', '--warmup_steps', '0', '--logging_strategy', 'steps', '--logging_steps', '500', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2500', '--deepspeed', 'ds_configs/11b_stage3_offload.config', '--bf16', '--run_name', 't5-experiment'] exits with return code = -9

System info (please complete the following information):

- OS: [e.g. Ubuntu 20.04]
- GPU count and types: one machine with 8x RTX6000, 48 GB each
- Python version: 3.8.16
- Launcher context:

#!/bin/bash

set -x

export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export TRANSFORMERS_CACHE=/home/lizhi/.cache/huggingface

port=$(shuf -i25000-30000 -n1)

deepspeed --master_port $port src/run_s2s.py \
    --do_train \
    --do_predict \
    --predict_with_generate \
    --model_name_or_path google/t5-xxl-lm-adapt \
    --max_source_length 1024 \
    --max_target_length 128 \
    --generation_max_length 128 \
    --max_num_instances_per_task 1 \
    --max_num_instances_per_eval_task 1 \
    --add_task_name False \
    --add_task_definition True \
    --num_pos_examples 2 \
    --num_neg_examples 0 \
    --add_explanation False \
    --tk_instruct False \
    --data_dir data/splits/default \
    --task_dir data/tasks \
    --output_dir output/ \
    --overwrite_output_dir \
    --cache_dir ./cache/ \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-05 \
    --num_train_epochs 1 \
    --lr_scheduler_type constant \
    --warmup_steps 0 \
    --logging_strategy steps \
    --logging_steps 500 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2500 \
    --deepspeed ds_configs/11b_stage3_offload.config \
    --bf16 \
    --run_name t5-experiment
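Before launching, I also compare the machine's host RAM against the estimates above with a small check of my own (not part of the Tk-Instruct scripts):

# Minimal host-RAM check (my own helper, not part of the repo):
# compare MemTotal / MemAvailable against the rough estimates above.
with open("/proc/meminfo") as f:
    info = dict(line.split(":", 1) for line in f)

for key in ("MemTotal", "MemAvailable"):
    kb = int(info[key].strip().split()[0])
    print(f"{key}: {kb / 1024**2:.0f} GB")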

Any help? @yizhongw Thank you.