yuanzhoulvpi2017 / zero_nlp

Chinese NLP solutions (large models, data, models, training, inference)
MIT License

internlm-sft: low GPU utilization during single-machine multi-GPU fine-tuning #170

Closed. Shamepoo closed this issue 5 months ago.

Shamepoo commented 5 months ago

The more GPUs I use, the lower the utilization; overall throughput stays about the same as a single GPU.

Modification to train_sft.py, adding the model_max_length argument:

        model_args.model_name_or_path, trust_remote_code=True, model_max_length=training_args.model_max_length)

Without this change, the following error is raised:

Traceback (most recent call last):
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 1377, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3466, in _map_single
    batch = apply_function_on_filtered_inputs(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3345, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/services/lmt/repos/llms/zero_nlp/internlm-sft/train_sft.py", line 174, in generate_sources_targets
    input_output = preprocess(
                   ^^^^^^^^^^^
  File "/home/services/lmt/repos/llms/zero_nlp/internlm-sft/train_sft.py", line 125, in preprocess
    examples_tokenized, sources_tokenized = [_tokenize_fn(
                                            ^^^^^^^^^^^^^^
  File "/home/services/lmt/repos/llms/zero_nlp/internlm-sft/train_sft.py", line 125, in <listcomp>
    examples_tokenized, sources_tokenized = [_tokenize_fn(
                                             ^^^^^^^^^^^^^
  File "/home/services/lmt/repos/llms/zero_nlp/internlm-sft/train_sft.py", line 94, in _tokenize_fn
    tokenized_list = [
                     ^
  File "/home/services/lmt/repos/llms/zero_nlp/internlm-sft/train_sft.py", line 95, in <listcomp>
    tokenizer(
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2829, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 2935, in _call_one
    return self.encode_plus(
           ^^^^^^^^^^^^^^^^^
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3008, in encode_plus
    return self._encode_plus(
           ^^^^^^^^^^^^^^^^^^
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 576, in _encode_plus
    batched_output = self._batch_encode_plus(
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 496, in _batch_encode_plus
    self.set_truncation_and_padding(
  File "/home/services/anaconda3/envs/text-webui-311/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 451, in set_truncation_and_padding
    self._tokenizer.enable_truncation(**target)
OverflowError: int too big to convert
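
For context, this OverflowError typically happens because, when model_max_length is not configured, Hugging Face fast tokenizers fall back to a huge sentinel value (int(1e30)), which overflows once it is forwarded to the Rust backend's enable_truncation. A sketch of the change described above, assuming the fragment is the tokenizer-loading call (AutoTokenizer.from_pretrained) in train_sft.py, with the model path and max length taken from train_zero2.sh below:

    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained(
        "/home/services/lmt/repos/llms/models/google/gemma-2b",
        trust_remote_code=True,
        # Without an explicit value, model_max_length can default to the
        # int(1e30) sentinel, which the Rust truncation backend cannot accept
        # ("OverflowError: int too big to convert").
        model_max_length=2048,
    )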

train_zero2.sh

deepspeed --include localhost:1,2,3 train_sft.py \
    --deepspeed ds_zero2_no_offload.json \
    --model_name_or_path /home/services/lmt/repos/llms/models/google/gemma-2b \
    --use_lora true \
    --use_deepspeed true \
    --data_path /home/services/lmt/repos/llms/data/sft \
    --bf16 true \
    --fp16 false \
    --output_dir output_refusev2 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_total_limit 3 \
    --learning_rate 4e-5 \
    --logging_steps 100 \
    --tf32 False \
    --model_max_length 2048 \
    --report_to "wandb" \
    --save_steps 20000 \
    --dataloader_num_workers 64
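
For reference, this launch includes three GPUs (--include localhost:1,2,3), so the effective global batch size works out to per_device_train_batch_size × num_gpus × gradient_accumulation_steps:

    # Effective global batch size implied by the flags above; the GPU count
    # comes from --include localhost:1,2,3.
    per_device_train_batch_size = 3
    num_gpus = 3
    gradient_accumulation_steps = 8
    print(per_device_train_batch_size * num_gpus * gradient_accumulation_steps)  # 72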

Training report: https://api.wandb.ai/links/a86056549/zwfn6e72

Dependency versions: transformers==4.38.2, peft==0.9.0, deepspeed==0.14.0

yuanzhoulvpi2017 commented 5 months ago

I took a quick look and the code itself seems fine.

Single-machine multi-GPU being slow on 3090s feels like a hardware issue; I normally run this same code on A800, A100, and V100 without any problems.

Shamepoo commented 5 months ago

OK, thanks.

[image: training loss curve]

The loss also fluctuates quite a lot when training on multiple GPUs, which is strange.

Shamepoo commented 5 months ago

Case closed: when I ran separate fine-tuning jobs on two individual cards at the same time, neither card's performance could ramp up, so it is probably insufficient power supply.
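
To double-check this kind of thing, here is a minimal sketch for watching per-GPU utilization and power draw during training (it assumes the pynvml / nvidia-ml-py package; the sampling loop and output format are only illustrative):

    import time
    import pynvml

    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]

    for _ in range(12):  # sample for about a minute
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu    # percent
            power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0    # watts
            limit = pynvml.nvmlDeviceGetPowerManagementLimit(h) / 1000.0
            print(f"GPU{i}: util={util:3d}%  power={power:.0f}W / limit={limit:.0f}W")
        time.sleep(5)

    pynvml.nvmlShutdown()

If the cards sit far below their power limit while utilization also stays low, the bottleneck is more likely data loading or communication; cards pinned at the limit, or board power sagging under load, point toward power delivery.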

yuanzhoulvpi2017 commented 5 months ago

Haha, I see. Yes, a hardware problem does seem like the more likely explanation.

Shamepoo commented 5 months ago

Thanks a lot!