ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

How much GPU memory does LoRA pre-training of the 13B model need? A single machine with two 24GB cards (2*24GB) runs out of VRAM #730

Closed wangxigui closed 1 year ago

wangxigui commented 1 year ago

The following items must be checked before submitting

Issue type

Model training and fine-tuning

Base model

Alpaca-Plus-13B

Operating system

Linux

Detailed description of the problem

lr=2e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj"
#lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
#modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model='/home/ps/workspace/llm/Chinese-LLaMA-Alpaca/output/merged_13b'
chinese_tokenizer_path='/home/ps/workspace/llm/Chinese-LLaMA-Alpaca/output/merged_13b'
dataset_dir='/home/ps/workspace/llm/Chinese-LLaMA-Alpaca/data'
data_cache='/home/ps/workspace/llm/Chinese-LLaMA-Alpaca/cache'
max_steps=5
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
output_dir='/home/ps/workspace/llm/Chinese-LLaMA-Alpaca/output/law_13b'

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 2 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 200 \
    --max_steps ${max_steps} \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16

Dependencies (must be provided for code-related issues)

No response

Run logs or screenshots

OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB (GPU 0; 23.65 GiB total capacity; 21.17 GiB already allocated; 73.25 MiB free; 21.17 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 551195 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 551194) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/ps/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ps/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ps/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/ps/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/ps/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ps/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-08_00:09:34
  host      : ps.ps
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 551194)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
iMountTai commented 1 year ago

The 13B model's weights alone take about 24GB; with the LoRA and optimizer states on top, training definitely won't fit in 24GB. I'd suggest exploring ZeRO-3 training or upgrading the hardware.
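
For reference, a minimal sketch of what a ZeRO-3 configuration with CPU offload could look like (not from this repository; the keys follow the standard DeepSpeed/HF Trainer integration, "auto" values are filled in by the Trainer, and the filename `ds_zero3_offload.json` is just an example to pass via `--deepspeed` in place of `ds_zero2_no_offload.json`):

```python
import json

# Minimal ZeRO-3 config sketch: shard optimizer states, gradients and parameters
# across GPUs, and offload optimizer states/parameters to CPU memory.
ds_zero3_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

with open("ds_zero3_offload.json", "w") as f:
    json.dump(ds_zero3_config, f, indent=2)
```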

wangxigui commented 1 year ago

With 8-bit quantization the model now loads fine:

```diff
         torch_dtype = (
             model_args.torch_dtype
@@ -535,6 +533,8 @@ def main():
             revision=model_args.model_revision,
             use_auth_token=True if model_args.use_auth_token else None,
             torch_dtype=torch_dtype,
+            load_in_8bit=True,
+            device_map='auto',
             low_cpu_mem_usage=True
         )
     else:
@@ -558,6 +558,7 @@ def main():
             "- Continue pre-training Chinese Alpaca: 49954 / 49954 \n")

     model.resize_token_embeddings(len(tokenizer))
+
     if training_args.peft_path is not None:
         logger.info("Peft from pre-trained model")
         model = PeftModel.from_pretrained(model, training_args.peft_path)
@@ -581,11 +582,14 @@ def main():
             modules_to_save=modules_to_save)
         model = get_peft_model(model, peft_config)
     model.print_trainable_parameters()
+
     old_state_dict = model.state_dict
     model.state_dict = (
         lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
     ).__get__(model, type(model))

+    model = prepare_model_for_int8_training(model)
+```

But now I'm hitting an error during fine-tuning (optimizer IndexError: list index out of range). Any pointers on what might be causing it?
│ /home/ps/.local/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:270 in      │
│ __init__                                                                                         │
│                                                                                                  │
│    267 │   │   ), f"allgather_bucket_size must be a multiple of nccl_start_alignment_factor, {s  │
│    268 │   │                                                                                     │
│    269 │   │   self.all_reduce_print = False                                                     │
│ ❱  270 │   │   self.dtype = self.optimizer.param_groups[0]['params'][0].dtype                        │
│    271 │   │                                                                                     │
│    272 │   │   self.round_robin_bit16_groups = []                                                │
│    273 │   │   self.round_robin_bit16_indices = []                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range

![image](https://github.com/ymcui/Chinese-LLaMA-Alpaca/assets/6872439/2bfdffde-6481-4eeb-beb5-451048acbe01)
![image](https://github.com/ymcui/Chinese-LLaMA-Alpaca/assets/6872439/ddf0c267-e279-4205-8a07-d4515e6f35af)
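
The IndexError at `stage_1_and_2.py:270` means the first optimizer parameter group is empty when DeepSpeed initializes ZeRO (it reads `param_groups[0]['params'][0].dtype`). One thing worth checking, purely as an assumption and not confirmed in this thread, is whether the 8-bit/`device_map='auto'` loading path interacts badly with the `--deepspeed` launcher, since the quantized base weights are frozen and only the LoRA weights remain trainable. A quick way to see what the optimizer would actually receive:

```python
# Diagnostic sketch: assumes `model` is the PEFT-wrapped model from the patch above.
# An empty list here means DeepSpeed's ZeRO init has nothing to build parameter
# groups from, which raises exactly this IndexError.
trainable = [(name, p.dtype, tuple(p.shape))
             for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable parameters")
for name, dtype, shape in trainable[:10]:
    print(name, dtype, shape)
```
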
wangjvjie commented 1 year ago

I can run it on 4 cards.

wangxigui commented 1 year ago

I can run it on 4 cards.

Did you load it with 8-bit quantization, or use any other tricks to reduce VRAM?

zhangxueren9 commented 1 year ago

I can run it on 4 cards.

Four 3090s?

Geministudents commented 1 year ago

A single 45GB card is enough; it uses about 42GB.

zhangxueren9 commented 1 year ago

A single 45GB card is enough; it uses about 42GB.

Thanks.

bigcash commented 1 year ago

On my four 3090s, LoRA pre-training (pt) fills up the VRAM on every card, peaking at 23GB.

zhangxueren9 commented 1 year ago

23G

What sequence length are you using?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 1 year ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.