ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

run_pt incremental pre-training: restarting after an interruption always fails #742

Closed: smartparrot closed this issue 1 year ago

smartparrot commented 1 year ago

Checklist required before submitting

Issue type

Model training and fine-tuning

Base model

LLaMA-7B

Operating system

Linux

Detailed description of the problem

I run run_pt.sh for incremental pre-training. After interrupting training and re-running the script, resuming always fails. What could be going on? The base model is baichuan.

lr=2e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
# modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model="/root/LLM/baichuan-7B" #path/to/hf/llama/dir
chinese_tokenizer_path="/root/LLM/baichuan-7B" #path/to/chinese/llama/tokenizer/dir
dataset_dir="/root/LLM/Chinese-LLaMA-Alpaca/data/mydata" #path/to/pt/data/dir
data_cache=temp_data_cache_dir
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
output_dir=output_dir

deepspeed_config_file=ds_zero2_no_offload.json

echo $RANDOM

# The GPU count in CUDA_VISIBLE_DEVICES must match --nproc_per_node
# Restarting after an interrupted run raises an error; LoRA may be involved
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nnodes 1 --nproc_per_node 3 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.01 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --evaluation_strategy "epoch" \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 100 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 10 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --ddp_find_unused_parameters False
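
One possible workaround, sketched here rather than an official fix: if the Trainer checkpoints hold only the LoRA adapter weights (which the missing base-model keys in the log further down suggest), DeepSpeed's strict engine resume has nothing to rebuild the full module from. Instead of passing resume_from_checkpoint, one could rebuild the PeftModel from the original base weights and load just the adapter state before relaunching. The checkpoint filename pytorch_model.bin and the assumption that it contains only adapter weights are guesses about this particular Trainer/PEFT setup:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, set_peft_model_state_dict

# Assumed paths: the base model from run_pt.sh and the checkpoint dir from the log.
base_model = AutoModelForCausalLM.from_pretrained(
    "/root/LLM/baichuan-7B",
    trust_remote_code=True,   # baichuan ships custom modeling code
    torch_dtype=torch.float16,
)

# Recreate the same LoRA config that run_pt.sh uses.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],
)
model = get_peft_model(base_model, peft_config)

# Load only the adapter weights saved in the checkpoint (filename is an
# assumption), then launch a fresh run *without* resume_from_checkpoint.
adapter_state = torch.load(
    "output_dir/checkpoint-890/pytorch_model.bin", map_location="cpu"
)
set_peft_model_state_dict(model, adapter_state)

Note that this discards the optimizer and LR-scheduler state: the cosine schedule and warmup restart from scratch, and only the adapter weights carry over.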

Dependencies (required for code-related issues)

absl-py                  1.4.0
accelerate               0.19.0
aiofiles                 23.1.0
aiohttp                  3.8.4
aiosignal                1.3.1
altair                   4.2.2
anyio                    3.7.0
async-timeout            4.0.2
attrs                    23.1.0
backports.zoneinfo       0.2.1
beautifulsoup4           4.12.2
bitsandbytes             0.37.1
blinker                  1.6.2
bs4                      0.0.1
cachetools               5.3.0
certifi                  2023.5.7
charset-normalizer       3.1.0
click                    8.1.3
cmake                    3.26.3
contourpy                1.0.7
cpm-kernels              1.0.11
cycler                   0.11.0
datasets                 2.10.1
decorator                5.1.1
deepspeed                0.9.2
diffusers                0.16.1
dill                     0.3.6
entrypoints              0.4
exceptiongroup           1.1.1
fastapi                  0.97.0
ffmpy                    0.3.0
filelock                 3.12.0
fonttools                4.40.0
frozenlist               1.3.3
fsspec                   2023.5.0
gitdb                    4.0.10
GitPython                3.1.31
google-auth              2.18.1
google-auth-oauthlib     1.0.0
gradio                   3.34.0
gradio_client            0.2.6
grpcio                   1.54.2
h11                      0.14.0
hjson                    3.1.0
httpcore                 0.17.2
httpx                    0.24.1
huggingface-hub          0.14.1
icetk                    0.0.4
idna                     3.4
importlib-metadata       6.6.0
importlib-resources      5.12.0
iniconfig                2.0.0
jieba                    0.42.1
Jinja2                   3.1.2
joblib                   1.2.0
jsonschema               4.17.3
kiwisolver               1.4.4
latex2mathml             3.76.0
linkify-it-py            2.0.2
lit                      16.0.5
Markdown                 3.4.3
markdown-it-py           2.2.0
MarkupSafe               2.1.2
matplotlib               3.7.1
mdit-py-plugins          0.3.3
mdtex2html               1.2.0
mdurl                    0.1.2
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.14
networkx                 3.1
ninja                    1.11.1
numpy                    1.24.3
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
oauthlib                 3.2.2
orjson                   3.9.1
packaging                23.1
pandas                   2.0.1
peft                     0.3.0
Pillow                   9.5.0
pip                      23.0.1
pkgutil_resolve_name     1.3.10
pluggy                   1.0.0
protobuf                 3.20.0
psutil                   5.9.5
py-cpuinfo               9.0.0
pyarrow                  12.0.0
pyasn1                   0.5.0
pyasn1-modules           0.3.0
pydantic                 1.10.9
pydeck                   0.8.1b0
pydub                    0.25.1
Pygments                 2.15.1
Pympler                  1.0.1
pyparsing                3.0.9
pyrsistent               0.19.3
pytest                   7.3.1
python-dateutil          2.8.2
python-multipart         0.0.6
pytz                     2023.3
PyYAML                   6.0
regex                    2023.5.5
requests                 2.30.0
requests-oauthlib        1.3.1
responses                0.18.0
rich                     13.4.2
rouge-chinese            1.0.3
rsa                      4.9
safetensors              0.3.1
scikit-learn             1.2.2
scipy                    1.10.1
semantic-version         2.10.0
sentencepiece            0.1.99
setuptools               67.8.0
six                      1.16.0
sklearn                  0.0
smmap                    5.0.0
sniffio                  1.3.0
soupsieve                2.4.1
starlette                0.27.0
streamlit                1.22.0
streamlit-chat           0.0.2.2
sympy                    1.12
tenacity                 8.2.2
tensorboard              2.13.0
tensorboard-data-server  0.7.0
threadpoolctl            3.1.0
tokenizers               0.13.3
toml                     0.10.2
tomli                    2.0.1
toolz                    0.12.0
torch                    1.13.1
torchvision              0.15.2
tornado                  6.3.2
tqdm                     4.65.0
transformers             4.29.2
triton                   2.0.0
typing_extensions        4.5.0
tzdata                   2023.3
tzlocal                  5.0.1
uc-micro-py              1.0.2
urllib3                  1.26.15
uvicorn                  0.22.0
validators               0.20.0
watchdog                 3.0.0
websockets               11.0.3
Werkzeug                 2.3.4
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.9.2
zipp                     3.15.0

Run log or screenshot

[INFO|deepspeed.py:390] 2023-07-12 17:35:55,173 >> Attempting to resume from output_dir/checkpoint-890
[2023-07-12 17:35:55,174] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt...
[2023-07-12 17:36:02,326] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt.
[2023-07-12 17:36:02,964] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt...
[2023-07-12 17:36:09,887] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /root/LLM/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py:651  │
│ in <module>                                                                  │
│                                                                              │
│   648                                                                        │
│   649                                                                        │
│   650 if __name__ == "__main__":                                             │
│ ❱ 651 │   main()                                                             │
│   652                                                                        │
│                                                                              │
│ /root/LLM/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py:619  │
│ in main                                                                      │
│                                                                              │
│   616 │   │   elif last_checkpoint is not None:                              │
│   617 │   │   │   checkpoint = last_checkpoint                               │
│   618 │   │   print('checkpoint111111--------------',checkpoint)             │
│ ❱ 619 │   │   train_result = trainer.train(resume_from_checkpoint=checkpoint │
│   620 │   │                                                                  │
│   621 │   │   metrics = train_result.metrics                                 │
│   622                                                                        │
│                                                                              │
│ /root/anaconda3/envs/glm6b/lib/python3.8/site-packages/transformers/trainer. │
│ py:1664 in train                                                             │
│                                                                              │
│   1661 │   │   inner_training_loop = find_executable_batch_size(             │
│   1662 │   │   │   self._inner_training_loop, self._train_batch_size, args.a │
│   1663 │   │   )                                                             │
│ ❱ 1664 │   │   return inner_training_loop(                                   │
│   1665 │   │   │   args=args,                                                │
│   1666 │   │   │   resume_from_checkpoint=resume_from_checkpoint,            │
│   1667 │   │   │   trial=trial,                                              │
│                                                                              │
│ /root/anaconda3/envs/glm6b/lib/python3.8/site-packages/transformers/trainer. │
│ py:1741 in _inner_training_loop                                              │
│                                                                              │
│   1738 │   │   │   or self.fsdp is not None                                  │
│   1739 │   │   )                                                             │
│   1740 │   │   if args.deepspeed:                                            │
│ ❱ 1741 │   │   │   deepspeed_engine, optimizer, lr_scheduler = deepspeed_ini │
│   1742 │   │   │   │   self, num_training_steps=max_steps, resume_from_check │
│   1743 │   │   │   )                                                         │
│   1744 │   │   │   self.model = deepspeed_engine.module                      │
│                                                                              │
│ /root/anaconda3/envs/glm6b/lib/python3.8/site-packages/transformers/deepspee │
│ d.py:392 in deepspeed_init                                                   │
│                                                                              │
│   389 │   │   if len(deepspeed_checkpoint_dirs) > 0:                         │
│   390 │   │   │   logger.info(f"Attempting to resume from {resume_from_check │
│   391 │   │   │   # this magically updates self.optimizer and self.lr_schedu │
│ ❱ 392 │   │   │   load_path, _ = deepspeed_engine.load_checkpoint(           │
│   393 │   │   │   │   resume_from_checkpoint, load_optimizer_states=True, lo │
│   394 │   │   │   )                                                          │
│   395 │   │   │   if load_path is None:                                      │
│                                                                              │
│ /root/anaconda3/envs/glm6b/lib/python3.8/site-packages/deepspeed/runtime/eng │
│ ine.py:2588 in load_checkpoint                                               │
│                                                                              │
│   2585 │   │   │   # Prepare for checkpoint load by ensuring all parameters  │
│   2586 │   │   │   self.optimizer.checkpoint_event_prologue()                │
│   2587 │   │                                                                 │
│ ❱ 2588 │   │   load_path, client_states = self._load_checkpoint(load_dir,    │
│   2589 │   │   │   │   │   │   │   │   │   │   │   │   │   │    tag,         │
│   2590 │   │   │   │   │   │   │   │   │   │   │   │   │   │    load_module_ │
│   2591 │   │   │   │   │   │   │   │   │   │   │   │   │   │    load_optimiz │
│                                                                              │
│ /root/anaconda3/envs/glm6b/lib/python3.8/site-packages/deepspeed/runtime/eng │
│ ine.py:2647 in _load_checkpoint                                              │
│                                                                              │
│   2644 │   │   │   │   │   │   │   │   │   │   │   │   num_experts=self.num_ │
│   2645 │   │   │   │   │   │   │   │   │   │   │   │   checkpoint_engine=sel │
│   2646 │   │   if not self.load_universal_checkpoint():                      │
│ ❱ 2647 │   │   │   self.load_module_state_dict(checkpoint=checkpoint,        │
│   2648 │   │   │   │   │   │   │   │   │   │   strict=load_module_strict,    │
│   2649 │   │   │   │   │   │   │   │   │   │   custom_load_fn=custom_load_fn │
│   2650                                                                       │
│                                                                              │
│ /root/anaconda3/envs/glm6b/lib/python3.8/site-packages/deepspeed/runtime/eng │
│ ine.py:2451 in load_module_state_dict                                        │
│                                                                              │
│   2448 │   │   if custom_load_fn:                                            │
│   2449 │   │   │   custom_load_fn(src=module_state_dict, dst=self.module)    │
│   2450 │   │   else:                                                         │
│ ❱ 2451 │   │   │   self.module.load_state_dict(                              │
│   2452 │   │   │   │   module_state_dict,  # TODO                            │
│   2453 │   │   │   │   strict=strict)                                        │
│   2454                                                                       │
│                                                                              │
│ /root/anaconda3/envs/glm6b/lib/python3.8/site-packages/torch/nn/modules/modu │
│ le.py:1671 in load_state_dict                                                │
│                                                                              │
│   1668 │   │   │   │   │   │   ', '.join('"{}"'.format(k) for k in missing_k │
│   1669 │   │                                                                 │
│   1670 │   │   if len(error_msgs) > 0:                                       │
│ ❱ 1671 │   │   │   raise RuntimeError('Error(s) in loading state_dict for {} │
│   1672 │   │   │   │   │   │   │      self.__class__.__name__, "\n\t".join(e │
│   1673 │   │   return _IncompatibleKeys(missing_keys, unexpected_keys)       │
│   1674                                                                       │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        Missing key(s) in state_dict: 
"base_model.model.model.embed_tokens.weight", 
"base_model.model.model.layers.0.self_attn.q_proj.weight", 
"base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight", 
"base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight", 
"base_model.model.model.layers.0.self_attn.k_proj.weight", 
"base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight", 
"base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight", 
"base_model.model.model.layers.0.self_attn.v_proj.weight", 
"base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight", 
"base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight", 
"base_model.model.model.layers.0.self_attn.o_proj.weight", 
ymcui commented 1 year ago

For baichuan-related issues, we suggest asking in the corresponding repo. Our scripts are not guaranteed to be compatible with it.