使用resume_from_checkpoint参数继续预训练出错

提交前必须检查以下项目

[X] 请确保使用的是仓库最新代码（git pull），一些问题已被解决和修复。
[X] 我已阅读项目文档和FAQ章节并且已在Issue中对问题进行了搜索，没有找到相似问题和解决方案
[X] 第三方插件问题：例如llama.cpp、text-generation-webui等，同时建议到对应的项目中查找解决方案

问题类型

模型训练与精调

基础模型

Alpaca-2-7B

操作系统

Linux

详细描述问题

lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/opt/base_models/chinese-alpaca-2-7b-hf/
chinese_tokenizer_path=/opt/base_models/chinese-alpaca-2-7b-hf/
dataset_dir=/hy-tmp/data/pt
output_dir=/hy-tmp/output/pt
peft_model=path/to/peft/model/dir
data_cache=/hy-tmp/data/pt_tmp
per_device_train_batch_size=4
per_device_eval_batch_size=4
gradient_accumulation_steps=8

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 1 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 1 \
    --save_steps 50 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 96 \
    --block_size 4096 \
    --output_dir ${output_dir} \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False \
    --flash_attn \
    --resume_from_checkpoint /hy-tmp/result/pt/checkpoint-156

依赖情况（代码类问题务必提供）

absl-py==1.4.0
accelerate==0.21.0
aiohttp==3.8.5
aiosignal==1.3.1
anyio==3.7.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
async-lru==2.0.2
async-timeout==4.0.3
attrs==23.1.0
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
bitsandbytes==0.41.0
bleach==6.0.0
boltons @ file:///croot/boltons_1677628692245/work
cachetools==5.3.1
certifi @ file:///croot/certifi_1683875369620/work/certifi
cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
cmake==3.25.0
comm==0.1.3
conda @ file:///croot/conda_1685025188142/work
conda-content-trust @ file:///tmp/abs_5952f1c8-355c-4855-ad2e-538535021ba5h26t22e5/croots/recipe/conda-content-trust_1658126371814/work
conda-package-handling @ file:///croot/conda-package-handling_1666940373510/work
contourpy==1.0.7
cryptography @ file:///croot/cryptography_1677533068310/work
cycler==0.11.0
daal==2023.1.1
daal4py==2023.1.1
datasets==2.14.4
debugpy==1.6.7
decorator==5.1.1
deepspeed==0.10.1
defusedxml==0.7.1
dill==0.3.7
einops==0.6.1
exceptiongroup==1.1.1
executing==1.2.0
fastjsonschema==2.17.1
filelock==3.9.0
flash-attn @ file:///hy-tmp/flash_attn-2.0.8%2Bcu117torch2.0cxx11abiFALSE-cp38-cp38-linux_x86_64.whl
fonttools==4.39.4
fqdn==1.5.1
frozenlist==1.4.0
fsspec==2023.6.0
google-auth==2.19.0
google-auth-oauthlib==1.0.0
grpcio==1.54.2
hjson==3.1.0
huggingface-hub==0.16.4
idna @ file:///croot/idna_1666125576474/work
importlib-metadata==6.6.0
importlib-resources==5.12.0
ipykernel==6.23.1
ipython==8.12.2
ipython-genutils==0.2.0
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.14
jsonpatch @ file:///tmp/build/80754af9/jsonpatch_1615747632069/work
jsonpointer==2.1
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter-lsp==2.2.0
jupyter_client==8.2.0
jupyter_core==5.3.0
jupyter_server==2.6.0
jupyter_server_terminals==0.4.4
jupyterlab==4.0.0
jupyterlab-language-pack-zh-CN==4.0.post0
jupyterlab-pygments==0.2.2
jupyterlab_server==2.22.1
keras==2.12.0
kiwisolver==1.4.4
lit==15.0.7
Markdown==3.4.3
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline==0.1.6
mistune==2.0.5
mpmath==1.2.1
multidict==6.0.4
multiprocess==0.70.15
nbclassic==1.0.0
nbclient==0.8.0
nbconvert==7.4.0
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==3.0
ninja==1.11.1
notebook==6.5.4
notebook_shim==0.2.3
numpy==1.24.1
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-ml-py==11.525.112
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
nvitop==1.1.2
oauthlib==3.2.2
overrides==7.3.1
packaging @ file:///croot/packaging_1678965309396/work
pandas==2.0.2
pandocfilters==1.5.0
parso==0.8.3
peft @ file:///hy-tmp/llama2/training/peft
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.3.0
pkgutil_resolve_name==1.3.10
platformdirs==3.5.1
pluggy @ file:///tmp/build/80754af9/pluggy_1648042571233/work
prometheus-client==0.17.0
prompt-toolkit==3.0.38
protobuf==4.23.2
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==12.0.1
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycosat @ file:///croot/pycosat_1666805502580/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==1.10.12
Pygments==2.15.1
pyOpenSSL @ file:///croot/pyopenssl_1677607685877/work
pyparsing==3.0.9
pyrsistent==0.19.3
PySocks @ file:///tmp/build/80754af9/pysocks_1605305779399/work
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
PyYAML==6.0
pyzmq==25.1.0
regex==2023.8.8
requests @ file:///croot/requests_1682607517574/work
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
ruamel.yaml @ file:///croot/ruamel.yaml_1666304550667/work
ruamel.yaml.clib @ file:///croot/ruamel.yaml.clib_1666302247304/work
safetensors==0.3.2
scikit-learn==1.3.0
scikit-learn-intelex==2023.1.1
scipy==1.10.1
Send2Trash==1.8.2
sentencepiece==0.1.97
six @ file:///tmp/build/80754af9/six_1644875935023/work
sklearn==0.0.post5
sniffio==1.3.0
soupsieve==2.4.1
stack-data==0.6.2
sympy==1.11.1
tbb==2021.9.0
tensorboard==2.13.0
tensorboard-data-server==0.7.0
termcolor==2.3.0
terminado==0.17.1
threadpoolctl==3.1.0
tinycss2==1.2.1
tokenizers==0.13.3
tomli==2.0.1
toolz @ file:///croot/toolz_1667464077321/work
torch==2.0.1
torchaudio==2.0.1+cu117
torchvision==0.15.1+cu117
tornado==6.3.2
tqdm @ file:///croot/tqdm_1679561862951/work
traitlets==5.9.0
transformers==4.31.0
triton==2.0.0
typing_extensions==4.4.0
tzdata==2023.3
uri-template==1.2.0
urllib3==1.25.8
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.2
Werkzeug==2.3.4
xgboost==1.7.5
xxhash==3.3.0
yarl==1.9.2
zipp==3.15.0

运行日志或截图

[INFO|deepspeed.py:381] 2023-08-21 19:48:45,175 >> Attempting to resume from /hy-tmp/result/pt/checkpoint-156
[2023-08-21 19:48:45,176] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /hy-tmp/result/pt/checkpoint-156/global_step156/mp_rank_00_model_states.pt...
[2023-08-21 19:48:51,071] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /hy-tmp/result/pt/checkpoint-156/global_step156/mp_rank_00_model_states.pt.
[2023-08-21 19:48:51,646] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /hy-tmp/result/pt/checkpoint-156/global_step156/mp_rank_00_model_states.pt...
[2023-08-21 19:48:57,443] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /hy-tmp/result/pt/checkpoint-156/global_step156/mp_rank_00_model_states.pt.
Traceback (most recent call last):
  File "run_clm_pt_with_peft.py", line 637, in <module>
    main()
  File "run_clm_pt_with_peft.py", line 605, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1676, in _inner_training_loop
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/transformers/deepspeed.py", line 383, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2648, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2707, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "/usr/local/miniconda3/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2511, in load_module_state_dict
    self.module.load_state_dict(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        Missing key(s) in state_dict: "base_model.model.model.layers.0.self_attn.q_proj.weight", "base_model.model.model.layers.0.self_attn.k_proj.weight", "base_model.model.model.layers.0.self_attn.v_proj.weight", "base_model.model.model.layers.0.self_attn.o_proj.weight", "base_model.model.model.layers.0.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.0.mlp.gate_proj.weight", "base_model.model.model.layers.0.mlp.up_proj.weight", "base_model.model.model.layers.0.mlp.down_proj.weight", "base_model.model.model.layers.0.input_layernorm.weight", "base_model.model.model.layers.0.post_attention_layernorm.weight", "base_model.model.model.layers.1.self_attn.q_proj.weight", "base_model.model.model.layers.1.self_attn.k_proj.weight", "base_model.model.model.layers.1.self_attn.v_proj.weight", "base_model.model.model.layers.1.self_attn.o_proj.weight", "base_model.model.model.layers.1.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.1.mlp.gate_proj.weight", "base_model.model.model.layers.1.mlp.up_proj.weight", "base_model.model.model.layers.1.mlp.down_proj.weight", "base_model.model.model.layers.1.input_layernorm.weight", "base_model.model.model.layers.1.post_attention_layernorm.weight", "base_model.model.model.layers.2.self_attn.q_proj.weight", "base_model.model.model.layers.2.self_attn.k_proj.weight", "base_model.model.model.layers.2.self_attn.v_proj.weight", "base_model.model.model.layers.2.self_attn.o_proj.weight", "base_model.model.model.layers.2.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.2.mlp.gate_proj.weight", "base_model.model.model.layers.2.mlp.up_proj.weight", "base_model.model.model.layers.2.mlp.down_proj.weight", "base_model.model.model.layers.2.input_layernorm.weight", "base_model.model.model.layers.2.post_attention_layernorm.weight", "base_model.model.model.layers.3.self_attn.q_proj.weight", "base_model.model.model.layers.3.self_attn.k_proj.weight", "base_model.model.model.layers.3.self_attn.v_proj.weight", "base_model.model.model.layers.3.self_attn.o_proj.weight", "base_model.model.model.layers.3.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.3.mlp.gate_proj.weight", "base_model.model.model.layers.3.mlp.up_proj.weight", "base_model.model.model.layers.3.mlp.down_proj.weight", "base_model.model.model.layers.3.input_layernorm.weight", "base_model.model.model.layers.3.post_attention_layernorm.weight", "base_model.model.model.layers.4.self_attn.q_proj.weight", "base_model.model.model.layers.4.self_attn.k_proj.weight", "base_model.model.model.layers.4.self_attn.v_proj.weight", "base_model.model.model.layers.4.self_attn.o_proj.weight", "base_model.model.model.layers.4.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.4.mlp.gate_proj.weight", "base_model.model.model.layers.4.mlp.up_proj.weight", "base_model.model.model.layers.4.mlp.down_proj.weight", "base_model.model.model.layers.4.input_layernorm.weight", "base_model.model.model.layers.4.post_attention_layernorm.weight", "base_model.model.model.layers.5.self_attn.q_proj.weight", "base_model.model.model.layers.5.self_attn.k_proj.weight", "base_model.model.model.layers.5.self_attn.v_proj.weight", "base_model.model.model.layers.5.self_attn.o_proj.weight", "base_model.model.model.layers.5.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.5.mlp.gate_proj.weight", "base_model.model.model.layers.5.mlp.up_proj.weight", "base_model.model.model.layers.5.mlp.down_proj.weight", "base_model.model.model.layers.5.input_layernorm.weight", "base_model.model.model.layers.5.post_attention_layernorm.weight", "base_model.model.model.layers.6.self_attn.q_proj.weight", "base_model.model.model.layers.6.self_attn.k_proj.weight", "base_model.model.model.layers.6.self_attn.v_proj.weight", "base_model.model.model.layers.6.self_attn.o_proj.weight", "base_model.model.model.layers.6.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.6.mlp.gate_proj.weight", "base_model.model.model.layers.6.mlp.up_proj.weight", "base_model.model.model.layers.6.mlp.down_proj.weight", "base_model.model.model.layers.6.input_layernorm.weight", "base_model.model.model.layers.6.post_attention_layernorm.weight", "base_model.model.model.layers.7.self_attn.q_proj.weight", "base_model.model.model.layers.7.self_attn.k_proj.weight", "base_model.model.model.layers.7.self_attn.v_proj.weight", "base_model.model.model.layers.7.self_attn.o_proj.weight", "base_model.model.model.layers.7.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.7.mlp.gate_proj.weight", "base_model.model.model.layers.7.mlp.up_proj.weight", "base_model.model.model.layers.7.mlp.down_proj.weight", "base_model.model.model.layers.7.input_layernorm.weight", "base_model.model.model.layers.7.post_attention_layernorm.weight", "base_model.model.model.layers.8.self_attn.q_proj.weight", "base_model.model.model.layers.8.self_attn.k_proj.weight", "base_model.model.model.layers.8.self_attn.v_proj.weight", "base_model.model.model.layers.8.self_attn.o_proj.weight", "base_model.model.model.layers.8.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.8.mlp.gate_proj.weight", "base_model.model.model.layers.8.mlp.up_proj.weight", "base_model.model.model.layers.8.mlp.down_proj.weight", "base_model.model.model.layers.8.input_layernorm.weight", "base_model.model.model.layers.8.post_attention_layernorm.weight", "base_model.model.model.layers.9.self_attn.q_proj.weight", "base_model.model.model.layers.9.self_attn.k_proj.weight", "base_model.model.model.layers.9.self_attn.v_proj.weight", "base_model.model.model.layers.9.self_attn.o_proj.weight", "base_model.model.model.layers.9.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.9.mlp.gate_proj.weight", "base_model.model.model.layers.9.mlp.up_proj.weight", "base_model.model.model.layers.9.mlp.down_proj.weight", "base_model.model.model.layers.9.input_layernorm.weight", "base_model.model.model.layers.9.post_attention_layernorm.weight", "base_model.model.model.layers.10.self_attn.q_proj.weight", "base_model.model.model.layers.10.self_attn.k_proj.weight", "base_model.model.model.layers.10.self_attn.v_proj.weight", "base_model.model.model.layers.10.self_attn.o_proj.weight", "base_model.model.model.layers.10.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.10.mlp.gate_proj.weight", "base_model.model.model.layers.10.mlp.up_proj.weight", "base_model.model.model.layers.10.mlp.down_proj.weight", "base_model.model.model.layers.10.input_layernorm.weight", "base_model.model.model.layers.10.post_attention_layernorm.weight", "base_model.model.model.layers.11.self_attn.q_proj.weight", "base_model.model.model.layers.11.self_attn.k_proj.weight", "base_model.model.model.layers.11.self_attn.v_proj.weight", "base_model.model.model.layers.11.self_attn.o_proj.weight", "base_model.model.model.layers.11.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.11.mlp.gate_proj.weight", "base_model.model.model.layers.11.mlp.up_proj.weight", "base_model.model.model.layers.11.mlp.down_proj.weight", "base_model.model.model.layers.11.input_layernorm.weight", "base_model.model.model.layers.11.post_attention_layernorm.weight", "base_model.model.model.layers.12.self_attn.q_proj.weight", "base_model.model.model.layers.12.self_attn.k_proj.weight", "base_model.model.model.layers.12.self_attn.v_proj.weight", "base_model.model.model.layers.12.self_attn.o_proj.weight", "base_model.model.model.layers.12.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.12.mlp.gate_proj.weight", "base_model.model.model.layers.12.mlp.up_proj.weight", "base_model.model.model.layers.12.mlp.down_proj.weight", "base_model.model.model.layers.12.input_layernorm.weight", "base_model.model.model.layers.12.post_attention_layernorm.weight", "base_model.model.model.layers.13.self_attn.q_proj.weight", "base_model.model.model.layers.13.self_attn.k_proj.weight", "base_model.model.model.layers.13.self_attn.v_proj.weight", "base_model.model.model.layers.13.self_attn.o_proj.weight", "base_model.model.model.layers.13.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.13.mlp.gate_proj.weight", "base_model.model.model.layers.13.mlp.up_proj.weight", "base_model.model.model.layers.13.mlp.down_proj.weight", "base_model.model.model.layers.13.input_layernorm.weight", "base_model.model.model.layers.13.post_attention_layernorm.weight", "base_model.model.model.layers.14.self_attn.q_proj.weight", "base_model.model.model.layers.14.self_attn.k_proj.weight", "base_model.model.model.layers.14.self_attn.v_proj.weight", "base_model.model.model.layers.14.self_attn.o_proj.weight", "base_model.model.model.layers.14.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.14.mlp.gate_proj.weight", "base_model.model.model.layers.14.mlp.up_proj.weight", "base_model.model.model.layers.14.mlp.down_proj.weight", "base_model.model.model.layers.14.input_layernorm.weight", "base_model.model.model.layers.14.post_attention_layernorm.weight", "base_model.model.model.layers.15.self_attn.q_proj.weight", "base_model.model.model.layers.15.self_attn.k_proj.weight", "base_model.model.model.layers.15.self_attn.v_proj.weight", "base_model.model.model.layers.15.self_attn.o_proj.weight", "base_model.model.model.layers.15.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.15.mlp.gate_proj.weight", "base_model.model.model.layers.15.mlp.up_proj.weight", "base_model.model.model.layers.15.mlp.down_proj.weight", "base_model.model.model.layers.15.input_layernorm.weight", "base_model.model.model.layers.15.post_attention_layernorm.weight", "base_model.model.model.layers.16.self_attn.q_proj.weight", "base_model.model.model.layers.16.self_attn.k_proj.weight", "base_model.model.model.layers.16.self_attn.v_proj.weight", "base_model.model.model.layers.16.self_attn.o_proj.weight", "base_model.model.model.layers.16.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.16.mlp.gate_proj.weight", "base_model.model.model.layers.16.mlp.up_proj.weight", "base_model.model.model.layers.16.mlp.down_proj.weight", "base_model.model.model.layers.16.input_layernorm.weight", "base_model.model.model.layers.16.post_attention_layernorm.weight", "base_model.model.model.layers.17.self_attn.q_proj.weight", "base_model.model.model.layers.17.self_attn.k_proj.weight", "base_model.model.model.layers.17.self_attn.v_proj.weight", "base_model.model.model.layers.17.self_attn.o_proj.weight", "base_model.model.model.layers.17.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.17.mlp.gate_proj.weight", "base_model.model.model.layers.17.mlp.up_proj.weight", "base_model.model.model.layers.17.mlp.down_proj.weight", "base_model.model.model.layers.17.input_layernorm.weight", "base_model.model.model.layers.17.post_attention_layernorm.weight", "base_model.model.model.layers.18.self_attn.q_proj.weight", "base_model.model.model.layers.18.self_attn.k_proj.weight", "base_model.model.model.layers.18.self_attn.v_proj.weight", "base_model.model.model.layers.18.self_attn.o_proj.weight", "base_model.model.model.layers.18.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.18.mlp.gate_proj.weight", "base_model.model.model.layers.18.mlp.up_proj.weight", "base_model.model.model.layers.18.mlp.down_proj.weight", "base_model.model.model.layers.18.input_layernorm.weight", "base_model.model.model.layers.18.post_attention_layernorm.weight", "base_model.model.model.layers.19.self_attn.q_proj.weight", "base_model.model.model.layers.19.self_attn.k_proj.weight", "base_model.model.model.layers.19.self_attn.v_proj.weight", "base_model.model.model.layers.19.self_attn.o_proj.weight", "base_model.model.model.layers.19.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.19.mlp.gate_proj.weight", "base_model.model.model.layers.19.mlp.up_proj.weight", "base_model.model.model.layers.19.mlp.down_proj.weight", "base_model.model.model.layers.19.input_layernorm.weight", "base_model.model.model.layers.19.post_attention_layernorm.weight", "base_model.model.model.layers.20.self_attn.q_proj.weight", "base_model.model.model.layers.20.self_attn.k_proj.weight", "base_model.model.model.layers.20.self_attn.v_proj.weight", "base_model.model.model.layers.20.self_attn.o_proj.weight", "base_model.model.model.layers.20.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.20.mlp.gate_proj.weight", "base_model.model.model.layers.20.mlp.up_proj.weight", "base_model.model.model.layers.20.mlp.down_proj.weight", "base_model.model.model.layers.20.input_layernorm.weight", "base_model.model.model.layers.20.post_attention_layernorm.weight", "base_model.model.model.layers.21.self_attn.q_proj.weight", "base_model.model.model.layers.21.self_attn.k_proj.weight", "base_model.model.model.layers.21.self_attn.v_proj.weight", "base_model.model.model.layers.21.self_attn.o_proj.weight", "base_model.model.model.layers.21.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.21.mlp.gate_proj.weight", "base_model.model.model.layers.21.mlp.up_proj.weight", "base_model.model.model.layers.21.mlp.down_proj.weight", "base_model.model.model.layers.21.input_layernorm.weight", "base_model.model.model.layers.21.post_attention_layernorm.weight", "base_model.model.model.layers.22.self_attn.q_proj.weight", "base_model.model.model.layers.22.self_attn.k_proj.weight", "base_model.model.model.layers.22.self_attn.v_proj.weight", "base_model.model.model.layers.22.self_attn.o_proj.weight", "base_model.model.model.layers.22.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.22.mlp.gate_proj.weight", "base_model.model.model.layers.22.mlp.up_proj.weight", "base_model.model.model.layers.22.mlp.down_proj.weight", "base_model.model.model.layers.22.input_layernorm.weight", "base_model.model.model.layers.22.post_attention_layernorm.weight", "base_model.model.model.layers.23.self_attn.q_proj.weight", "base_model.model.model.layers.23.self_attn.k_proj.weight", "base_model.model.model.layers.23.self_attn.v_proj.weight", "base_model.model.model.layers.23.self_attn.o_proj.weight", "base_model.model.model.layers.23.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.23.mlp.gate_proj.weight", "base_model.model.model.layers.23.mlp.up_proj.weight", "base_model.model.model.layers.23.mlp.down_proj.weight", "base_model.model.model.layers.23.input_layernorm.weight", "base_model.model.model.layers.23.post_attention_layernorm.weight", "base_model.model.model.layers.24.self_attn.q_proj.weight", "base_model.model.model.layers.24.self_attn.k_proj.weight", "base_model.model.model.layers.24.self_attn.v_proj.weight", "base_model.model.model.layers.24.self_attn.o_proj.weight", "base_model.model.model.layers.24.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.24.mlp.gate_proj.weight", "base_model.model.model.layers.24.mlp.up_proj.weight", "base_model.model.model.layers.24.mlp.down_proj.weight", "base_model.model.model.layers.24.input_layernorm.weight", "base_model.model.model.layers.24.post_attention_layernorm.weight", "base_model.model.model.layers.25.self_attn.q_proj.weight", "base_model.model.model.layers.25.self_attn.k_proj.weight", "base_model.model.model.layers.25.self_attn.v_proj.weight", "base_model.model.model.layers.25.self_attn.o_proj.weight", "base_model.model.model.layers.25.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.25.mlp.gate_proj.weight", "base_model.model.model.layers.25.mlp.up_proj.weight", "base_model.model.model.layers.25.mlp.down_proj.weight", "base_model.model.model.layers.25.input_layernorm.weight", "base_model.model.model.layers.25.post_attention_layernorm.weight", "base_model.model.model.layers.26.self_attn.q_proj.weight", "base_model.model.model.layers.26.self_attn.k_proj.weight", "base_model.model.model.layers.26.self_attn.v_proj.weight", "base_model.model.model.layers.26.self_attn.o_proj.weight", "base_model.model.model.layers.26.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.26.mlp.gate_proj.weight", "base_model.model.model.layers.26.mlp.up_proj.weight", "base_model.model.model.layers.26.mlp.down_proj.weight", "base_model.model.model.layers.26.input_layernorm.weight", "base_model.model.model.layers.26.post_attention_layernorm.weight", "base_model.model.model.layers.27.self_attn.q_proj.weight", "base_model.model.model.layers.27.self_attn.k_proj.weight", "base_model.model.model.layers.27.self_attn.v_proj.weight", "base_model.model.model.layers.27.self_attn.o_proj.weight", "base_model.model.model.layers.27.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.27.mlp.gate_proj.weight", "base_model.model.model.layers.27.mlp.up_proj.weight", "base_model.model.model.layers.27.mlp.down_proj.weight", "base_model.model.model.layers.27.input_layernorm.weight", "base_model.model.model.layers.27.post_attention_layernorm.weight", "base_model.model.model.layers.28.self_attn.q_proj.weight", "base_model.model.model.layers.28.self_attn.k_proj.weight", "base_model.model.model.layers.28.self_attn.v_proj.weight", "base_model.model.model.layers.28.self_attn.o_proj.weight", "base_model.model.model.layers.28.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.28.mlp.gate_proj.weight", "base_model.model.model.layers.28.mlp.up_proj.weight", "base_model.model.model.layers.28.mlp.down_proj.weight", "base_model.model.model.layers.28.input_layernorm.weight", "base_model.model.model.layers.28.post_attention_layernorm.weight", "base_model.model.model.layers.29.self_attn.q_proj.weight", "base_model.model.model.layers.29.self_attn.k_proj.weight", "base_model.model.model.layers.29.self_attn.v_proj.weight", "base_model.model.model.layers.29.self_attn.o_proj.weight", "base_model.model.model.layers.29.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.29.mlp.gate_proj.weight", "base_model.model.model.layers.29.mlp.up_proj.weight", "base_model.model.model.layers.29.mlp.down_proj.weight", "base_model.model.model.layers.29.input_layernorm.weight", "base_model.model.model.layers.29.post_attention_layernorm.weight", "base_model.model.model.layers.30.self_attn.q_proj.weight", "base_model.model.model.layers.30.self_attn.k_proj.weight", "base_model.model.model.layers.30.self_attn.v_proj.weight", "base_model.model.model.layers.30.self_attn.o_proj.weight", "base_model.model.model.layers.30.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.30.mlp.gate_proj.weight", "base_model.model.model.layers.30.mlp.up_proj.weight", "base_model.model.model.layers.30.mlp.down_proj.weight", "base_model.model.model.layers.30.input_layernorm.weight", "base_model.model.model.layers.30.post_attention_layernorm.weight", "base_model.model.model.layers.31.self_attn.q_proj.weight", "base_model.model.model.layers.31.self_attn.k_proj.weight", "base_model.model.model.layers.31.self_attn.v_proj.weight", "base_model.model.model.layers.31.self_attn.o_proj.weight", "base_model.model.model.layers.31.self_attn.rotary_emb.inv_freq", "base_model.model.model.layers.31.mlp.gate_proj.weight", "base_model.model.model.layers.31.mlp.up_proj.weight", "base_model.model.model.layers.31.mlp.down_proj.weight", "base_model.model.model.layers.31.input_layernorm.weight", "base_model.model.model.layers.31.post_attention_layernorm.weight", "base_model.model.model.norm.weight". 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4986) of binary: /usr/local/miniconda3/bin/python
Traceback (most recent call last):
  File "/usr/local/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-21_19:49:02
  host      : I1496abdf480080174e
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4986)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

ymcui / Chinese-LLaMA-Alpaca-2

使用resume_from_checkpoint参数继续预训练出错 #161

提交前必须检查以下项目

问题类型

基础模型

操作系统

详细描述问题

依赖情况（代码类问题务必提供）

运行日志或截图