Closed · smartparrot closed this issue 1 year ago
Issue type
Model training and fine-tuning

Base model
LLaMA-7B

Operating system
Linux

Detailed description of the problem
I ran run_pt.sh for continued (incremental) pre-training. After the run was interrupted, re-running the script to resume from the checkpoint always fails. What could be causing this? The base model is baichuan.
The run_pt.sh script used:

```bash
lr=2e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
# modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05
pretrained_model="/root/LLM/baichuan-7B"                   # path/to/hf/llama/dir
chinese_tokenizer_path="/root/LLM/baichuan-7B"             # path/to/chinese/llama/tokenizer/dir
dataset_dir="/root/LLM/Chinese-LLaMA-Alpaca/data/mydata"   # path/to/pt/data/dir
data_cache=temp_data_cache_dir
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
output_dir=output_dir
deepspeed_config_file=ds_zero2_no_offload.json

echo $RANDOM
# CUDA_VISIBLE_DEVICES and nproc_per_node must match
# Resuming after the run was interrupted raises an error; possibly related to LoRA
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nnodes 1 --nproc_per_node 3 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.01 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --evaluation_strategy "epoch" \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 100 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 10 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --ddp_find_unused_parameters False
```
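For context, the `--lora_rank`, `--lora_alpha`, `--trainable`, and `--lora_dropout` flags above roughly correspond to a `peft.LoraConfig` like the one below. This is only an illustrative sketch assuming the peft 0.3.0 API pinned in the dependency list; run_clm_pt_with_peft.py builds its own config internally and may differ in details:

```python
from peft import LoraConfig, TaskType

# Rough equivalent of the LoRA-related flags passed to run_clm_pt_with_peft.py.
# Only the adapter matrices on these target modules are trainable; the base
# model weights stay frozen, so checkpoints may contain only a small subset
# of the full model's parameters.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,              # --lora_rank
    lora_alpha=32,    # --lora_alpha
    lora_dropout=0.05,  # --lora_dropout
    target_modules=[  # --trainable
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    ],
)
```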
Dependencies (must be provided for code-related issues)

absl-py 1.4.0 accelerate 0.19.0 aiofiles 23.1.0 aiohttp 3.8.4 aiosignal 1.3.1 altair 4.2.2 anyio 3.7.0 async-timeout 4.0.2 attrs 23.1.0 backports.zoneinfo 0.2.1 beautifulsoup4 4.12.2 bitsandbytes 0.37.1 blinker 1.6.2 bs4 0.0.1 cachetools 5.3.0 certifi 2023.5.7 charset-normalizer 3.1.0 click 8.1.3 cmake 3.26.3 contourpy 1.0.7 cpm-kernels 1.0.11 cycler 0.11.0 datasets 2.10.1 decorator 5.1.1 deepspeed 0.9.2 diffusers 0.16.1 dill 0.3.6 entrypoints 0.4 exceptiongroup 1.1.1 fastapi 0.97.0 ffmpy 0.3.0 filelock 3.12.0 fonttools 4.40.0 frozenlist 1.3.3 fsspec 2023.5.0 gitdb 4.0.10 GitPython 3.1.31 google-auth 2.18.1 google-auth-oauthlib 1.0.0 gradio 3.34.0 gradio_client 0.2.6 grpcio 1.54.2 h11 0.14.0 hjson 3.1.0 httpcore 0.17.2 httpx 0.24.1 huggingface-hub 0.14.1 icetk 0.0.4 idna 3.4 importlib-metadata 6.6.0 importlib-resources 5.12.0 iniconfig 2.0.0 jieba 0.42.1 Jinja2 3.1.2 joblib 1.2.0 jsonschema 4.17.3 kiwisolver 1.4.4 latex2mathml 3.76.0 linkify-it-py 2.0.2 lit 16.0.5 Markdown 3.4.3 markdown-it-py 2.2.0 MarkupSafe 2.1.2 matplotlib 3.7.1 mdit-py-plugins 0.3.3 mdtex2html 1.2.0 mdurl 0.1.2 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.14 networkx 3.1 ninja 1.11.1 numpy 1.24.3 nvidia-cublas-cu11 11.10.3.66 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cudnn-cu11 8.5.0.96 nvidia-cufft-cu11 10.9.0.58 nvidia-curand-cu11 10.2.10.91 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusparse-cu11 11.7.4.91 nvidia-nccl-cu11 2.14.3 nvidia-nvtx-cu11 11.7.91 oauthlib 3.2.2 orjson 3.9.1 packaging 23.1 pandas 2.0.1 peft 0.3.0 Pillow 9.5.0 pip 23.0.1 pkgutil_resolve_name 1.3.10 pluggy 1.0.0 protobuf 3.20.0 psutil 5.9.5 py-cpuinfo 9.0.0 pyarrow 12.0.0 pyasn1 0.5.0 pyasn1-modules 0.3.0 pydantic 1.10.9 pydeck 0.8.1b0 pydub 0.25.1 Pygments 2.15.1 Pympler 1.0.1 pyparsing 3.0.9 pyrsistent 0.19.3 pytest 7.3.1 python-dateutil 2.8.2 python-multipart 0.0.6 pytz 2023.3 PyYAML 6.0 regex 2023.5.5 requests 2.30.0 requests-oauthlib 1.3.1 responses 0.18.0 rich 13.4.2 rouge-chinese 1.0.3 rsa 4.9 safetensors 0.3.1 scikit-learn 1.2.2 scipy 1.10.1 semantic-version 2.10.0 sentencepiece 0.1.99 setuptools 67.8.0 six 1.16.0 sklearn 0.0 smmap 5.0.0 sniffio 1.3.0 soupsieve 2.4.1 starlette 0.27.0 streamlit 1.22.0 streamlit-chat 0.0.2.2 sympy 1.12 tenacity 8.2.2 tensorboard 2.13.0 tensorboard-data-server 0.7.0 threadpoolctl 3.1.0 tokenizers 0.13.3 toml 0.10.2 tomli 2.0.1 toolz 0.12.0 torch 1.13.1 torchvision 0.15.2 tornado 6.3.2 tqdm 4.65.0 transformers 4.29.2 triton 2.0.0 typing_extensions 4.5.0 tzdata 2023.3 tzlocal 5.0.1 uc-micro-py 1.0.2 urllib3 1.26.15 uvicorn 0.22.0 validators 0.20.0 watchdog 3.0.0 websockets 11.0.3 Werkzeug 2.3.4 wheel 0.38.4 xxhash 3.2.0 yarl 1.9.2 zipp 3.15.0
Run logs or screenshots

```
[INFO|deepspeed.py:390] 2023-07-12 17:35:55,173 >> Attempting to resume from output_dir/checkpoint-890
[2023-07-12 17:35:55,174] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt...
[2023-07-12 17:36:02,326] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt.
[2023-07-12 17:36:02,964] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt...
[2023-07-12 17:36:09,887] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt.

Traceback (most recent call last):
  File "/root/LLM/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py", line 651, in <module>
    main()
  File "/root/LLM/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py", line 619, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint
  File "/root/anaconda3/envs/glm6b/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/glm6b/lib/python3.8/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/root/anaconda3/envs/glm6b/lib/python3.8/site-packages/transformers/deepspeed.py", line 392, in deepspeed_init
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/root/anaconda3/envs/glm6b/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2588, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/root/anaconda3/envs/glm6b/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2647, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "/root/anaconda3/envs/glm6b/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2451, in load_module_state_dict
    self.module.load_state_dict(
  File "/root/anaconda3/envs/glm6b/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight",
    "base_model.model.model.layers.0.self_attn.k_proj.weight",
    "base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight",
    "base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight",
    "base_model.model.model.layers.0.self_attn.v_proj.weight",
    "base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight",
    "base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight",
    "base_model.model.model.layers.0.self_attn.o_proj.weight",
```
For baichuan-related problems, please ask in the corresponding repo. Our scripts are not guaranteed to be compatible with it.
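For readers hitting the same missing-key error (regardless of the base model): one common explanation in similar reports is that the saved DeepSpeed module state contains only a subset of parameters (e.g. just the LoRA weights), so the strict load performed during resume cannot succeed. This is not confirmed by the maintainers here; a quick diagnostic sketch, assuming the checkpoint layout shown in the log above and the usual ZeRO-2 convention of nesting weights under a "module" key, would be:

```python
import torch

# Inspect which tensors the DeepSpeed module checkpoint actually holds.
# Path taken from the resume log above; adjust the checkpoint tag if yours differs.
ckpt = torch.load(
    "output_dir/checkpoint-890/global_step890/mp_rank_00_model_states.pt",
    map_location="cpu",
)
# ZeRO-2 checkpoints typically nest the model weights under "module";
# fall back to the top-level object if that assumption does not hold.
state = ckpt["module"] if isinstance(ckpt, dict) and "module" in ckpt else ckpt
print(f"{len(state)} tensors in the saved module state")
for name in list(state)[:20]:
    print(name)
```

If only lora_A/lora_B (and possibly embed_tokens/lm_head) entries show up, the strict full-model load attempted by resume_from_checkpoint will always fail with the error above; an unofficial fallback is to rebuild the base model, load the saved adapter with peft (e.g. `PeftModel.from_pretrained(base_model, <adapter_dir>)`), and start a fresh run, accepting that optimizer and scheduler state are lost.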