ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

ValueError: The vocab size of the tokenizer must be 49954, but found 49953 #733

Closed wangjvjie closed 1 year ago

wangjvjie commented 1 year ago

Check the following items before submitting

Issue type

None

Base model

Alpaca-Plus-13B

Operating system

Linux

Detailed description of the problem

Pre-training with the latest code from this repository works without problems. After pre-training, I continue with SFT fine-tuning using the resulting peft_lora model, and the code raises an error at:

```python
if (len(tokenizer)) != 49954:
    raise ValueError(f"The vocab size of the tokenizer must be 49954, but found {len(tokenizer)}.\n"
                     "Please use Chinese Alpaca tokenizer!")
```

This is where the error occurs, because my `len(tokenizer)` is 49953. What could be causing this? I followed the merge_tokenizer steps, and pre-training also ran normally with this tokenizer. If I remove this check, the following error occurs later on:

```
Traceback (most recent call last):
  File "/home/h3c/pythonProject/GPT-main/GPT/run_sft_finetune.py", line 433, in <module>
    main()
  File "/home/h3c/pythonProject/GPT-main/GPT/run_sft_finetune.py", line 357, in main
    model = PeftModel.from_pretrained(model, training_args.peft_path)
  File "/home/h3c/anaconda3/envs/pytorch/lib/python3.9/site-packages/peft/peft_model.py", line 161, in from_pretrained
    model = set_peft_model_state_dict(model, adapters_weights)
  File "/home/h3c/anaconda3/envs/pytorch/lib/python3.9/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
    model.load_state_dict(peft_model_state_dict, strict=False)
  File "/home/h3c/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([49953, 5120]) from checkpoint, the shape in current model is torch.Size([49954, 5120]).
    size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([49953, 5120]) from checkpoint, the shape in current model is torch.Size([49954, 5120]).
```

My shell script is:

```bash
lr=1e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=../merged_llama_model/13B-hf/
chinese_tokenizer_path=./merged_tokenizer_hf
dataset_dir=../sft_vicuna/
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1
output_dir=./sft_lora/
peft_model=./pretrain/pt_lora_model
dataset_type="alpaca"

export CUDA_VISIBLE_DEVICES=0,1,2,3

python run_sft_finetune.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --dataset_type ${dataset_type} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval False \
    --seed 662 \
    --fp16 False \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length 4096 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float32 \
    --peft_path ${peft_model} \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
```
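A quick way to see where the two sizes come from is to compare the tokenizer length with the embedding shapes in the base model and in the pre-training adapter. This is only a diagnostic sketch: the paths are the ones from the script above, and the `adapter_model.bin` file name assumes the default name peft uses when saving an adapter.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Tokenizer produced by merge_tokenizer: 49953 here, while the SFT script expects 49954.
tokenizer = LlamaTokenizer.from_pretrained("./merged_tokenizer_hf")
print("len(tokenizer):", len(tokenizer))

# Row count of the base model's input embedding (= its vocabulary size).
model = LlamaForCausalLM.from_pretrained("../merged_llama_model/13B-hf/", torch_dtype=torch.float16)
print("base model embed_tokens rows:", model.get_input_embeddings().weight.shape[0])

# Shapes of the embed_tokens / lm_head tensors saved in the pre-training adapter
# (these are the 49953-row tensors reported in the size-mismatch error above).
adapter_sd = torch.load("./pretrain/pt_lora_model/adapter_model.bin", map_location="cpu")
for name, tensor in adapter_sd.items():
    if "embed_tokens" in name or "lm_head" in name:
        print(name, tuple(tensor.shape))
```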

Dependencies (required for code-related issues)

```
# Paste dependency information here
```

Run logs or screenshots

```
# Paste the run log here
```

wangjvjie commented 1 year ago

Solved it: move the embedding resize to after the PEFT model is loaded, so that the pre-training adapter (whose embed_tokens/lm_head were saved with 49953 rows) is loaded into an embedding of the matching size first.
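For context, the one-token gap reported above matches the difference between the Chinese LLaMA tokenizer (49953 tokens) and the Chinese Alpaca tokenizer (49954 tokens, which additionally contains a pad token). A minimal sketch of that relationship, reusing the merged tokenizer path from the script above and assuming the extra token is a pad token:

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./merged_tokenizer_hf")
print(len(tokenizer))  # 49953 for a Chinese-LLaMA-style merge

# Adding a pad token (as the Chinese Alpaca tokenizer does) brings the count to 49954.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
print(len(tokenizer))  # 49954
```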