ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

ValueError: The vocab size of the tokenizer must be 49954, but found 49953 #733

Closed wangjvjie closed 1 year ago

wangjvjie commented 1 year ago

Check the following items before submitting

Issue type

None

Base model

Alpaca-Plus-13B

Operating system

Linux

Detailed description of the problem

Pre-training with the latest code from this repository works without problems. After pre-training, I continue with SFT fine-tuning using the resulting peft_lora model, and the code raises an error at:

```python
if (len(tokenizer)) != 49954:
    raise ValueError(f"The vocab size of the tokenizer must be 49954, but found {len(tokenizer)}.\n"
                     "Please use Chinese Alpaca tokenizer!")
```

This is where the error occurs, because my `len(tokenizer)` is 49953. What could be causing this? I followed the merge_tokenizer steps, and pre-training also ran normally with this tokenizer. If I remove this check, the following error occurs later on:

```
Traceback (most recent call last):
  File "/home/h3c/pythonProject/GPT-main/GPT/run_sft_finetune.py", line 433, in <module>
    main()
  File "/home/h3c/pythonProject/GPT-main/GPT/run_sft_finetune.py", line 357, in main
    model = PeftModel.from_pretrained(model, training_args.peft_path)
  File "/home/h3c/anaconda3/envs/pytorch/lib/python3.9/site-packages/peft/peft_model.py", line 161, in from_pretrained
    model = set_peft_model_state_dict(model, adapters_weights)
  File "/home/h3c/anaconda3/envs/pytorch/lib/python3.9/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
    model.load_state_dict(peft_model_state_dict, strict=False)
  File "/home/h3c/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.embed_tokens.weight: copying a param with shape torch.Size([49953, 5120]) from checkpoint, the shape in current model is torch.Size([49954, 5120]).
    size mismatch for base_model.model.lm_head.weight: copying a param with shape torch.Size([49953, 5120]) from checkpoint, the shape in current model is torch.Size([49954, 5120]).
```

My shell script is:

```bash
lr=1e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=../merged_llama_model/13B-hf/
chinese_tokenizer_path=./merged_tokenizer_hf
dataset_dir=../sft_vicuna/
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1
output_dir=./sft_lora/
peft_model=./pretrain/pt_lora_model
dataset_type="alpaca"

export CUDA_VISIBLE_DEVICES=0,1,2,3

python run_sft_finetune.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --dataset_type ${dataset_type} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval False \
    --seed 662 \
    --fp16 False \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length 4096 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float32 \
    --peft_path ${peft_model} \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
```
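A quick way to see where the two sizes come from is to compare the tokenizer length with the embedding shapes in the base model and in the pre-training adapter. This is only a diagnostic sketch: the paths are the ones from the script above, and the `adapter_model.bin` file name assumes the default name peft uses when saving an adapter.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Tokenizer produced by merge_tokenizer: 49953 here, while the SFT script expects 49954.
tokenizer = LlamaTokenizer.from_pretrained("./merged_tokenizer_hf")
print("len(tokenizer):", len(tokenizer))

# Row count of the base model's input embedding (= its vocabulary size).
model = LlamaForCausalLM.from_pretrained("../merged_llama_model/13B-hf/", torch_dtype=torch.float16)
print("base model embed_tokens rows:", model.get_input_embeddings().weight.shape[0])

# Shapes of the embed_tokens / lm_head tensors saved in the pre-training adapter
# (these are the 49953-row tensors reported in the size-mismatch error above).
adapter_sd = torch.load("./pretrain/pt_lora_model/adapter_model.bin", map_location="cpu")
for name, tensor in adapter_sd.items():
    if "embed_tokens" in name or "lm_head" in name:
        print(name, tuple(tensor.shape))
```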

Dependencies (required for code-related issues)

```
# Paste dependency information here
```

Run logs or screenshots

```
# Paste the run log here
```

wangjvjie commented 1 year ago

Solved it: move the embedding resize to after the PEFT model is loaded, so that the pre-training adapter (whose embed_tokens/lm_head were saved with 49953 rows) is loaded into an embedding of the matching size first.
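For context, the one-token gap reported above matches the difference between the Chinese LLaMA tokenizer (49953 tokens) and the Chinese Alpaca tokenizer (49954 tokens, which additionally contains a pad token). A minimal sketch of that relationship, reusing the merged tokenizer path from the script above and assuming the extra token is a pad token:

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./merged_tokenizer_hf")
print(len(tokenizer))  # 49953 for a Chinese-LLaMA-style merge

# Adding a pad token (as the Chinese Alpaca tokenizer does) brings the count to 49954.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
print(len(tokenizer))  # 49954
```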