ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 large language models, phase-2 project, with 64K long-context models
Apache License 2.0

Merging LoRA weights fails with: The vocab size of the tokenizer 55296 does not match the vocab size of the LoRA weight 0! #392

Closed · panpanli521 closed this issue 9 months ago

panpanli521 commented 10 months ago

Required checks before submitting

Issue type

Model conversion and merging

Base model

Chinese-LLaMA-2 (7B/13B)

Operating system

Linux

Detailed description of the problem

The LoRA training parameters are configured as follows:
pretrained_model=Llama-2-70b-hf
chinese_tokenizer_path=chinese-llama-2-13b/
dataset_dir=/datasets/WuDaoCorpus2.0_base_200G
data_cache=/datasets/WuDaoCorpus2.0_base_200G_cache
per_device_train_batch_size=2
gradient_accumulation_steps=4
block_size=4096
output_dir=/Llama-2-70b-hf_saved/
deepspeed_config_file=ds_config_zero3.json

WORLD_SIZE=4
GPUS_PER_NODE=8
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $WORLD_SIZE --node_rank $RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 2 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size ${block_size} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --load_in_kbits 16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False

The ds_config_zero3.json file is configured as follows:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "fp16_opt_level": "O2"
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "last_batch_iteration": -1,
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

After training, the LoRA weights are merged with:

python scripts/merge_llama2_with_chinese_lora_low_mem.py \
    --base_model Llama-2-70b-hf \
    --lora_model /Llama-2-70b-hf_saved/rank_0/checkpoint-200/pt_lora_model/ \
    --output_type huggingface \
    --output_dir Llama-2-70b-hf_merged_lora
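
For reference, a quick way to sanity-check a LoRA checkpoint against the tokenizer before merging (a minimal sketch, not part of the repository's scripts; the paths are taken from the commands above, and the embed_tokens key lookup is an assumption about how PEFT names modules_to_save weights):

# check_lora_vocab.py -- minimal sketch
import torch
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("chinese-llama-2-13b/")
state_dict = torch.load(
    "/Llama-2-70b-hf_saved/rank_0/checkpoint-200/pt_lora_model/adapter_model.bin",
    map_location="cpu",
)

# Look for the saved embedding weight (modules_to_save usually includes embed_tokens).
for key in (k for k in state_dict if "embed_tokens" in k and k.endswith("weight")):
    print(key, tuple(state_dict[key].shape))

# The merge script expects the embedding row count to equal len(tokenizer), i.e. 55296.
print("tokenizer vocab size:", len(tokenizer))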

Dependencies (required for code-related issues)

bitsandbytes            0.41.0
peft                    0.6.0.dev0
pytorch-triton          2.1.0+440fd1bf20
sentence-transformers   2.2.2
sentencepiece           0.1.97
torch                   2.1.0.dev20230621+cu117
torchaudio              2.0.2
torchdata               0.6.1
torchelastic            0.2.2
torchtext               0.15.2
torchvision             0.15.2
transformers            4.35.0

Run logs or screenshots

The error is as follows:

Chinese-LLaMA-Alpaca-2-main/scripts/merge_llama2_with_chinese_lora_low_mem.py", line 245, in <module>
    assert lora_vocab_size==len(tokenizer), \
AssertionError: The vocab size of the tokenizer 55296 does not match the vocab size of the LoRA weight 0!

My understanding is that with ZeRO-3 enabled for LoRA training, the saved LoRA weights are not complete, so merging with only the weights under pt_lora_model from rank 0 is probably wrong. I am not sure whether this understanding is correct; any guidance would be appreciated.

@ymcui

iMountTai commented 10 months ago

How large is adapter_model.bin under pt_lora_model? You can load it to verify that the trainable weights were saved correctly.
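
For example, a minimal sketch of such a check (the checkpoint path is taken from the merge command above):

# Minimal sketch: load adapter_model.bin and report any empty (zero-element) tensors.
import torch

state_dict = torch.load(
    "/Llama-2-70b-hf_saved/rank_0/checkpoint-200/pt_lora_model/adapter_model.bin",
    map_location="cpu",
)
empty = [k for k, v in state_dict.items() if v.numel() == 0]
print(f"{len(state_dict)} tensors, {len(empty)} empty")
for key in empty[:10]:
    print("empty:", key)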

panpanli521 commented 10 months ago

The adapter_model.bin files for rank0~4 are all 21 MB, but their md5 checksums differ. Loading one of them raises shape-mismatch errors:

size mismatch for base_model.model.model.layers.77.self_attn.q_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
    size mismatch for base_model.model.model.layers.77.self_attn.q_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8192, 64]).
    size mismatch for base_model.model.model.layers.77.self_attn.k_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
    size mismatch for base_model.model.model.layers.77.self_attn.v_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
    size mismatch for base_model.model.model.layers.77.self_attn.o_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
    size mismatch for base_model.model.model.layers.77.self_attn.o_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8192, 64]).
    size mismatch for base_model.model.model.layers.77.mlp.gate_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
    size mismatch for base_model.model.model.layers.77.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([28672, 64]).
    size mismatch for base_model.model.model.layers.77.mlp.up_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
    size mismatch for base_model.model.model.layers.77.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([28672, 64]).
    size mismatch for base_model.model.model.layers.77.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 28672]).
    size mismatch for base_model.model.model.layers.77.mlp.down_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8192, 64]).

Strangely, many of the lora_A and lora_B parameters in adapter_model.bin are empty. Is this normal?

iMountTai commented 10 months ago

That is definitely not normal; something must be going wrong when the weights are saved during training.

panpanli521 commented 10 months ago

That is definitely not normal; something must be going wrong when the weights are saved during training.

I figured out why: there are two adapter_model.bin files under output_dir, one under pt_lora_model/ that is 21 MB, and one at the same level as pt_lora_model that is 3.3 GB. I had been loading the one under pt_lora_model/; after switching to the 3.3 GB file, everything works. I am not sure whether the 21 MB file under pt_lora_model just holds the initial LoRA parameters?
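
A quick way to see which of the two files actually holds non-empty LoRA weights (a minimal sketch; the relative paths are placeholders for the two locations described above):

# Minimal sketch: compare the two adapter_model.bin files.
import torch

for path in [
    "checkpoint-200/pt_lora_model/adapter_model.bin",  # the 21 MB file
    "checkpoint-200/adapter_model.bin",                # the 3.3 GB file
]:
    sd = torch.load(path, map_location="cpu")
    empty = sum(1 for v in sd.values() if v.numel() == 0)
    print(f"{path}: {len(sd)} tensors, {empty} empty")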

iMountTai commented 10 months ago

Judging from how you save the model, you seem to have modified our open-source code. I suggest debugging it further; the original code only needs the device_map-related arguments in from_pretrained changed to support ZeRO-3 training.

panpanli521 commented 10 months ago

The original code raised this error:

ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.

So I removed both device_map and low_cpu_mem_usage from model = LlamaForCausalLM.from_pretrained. What is the correct way to modify it?

iMountTai commented 10 months ago

Yes, that is the correct modification.
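
For readers hitting the same ValueError, a minimal sketch of the change being discussed (not the repository's exact code; it assumes is_deepspeed_zero3_enabled is importable from transformers.integrations, which holds for recent transformers versions):

# Minimal sketch: only pass device_map / low_cpu_mem_usage when ZeRO-3 is not active.
import torch
from transformers import LlamaForCausalLM
from transformers.integrations import is_deepspeed_zero3_enabled

kwargs = dict(torch_dtype=torch.float16)
if not is_deepspeed_zero3_enabled():
    # These options conflict with ZeRO-3 parameter partitioning.
    kwargs.update(low_cpu_mem_usage=True, device_map="auto")

model = LlamaForCausalLM.from_pretrained("Llama-2-70b-hf", **kwargs)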

panpanli521 commented 10 months ago

OK. I don't think I changed anything else.

q5756578 commented 9 months ago

Hi, have you noticed that after enabling ZeRO-3 the model vocab size is 0?
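
Under ZeRO-3 the parameters are partitioned across ranks, so inspecting them without gathering typically shows torch.Size([0]), which is consistent with the empty tensors above. A minimal sketch for reading the true embedding shape inside a ZeRO-3 run (it assumes `model` is the ZeRO-3-initialized model from training):

# Minimal sketch: gather a ZeRO-3 partitioned embedding weight to see its real shape.
import deepspeed

def report_embedding_shape(model):
    weight = model.get_input_embeddings().weight
    print("ungathered shape:", tuple(weight.shape))  # torch.Size([0]) when partitioned
    with deepspeed.zero.GatheredParameters([weight], modifier_rank=None):
        print("gathered shape:", tuple(weight.shape))  # e.g. (55296, hidden_size)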

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 9 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.