Closed panpanli521 closed 9 months ago
What is the size of adapter_model.bin under pt_lora_model? You can load it to verify whether the trainable parameter weights were saved correctly.
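As a hedged sketch of such a check (the function name and loop below are illustrative, not the repo's own code), one could load the checkpoint and flag any parameters that were saved with zero elements:

```python
# Sketch: inspect a saved LoRA checkpoint and report parameters that
# were written out empty (shape [0]), the symptom discussed below.
import torch

def check_adapter(path="pt_lora_model/adapter_model.bin"):
    state_dict = torch.load(path, map_location="cpu")
    empty = []
    for name, tensor in state_dict.items():
        print(name, tuple(tensor.shape))
        if tensor.numel() == 0:
            empty.append(name)
    print(f"{len(empty)} of {len(state_dict)} parameters are empty")
    return empty
```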
The adapter_model.bin files on ranks 0 through 4 are all 21 MB, but their md5 hashes differ. Loading the parameters from one of them raises a shape-mismatch error:
size mismatch for base_model.model.model.layers.77.self_attn.q_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
size mismatch for base_model.model.model.layers.77.self_attn.q_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8192, 64]).
size mismatch for base_model.model.model.layers.77.self_attn.k_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
size mismatch for base_model.model.model.layers.77.self_attn.v_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
size mismatch for base_model.model.model.layers.77.self_attn.o_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
size mismatch for base_model.model.model.layers.77.self_attn.o_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8192, 64]).
size mismatch for base_model.model.model.layers.77.mlp.gate_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
size mismatch for base_model.model.model.layers.77.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([28672, 64]).
size mismatch for base_model.model.model.layers.77.mlp.up_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 8192]).
size mismatch for base_model.model.model.layers.77.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([28672, 64]).
size mismatch for base_model.model.model.layers.77.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 28672]).
size mismatch for base_model.model.model.layers.77.mlp.down_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([8192, 64]).
Very strange: many of the lora_A and lora_B parameters in adapter_model.bin are empty. Is this normal?
That is definitely not normal; something must be wrong with how the weights are saved during training.
I figured out why: there are two adapter_model.bin files under output_dir. One is under pt_lora_model/ and is 21 MB; the other sits in the same directory as pt_lora_model and is 3.3 GB. I had been loading the one under pt_lora_model/; after switching to the 3.3 GB file, everything works. I'm not sure whether the 21 MB file under pt_lora_model just holds the initial LoRA parameters?
Judging from how you save the model, you must have modified our open-source code; we suggest debugging it further. The original code only needs the device_map-related arguments at the from_pretrained call changed to support ZeRO-3 training.
The original code raised this error:
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
So I removed both device_map and low_cpu_mem_usage from model = LlamaForCausalLM.from_pretrained. What is the correct way to modify it?
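As a sketch of that change (the model path is a placeholder and the exact kwargs depend on the repo's training script, so treat this as an assumption rather than the project's actual code):

```python
from transformers import LlamaForCausalLM

# Under DeepSpeed ZeRO-3, Transformers manages parameter partitioning
# itself, so device_map and low_cpu_mem_usage must not be passed.
model = LlamaForCausalLM.from_pretrained(
    "path/to/base_model",      # placeholder path
    torch_dtype="auto",
    # device_map="auto",       # removed: incompatible with ZeRO-3
    # low_cpu_mem_usage=True,  # removed: incompatible with ZeRO-3
)
```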
Yes, that's how to modify it.
OK; other than that, I don't think I changed anything else.
Hi, have you noticed that after enabling ZeRO-3, the model vocab size becomes 0?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.
Items that must be checked before submitting
Issue type
Model conversion and merging
Base model
Chinese-LLaMA-2 (7B/13B)
Operating system
Linux
Describe the problem in detail
The ds_config_zero3.json configuration file is as follows:
Merging parameters after training:
Dependencies (required for code-related issues)
Runtime logs or screenshots
The error is as follows:
My understanding is that when LoRA training with ZeRO-3 enabled, the saved LoRA parameters are not complete, so merging using only the parameters from pt_lora_model on rank 0 is probably wrong. I'm not sure whether this understanding is correct; any guidance would be appreciated.
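For reference, DeepSpeed has a ZeRO-3 option that makes rank 0 gather and save the full (unpartitioned) 16-bit weights at save time; whether the repo's save path honors it is an assumption, but the corresponding ds_config_zero3.json fragment would look like:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

Without this option, each rank only holds its own partition, which would be consistent with the empty (shape [0]) tensors seen in the rank-local adapter_model.bin files.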
@ymcui