ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 LLMs (phase-2 project) with 64K long-context models
Apache License 2.0

What is the exact workflow for vocabulary expansion plus continued pre-training, and which parts need to be modified? #543

Closed Shajiu closed 3 months ago

Shajiu commented 3 months ago

Required checklist before submitting

Issue type

Model training and fine-tuning

Base model

Others

Operating system

Linux

Detailed description of the problem

# Paste the code you ran here (inside this code block)

I want to do continued pre-training on LLaMA-2 by expanding its vocabulary with Manchu tokens. How should I go about this? Currently, I first train a Manchu tokenizer with SentencePiece, then load that model and merge it with the LLaMA vocabulary, and then use the result to replace the vocabulary in LLaMA. When I subsequently run continued pre-training, I get an error.
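For reference, the first step described above (training a new SentencePiece model on the new-language corpus) typically looks like the minimal sketch below; the corpus path, vocabulary size, and other options are hypothetical placeholders, not values taken from this issue:

import sentencepiece as spm

# Train a SentencePiece model on the new-language corpus (paths and sizes are placeholders).
spm.SentencePieceTrainer.train(
    input="manchu_corpus.txt",   # plain-text corpus, one sentence per line
    model_prefix="gogpt",        # produces gogpt.model and gogpt.vocab
    vocab_size=20000,            # size of the new vocabulary to learn
    model_type="bpe",            # match LLaMA's BPE-style SentencePiece tokenizer
    character_coverage=0.9995,
)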

Dependencies (required for code-related issues)

# Paste your dependency information here (inside this code block)

Run logs or screenshots

# Paste your run logs here (inside this code block)
ymcui commented 3 months ago

It is not a replacement; you merge the vocabularies. I suggest reading our paper. The first-generation project also documents the relevant details: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/训练细节

Shajiu commented 3 months ago

Professor Cui, here is what I did. First, I took a batch of Chinese data and trained with SentencePiece to obtain gogpt.model and gogpt.vocab. Then I loaded the LLaMA-2 vocabulary and gogpt.model separately with SentencePiece, appended the pieces from gogpt.model that are not in the LLaMA-2 vocabulary to the end of LLaMA-2's vocabulary, and merged them, producing special_tokens_map.json, tokenizer_config.json, and tokenizer.model. Finally, I replaced the corresponding three files under the original LLaMA-2 with these three merged files and loaded the model for training. The vocabulary size at this point is 90894, and in LLaMA-2's config.json I changed vocab_size: 32000 to vocab_size: 90894. During training I get the following error: ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([90894, 4096])), this look incorrect.

Also, I did read your paper, but I still have not figured this part out: specifically how to merge the vocabularies and then continue with incremental pre-training.

The main merging code is as follows:

import os
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

# chinese_spm / llama_spm are the parsed ModelProto objects of the newly trained
# SentencePiece model and of the original LLaMA-2 tokenizer.model, and
# llama_spm_tokens_set is the set of pieces already in the LLaMA-2 vocabulary.
# vocab_size and model_file are defined earlier in the full script.
Statistics = 0  # number of newly added tokens
for p in chinese_spm.pieces:
    piece = p.piece
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        Statistics += 1
        llama_spm.pieces.append(new_p)  # append the new token to the original tokenizer model
print(f"New model pieces: {len(llama_spm.pieces)}")
print("Number of tokens added:", Statistics)

## Save
output_sp_dir = 'merged_tokenizer_sp'
output_hf_dir = f'merged_tokenizer_hf_{vocab_size}'  # the path to save the merged Chinese-LLaMA tokenizer
os.makedirs(output_sp_dir, exist_ok=True)
with open(output_sp_dir + f'/{model_file}', 'wb') as f:
    f.write(llama_spm.SerializeToString())
tokenizer = LlamaTokenizer(vocab_file=output_sp_dir + f'/{model_file}')

tokenizer.save_pretrained(output_hf_dir)
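After saving, a quick sanity check on the merged tokenizer might look like the sketch below, assuming the merged HF tokenizer was written to output_hf_dir as above (the sample text is only an illustration):

from transformers import LlamaTokenizer

merged_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)
print("Merged vocab size:", len(merged_tokenizer))  # should match the merged size, e.g. 90894 here
print(merged_tokenizer.tokenize("some text in the newly added language"))  # should use the new pieces rather than byte fallback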
ymcui commented 3 months ago

With our script, training from the original LLaMA-2 plus an expanded vocabulary does not require modifying vocab_size in config.json. The resize happens here: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/blob/main/scripts/training/run_clm_pt_with_peft.py#L625
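In other words, the checkpoint is loaded with its original 32000-row embedding and the embedding matrix is grown at runtime to match the merged tokenizer, so config.json is left untouched. A minimal sketch of that resize, with placeholder paths (the referenced script does this internally):

from transformers import AutoModelForCausalLM, LlamaTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/original-llama-2-7b")  # loads with vocab_size 32000
tokenizer = LlamaTokenizer.from_pretrained("path/to/merged_tokenizer_hf")    # merged vocabulary
model.resize_token_embeddings(len(tokenizer))  # enlarges embed_tokens and lm_head to the merged vocab size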

Shajiu commented 3 months ago

I would like to walk through this process step by step. After merging the vocabularies and replacing the corresponding three files under the original LLaMA with the merged ones, what exactly do I need to do for the subsequent continued pre-training? I assume I need to change vocab_size in config.json? And which other parts need to be modified?

ymcui commented 3 months ago

You don't replace the tokenizer. Our script has a dedicated parameter for specifying the Chinese tokenizer; just set it as the tutorial describes. The table there spells it out clearly: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/pt_scripts_zh#支持的训练模式

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 3 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.