ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 large model phase-2 project + 64K long-context models
Apache License 2.0

How to swap in a different tokenizer for training (tokenizer size mismatch) #514

Closed · wangzhengh closed this 5 months ago

wangzhengh commented 5 months ago

The following items must be checked before submission

Issue type

Other

Base model

Chinese-LLaMA-2 (7B/13B)

Operating system

Linux

Detailed description of the problem

The tokenizer that ships with the project has a vocabulary of 55296 entries, while the one I trained myself has 8000. Which places need to be changed for training to run successfully? I tried editing config.json in the model directory, but that alone doesn't seem to be enough. Could you tell me exactly which files need modifying? Thanks.

Dependencies (must be provided for code-related issues)

# Paste your dependency list here (inside this code block)

Runtime logs or screenshots

# Paste your runtime logs here (inside this code block)

iMountTai commented 5 months ago

Without an error message and the location of the failing code, I can't give useful advice. There is no need to modify config.json: pass tokenizer_name_or_path as usual, then comment out line 623, and resize_token_embeddings is applied automatically.
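
For reference, a minimal sketch of the flow described above (paths are placeholders; this assumes a standard transformers setup rather than the exact script internals):

    from transformers import LlamaForCausalLM, LlamaTokenizer

    # Load your own 8000-piece tokenizer instead of the bundled one.
    tokenizer = LlamaTokenizer.from_pretrained("/path/to/your_tokenizer")
    model = LlamaForCausalLM.from_pretrained("/path/to/chinese-llama-2-7b")

    # With the line-623 vocab check commented out, the script resizes the
    # embedding matrix to match the tokenizer, roughly equivalent to:
    if len(tokenizer) != model.get_input_embeddings().weight.shape[0]:
        model.resize_token_embeddings(len(tokenizer))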

wangzhengh commented 5 months ago

After replacing the tokenizer.model file and running the model, I get the error below, which reports an out-of-range index. (Is this because my vocabulary has 8000 entries while the project's has 55296?)

  File "/home/aistudio/work/test.py", line 26, in <module>
    text = tokenizer.decode(generate_ids[0])
  File "/home/aistudio/external-libraries/transformers/tokenization_utils_base.py", line 3756, in decode
    return self._decode(
  File "/home/aistudio/external-libraries/transformers/tokenization_utils.py", line 1001, in _decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/home/aistudio/external-libraries/transformers/tokenization_utils.py", line 982, in convert_ids_to_tokens
    tokens.append(self._convert_id_to_token(index))
  File "/home/aistudio/external-libraries/transformers/models/llama/tokenization_llama.py", line 280, in _convert_id_to_token
    token = self.sp_model.IdToPiece(index)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1045, in _batched_func
    return _func(self, arg)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1038, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.
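
For reference, the mismatch behind this IndexError can be confirmed directly with sentencepiece (the file name below is a placeholder): the model can emit any id below its 55296-entry vocabulary, but an 8000-piece tokenizer cannot map ids at or above 8000 back to pieces.

    import sentencepiece as spm

    # decode() fails as soon as generate() produces an id >= the piece count.
    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    print(sp.get_piece_size())  # 8000 here, far below the ids the model emits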

One more small question: if I don't replace the tokenizer, the model runs fine, but the output quality is poor. I'm using LLaMA2-chinese-7b. Would instruction fine-tuning improve the results? (I want to train a domain-specific large language model.) Here is the actual output:

aistudio@jupyter-9308264-7434019:~/work$ python3 test.py
Loading checkpoint shards:   0%|                                                                                                                                             | 0/2 [00:00<?, ?it/s]/home/aistudio/external-libraries/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.34s/it]
/home/aistudio/external-libraries/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/home/aistudio/external-libraries/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/home/aistudio/external-libraries/transformers/generation/utils.py:1290: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(
<s> Human:介绍一下哈尔滨市
</s><s>Assistant:你好,请问您需要什么帮助?Human:我想咨询一下关于"如何在微信上申请贷款?"的问题。Assistant:"请输入您的问题内容..." HUMAN:(重复) "我...想询问下'怎么用微信贷款?'的相关问题."ASSISTANT : 您好!很高兴为您服务, 我们可以提供以下信息给您参考. 如何使用腾讯金融APP进行个人信用查询? - Q&A (2018-3-7 9:56 AM)... ... AI助手能回答你的所有疑问! 你想知道的一切都在这里了~_搜狐科技_.source-icon { vertical-align: middle; width: 14px; height: 14px; border: 1px solid #eee; border-radius: 100%; margin-right: 5px; margin-top:-3px; }爱范儿(ifanr) 【AI助理】是基于人工智能技术研发的人工智能产品和应用平台、以语音交互为核心的技术解决方案及相关运营推广业务等综合体项目;【AI助理】将通过大数据分析与机器学习算法实现对用户行为习惯以及需求的理解并形成决策支持系统(DSS)为

Sorry if my description is unclear; I'm a student and new to this. Thanks a lot!

iMountTai commented 5 months ago

First problem: delete the cache and regenerate it. You are using a cache built with the old tokenizer, which is bound to conflict. For dialogue, use the Alpaca model rather than LLaMA; poor conversational quality from LLaMA is expected. An instruction-fine-tuned model is generally better than plain LLaMA, but whether it meets your requirements also depends on data quality, training hyperparameters, and so on.
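
For reference, "delete the cache" here refers to the tokenized-dataset cache that the pretraining script wrote with the old tokenizer; removing it forces re-tokenization on the next run. A minimal sketch, assuming the cache lives in the directory you passed to the script (placeholder path):

    import shutil

    # Remove the dataset cache built with the old tokenizer so the corpus
    # is re-tokenized with the new one on the next run.
    shutil.rmtree("/path/to/data_cache_dir", ignore_errors=True)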

wangzhengh commented 5 months ago

Hello! One more question, please. After clearing the cache and regenerating it with the new tokenizer, that error seems to be resolved. I have commented out line 623 of run_clm_pt_with_peft.py:

    # raise ValueError(f"The vocab size of tokenizer is {tokenizer_vocab_size}, not 55296. Please use Chinese-LLaMA-2 tokenizer.")

The log contains the following lines, which I understand to be where the model vocab size is resized to 8001:

01/25/2024 20:14:30 - INFO - __main__ - Model vocab size: 55296
01/25/2024 20:14:30 - INFO - __main__ - Tokenizer vocab size: 8001
01/25/2024 20:14:30 - INFO - __main__ - Resize model vocab size to 8001
[INFO|modeling_utils.py:1902] 2024-01-25 20:14:30,344 >> You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 8001. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
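
For reference, the warning in that log points at the documented `pad_to_multiple_of` keyword of `resize_token_embeddings`; a hedged sketch with placeholder paths:

    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained("/path/to/your_tokenizer")
    model = LlamaForCausalLM.from_pretrained("/path/to/model")

    # Pad the resized vocab up to a multiple of 8 so the embedding matrix
    # stays Tensor-Core friendly instead of landing on 8001 rows.
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
    print(model.get_input_embeddings().weight.shape[0])  # 8008 rather than 8001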

However, a protected-namespace naming-conflict warning appeared during the run:

/opt/conda/envs/python35-paddle120-env/lib/python3.10/site-packages/pydantic/_internal/_fields.py:149: UserWarning: Field "model_persistence_threshold" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(

It then crashed with: AssertionError: 'param_persistence_threshold' is a required field and does not have a default value

Traceback (most recent call last):
  File "/home/aistudio/work/trainning/run_clm_pt_with_peft.py", line 721, in <module>
    main()
  File "/home/aistudio/work/trainning/run_clm_pt_with_peft.py", line 689, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/aistudio/external-libraries/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/aistudio/external-libraries/transformers/trainer.py", line 1568, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "/home/aistudio/external-libraries/transformers/trainer.py", line 810, in get_train_dataloader
    return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
  File "/home/aistudio/external-libraries/accelerate/accelerator.py", line 1219, in prepare
    result = self._prepare_deepspeed(*args)
  File "/home/aistudio/external-libraries/accelerate/accelerator.py", line 1419, in _prepare_deepspeed
    import deepspeed
  File "/home/aistudio/external-libraries/deepspeed/__init__.py", line 16, in <module>
    from . import module_inject
  File "/home/aistudio/external-libraries/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/home/aistudio/external-libraries/deepspeed/module_inject/replace_module.py", line 792, in <module>
    from ..pipe import PipelineModule
  File "/home/aistudio/external-libraries/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/home/aistudio/external-libraries/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/home/aistudio/external-libraries/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/home/aistudio/external-libraries/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 25, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/home/aistudio/external-libraries/deepspeed/runtime/config.py", line 29, in <module>
    from .zero.config import get_zero_config, ZeroStageEnum
  File "/home/aistudio/external-libraries/deepspeed/runtime/zero/__init__.py", line 6, in <module>
    from .partition_parameters import ZeroParamType
  File "/home/aistudio/external-libraries/deepspeed/runtime/zero/partition_parameters.py", line 603, in <module>
    class Init(InsertPostInitMethodToModuleSubClasses):
  File "/home/aistudio/external-libraries/deepspeed/runtime/zero/partition_parameters.py", line 605, in Init
    param_persistence_threshold = get_config_default(DeepSpeedZeroConfig, "param_persistence_threshold")
  File "/home/aistudio/external-libraries/deepspeed/runtime/config_utils.py", line 115, in get_config_default
    assert not config.__fields__.get(
AssertionError: 'param_persistence_threshold' is a required field and does not have a default value

I'm not sure whether this error is related to the namespace conflict. Online searches suggest it may be an issue with the deepspeed package installation order. My package versions follow requirements.txt:

transformers==4.37.0
torch==2.0.1
deepspeed==0.9.3
accelerate==0.26.1

Also, I'm on a cloud server with a single GPU and not using multiple GPUs, so I don't know why the deepspeed package is involved at all (I suspect that package is the problem). Could you shed some light on this?
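
For reference, two hedged checks that match the symptoms above (assumptions, not project-verified advice): the AssertionError in deepspeed's get_config_default, together with the pydantic "protected namespace" warning, is characteristic of running deepspeed 0.9.x (written against pydantic v1) under pydantic 2.x; and accelerate only enters `_prepare_deepspeed` (top of the traceback) when a DeepSpeed config is attached to the run, e.g. via the standard `deepspeed` field of TrainingArguments.

    import pydantic
    from transformers import TrainingArguments

    # 1) deepspeed 0.9.3 expects pydantic 1.x; a 2.x version here would explain
    #    both the namespace warning and the required-field AssertionError.
    print(pydantic.VERSION)

    # 2) With no DeepSpeed config attached, Trainer builds no DeepSpeed plugin
    #    and accelerate never takes the _prepare_deepspeed path on one GPU.
    args = TrainingArguments(output_dir="out", deepspeed=None)
    print(args.deepspeed)  # None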

iMountTai commented 5 months ago

Are you expanding the vocabulary and continuing training on top of Chinese-LLaMA/Alpaca?

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.