make_input and model.weight.tokenizer.encode produce extra spaces

ztxz16 / fastllm

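A pure C++ cross-platform LLM acceleration library with Python bindings. ChatGLM-6B-class models can reach 10000+ tokens/s on a single card. Supports GLM, LLaMA, and MOSS base models, and runs smoothly on mobile devices.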

Apache License 2.0

make_input and model.weight.tokenizer.encode produce extra spaces #402

Open yiguanxian opened 7 months ago

yiguanxian commented 7 months ago

Model: baichuan2-13B-chat

Problem 1. Code to reproduce:

  In [4]: import pyfastllm
  In [5]: model = pyfastllm.create_model("baichuan2-int8.flm")
  In [6]: prompt = model.make_input("", 0, "你好")
  In [7]: prompt
  Out[7]: ' 你好'

Issue: after calling make_input, an extra space appears in front of "你好".

Problem 2. Code to reproduce:

  In [7]: model = pyfastllm.create_model("baichuan2-int8.flm")
  In [8]: prompt = model.make_input("", 0, "你好")
  In [9]: final_prompt = "这是pre prompt" + prompt
  In [10]: input_id = model.weight.tokenizer.encode(final_prompt)
  In [11]: input_id = input_id.to_list()
  In [12]: input_id = [int(v) for v in input_id]
  In [13]: input_id
  Out[13]: [92311, 2691, 4596, 12909, 195, 100030, 92428, 196]
  In [14]: model.weight.tokenizer.decode(input_id)
  Out[14]: ' 这是pre prompt 你好'
  In [15]: model.weight.tokenizer.decode([2691])
  Out[15]: '这是'
  In [16]: model.weight.tokenizer.decode([92311])
  Out[16]: ' '

Issue: after make_input, I prepended a custom pre-prompt ("这是pre prompt") to the prompt and encoded the result with model.weight.tokenizer.encode. The encoded tokens contain an extra 92311 at the start, and that token is a space (the decode output also shows an extra space in front of "这是").

yiguanxian commented 7 months ago

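Also, the reason I prepend the pre-prompt to the prompt produced by make_input: putting the pre-prompt into the model at conversion time is inconvenient, because every time I change the pre-prompt I have to convert the model again. So I concatenate it at inference time instead. (A model created with pyfastllm.create_model does not expose the pre_prompt attribute, so I cannot reset it and can only concatenate.)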

Zhiwei35 commented 7 months ago

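+1. When I use fastllm's tokenizer encode on its own with an English sentence as input, it also produces an extra space. I am not sure whether this affects inference results.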

TylunasLi commented 7 months ago

According to the definition in sentencepiece_model.proto:

  // Adds dummy whitespace at the beginning of text in order to
  // treat "world" in "world" and "hello world" in the same way.
  optional bool add_dummy_prefix = 3 [default = true];

The add_dummy_prefix value controls whether a space is added in front of the input sequence. This value differs between models. For example:

ChatGLM3-6B:

>>> import sentencepiece.sentencepiece_model_pb2 as model
>>> m = model.ModelProto()
>>> m.ParseFromString(open('tokenizer.model', 'rb').read())
1018370
>>> m.normalizer_spec.add_dummy_prefix
True

Baichuan2-7B-Chat:

>>> import sentencepiece.sentencepiece_model_pb2 as model
>>> m = model.ModelProto()
>>> m.ParseFromString(open('Baichuan2-7B-Chat/tokenizer.model', 'rb').read())
2001107
>>> m.normalizer_spec.add_dummy_prefix
False
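
To make the effect of the flag concrete, here is a simplified sketch of the whitespace handling it controls. This is not SentencePiece's actual implementation: `normalize` is a hypothetical helper, and it assumes the usual setup in which spaces are escaped to the "▁" (U+2581) symbol before the text is split into pieces.

```python
# Simplified illustration only; `normalize` is a hypothetical helper,
# not a SentencePiece or fastllm function.

META_SPACE = "\u2581"  # "▁", the symbol used for an escaped space


def normalize(text: str, add_dummy_prefix: bool = True) -> str:
    """Escape spaces and, if requested, prepend a dummy leading space."""
    if add_dummy_prefix and text:
        text = " " + text
    return text.replace(" ", META_SPACE)


print(normalize("world"))                          # ▁world
print(normalize("hello world"))                    # ▁hello▁world
print(normalize("world", add_dummy_prefix=False))  # world
```

With the flag on, "world" becomes "▁world" both on its own and inside "hello world", which is the stated purpose of the option. With the flag off, no leading symbol is added, so a tokenizer that always adds one yields one extra token at the start.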

Currently, fastllm does not support reading this value.
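
Until the value is read from the model, one possible caller-side workaround is to drop the leading space token after encoding. The sketch below is an assumption-laden illustration, not part of fastllm: `strip_dummy_prefix` is a hypothetical helper, the caller must supply the model's `add_dummy_prefix` value (for example, read with the sentencepiece snippet above), and it only covers the case from this report where the added space ends up as a standalone first token (92311 here).

```python
# Hypothetical post-processing helper; not a fastllm API.
from typing import List


def strip_dummy_prefix(token_ids: List[int], text: str,
                       space_token_id: int,
                       add_dummy_prefix: bool) -> List[int]:
    """Remove a leading space token that the tokenizer added on its own.

    Nothing is removed when the model expects a dummy prefix, or when
    the input text really does start with a space.
    """
    ids = list(token_ids)
    if add_dummy_prefix or text.startswith(" "):
        return ids
    if ids and ids[0] == space_token_id:
        return ids[1:]
    return ids


# Token IDs taken from the report above; 92311 decodes to ' '.
ids = [92311, 2691, 4596, 12909, 195, 100030, 92428, 196]
print(strip_dummy_prefix(ids, "这是pre prompt", 92311, add_dummy_prefix=False))
```

If the added space is merged into the first word's token instead of appearing as a separate token, this helper leaves the IDs unchanged, so it should be treated as a stopgap only.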