ztxz16 / fastllm

A pure C++ LLM acceleration library for all platforms, callable from Python; chatglm-6B-class models can reach 10000+ tokens/s on a single GPU; supports glm, llama, and moss base models; runs smoothly on mobile devices
Apache License 2.0

The flm tokenizer and the original tokenizer produce different tokenization results #397

Open · yiguanxian opened this issue 8 months ago

yiguanxian commented 8 months ago

Both chatglm2 and baichuan2 exhibit this problem.

1. Model conversion

```python
from fastllm_pytools import llm
from transformers import AutoTokenizer, AutoModel

hf_model = "/workspace/chatglm2-6B"

flm_dtype = "int8"
model_name = hf_model.split("/")[-1]
flm_model = f"/workspace/models/{model_name}-fastllm-{flm_dtype}.flm"

# Load the HF model, convert it to fastllm format, and save the .flm file
tokenizer = AutoTokenizer.from_pretrained(hf_model, trust_remote_code=True)
model = AutoModel.from_pretrained(hf_model, trust_remote_code=True).half().cuda()
model = llm.from_hf(model, tokenizer, dtype=flm_dtype)
model.save(flm_model)
```

2. Test code

```python
prompt_input = "[Round 1]"

# Tokenize with the original HF tokenizer
from transformers import AutoTokenizer
model_path = "/workspace/chatglm2-6B"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print(f"src prompt: {prompt_input}, token id: {tokenizer.encode(prompt_input)}")  # R ound

# Tokenize with the converted fastllm (.flm) model
import fastllm
model_path = "/workspace/models/chatglm2-6B-fastllm-int8.flm"
model = fastllm.create_llm(model_path)
input_ids = model.weight.tokenizer.encode(prompt_input)
input_ids = input_ids.to_list()
input_ids = [int(v) for v in input_ids]
print(f"fastllm prompt: {prompt_input}, token id: {input_ids}")  # Ro und
```

3. Test results

The original tokenizer splits the word "Round" into "R" and "ound", while flm splits it into "Ro" and "und". Similarly, for the baichuan2 input "你是可爱", the original tokenizer produces "你是" and "可爱", while the converted baichuan2 model produces "你", "是可", and "爱" (see the per-token comparison sketch below).
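To make the divergence visible token by token, the ids from both tokenizers can be rendered back to their surface strings. This is a minimal diagnostic sketch, not part of the original report: it assumes the same chatglm2 paths as above, reuses the fastllm calls shown in the test code (`fastllm.create_llm`, `model.weight.tokenizer.encode`), and relies on both id sequences indexing the same SentencePiece vocabulary so the HF tokenizer can decode either one.

```python
# Diagnostic sketch (assumes the paths and fastllm API from the test code above)
from transformers import AutoTokenizer
import fastllm

hf_tok = AutoTokenizer.from_pretrained("/workspace/chatglm2-6B", trust_remote_code=True)
flm_model = fastllm.create_llm("/workspace/models/chatglm2-6B-fastllm-int8.flm")

prompt = "[Round 1]"
hf_ids = hf_tok.encode(prompt)
flm_ids = [int(v) for v in flm_model.weight.tokenizer.encode(prompt).to_list()]

# Both id lists index the same SentencePiece vocabulary, so the HF tokenizer
# can render either sequence as token strings for a side-by-side comparison.
print("hf :", hf_ids, hf_tok.convert_ids_to_tokens(hf_ids))
print("flm:", flm_ids, hf_tok.convert_ids_to_tokens(flm_ids))
```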

TylunasLi commented 8 months ago

The chatglm3 issue was caused by model.save() not saving the SentencePiece token weights; the problem does not occur when using torch2flm.toFile(). A fix has been made.
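For anyone on an older build, converting via torch2flm as suggested above sidesteps the lossy model.save() path. The sketch below is an assumption: it uses `torch2flm.tofile` from `fastllm_pytools` (the comment above writes `toFile()`, so the exact name and signature should be checked against the installed version).

```python
# Workaround sketch: export through torch2flm so the SentencePiece token
# weights are written into the .flm file (assumed API: torch2flm.tofile;
# verify the exact name/signature in your fastllm version).
from fastllm_pytools import torch2flm
from transformers import AutoTokenizer, AutoModel

hf_model = "/workspace/chatglm2-6B"
tokenizer = AutoTokenizer.from_pretrained(hf_model, trust_remote_code=True)
model = AutoModel.from_pretrained(hf_model, trust_remote_code=True).half()

torch2flm.tofile("/workspace/models/chatglm2-6B-fastllm-int8.flm",
                 model, tokenizer, dtype="int8")
```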