xyzhang626 opened 10 months ago
Thanks for your kind words.
The problem is that the all-MiniLM model cannot tokenize all Chinese characters, e.g. '你好,世界':
```
100  <--> [UNK]
100  <--> [UNK]
1989 <--> ,
1745 <--> 世
100  <--> [UNK]
```
And the text segmentation of the bge zh series and MiniLM models splits by individual characters, not by words.
The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model splits by words, but it does not work with embeddings.cpp:
```sh
# the quantization step works fine
pushd models
git clone https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
cd paraphrase-multilingual-MiniLM-L12-v2
wget -O vocab.txt https://huggingface.co/michaelfeil/ct2fast-paraphrase-multilingual-MiniLM-L12-v2/resolve/main/vocabulary.txt?download=true
./run_conversions.sh paraphrase-multilingual-MiniLM-L12-v2
popd

python examples/test_hf_tokenizer.py paraphrase-multilingual-MiniLM-L12-v2
build/bin/test_tokenizer -m models/paraphrase-multilingual-MiniLM-L12-v2/ggml-model-q4_0.bin
```
```
tokenizer test failed: '你好,世界!'
[101, 994, 1322, 100, 6717, 11888, 100, 102, ]
0      -> <s>   : 101   -> ▁_
6      -> ▁     : 994   -> 你
124084 -> 你好  : 1322  -> 好
4      -> ,     : 100   -> ▁for
3221   -> 世界  : 6717  -> 世
38     -> !     : 11888 -> 界
2      -> </s>  : 100   -> ▁for
2      -> </s>  : 102   -> ta
```
Maybe the pre-tokenization step is missing?
Hey @snowyu, sorry for the late reply, and thanks for letting me know about this.
Pre-tokenization is not missing in this repo, but the strategy seems different from paraphrase-multilingual-MiniLM-L12-v2's. Actually, I handle Chinese characters the same way as the huggingface Rust version, where whitespace is inserted between Chinese characters; see the hf Rust implementation and this repo's implementation. That is exactly why Chinese words are split apart and tokenized character by character.
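The whitespace-insertion step described above can be sketched as follows. This is a simplified illustration, not the repo's actual code; the Unicode ranges below are the common CJK ideograph blocks and are an assumption on my part:

```python
# Sketch of the pre-tokenization step: insert whitespace around every CJK
# character so each one is later tokenized as its own "word". This mirrors
# the huggingface BERT-style behavior described above; the exact code-point
# ranges are an assumption, not copied from this repo's source.

def is_cjk_char(cp: int) -> bool:
    """Return True if the code point falls in a common CJK ideograph block."""
    return (
        0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF    # Extension A
        or 0x20000 <= cp <= 0x2A6DF  # Extension B
        or 0xF900 <= cp <= 0xFAFF    # Compatibility Ideographs
    )

def pad_cjk(text: str) -> str:
    """Surround each CJK character with spaces."""
    out = []
    for ch in text:
        if is_cjk_char(ord(ch)):
            out.append(f" {ch} ")
        else:
            out.append(ch)
    return "".join(out)

print(pad_cjk("hello你好world").split())  # -> ['hello', '你', '好', 'world']
```

After this step, a whitespace splitter sees each Chinese character as a separate token, which is why the character-level splits appear in the output above.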
I think the differences are caused by different tokenization algorithms. all-MiniLM and bge-small-zh-v1.5 use an algorithm called WordPiece, whose key feature is splitting words into subwords (each Chinese character is treated as a single word). For example, "tokenization" is tokenized into "token" and "##ization"; the special prefix "##" marks a subword that starts in the middle of a word.
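The subword lookup WordPiece uses is a greedy longest-match-first scan. Here is a minimal sketch with a tiny hypothetical vocabulary (real models ship a full vocab.txt):

```python
# Minimal sketch of WordPiece's greedy longest-match-first subword lookup.
# The toy vocabulary is hypothetical; real models use a large vocab.txt.

VOCAB = {"token", "##ization", "##ize", "[UNK]"}

def wordpiece(word: str, vocab=VOCAB) -> list[str]:
    """Split one word into subword pieces; '##' marks a continuation piece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("tokenization"))  # -> ['token', '##ization']
```

This also shows why unseen characters collapse to `[UNK]`: if no vocabulary entry matches at some position, the whole word falls back to the unknown token, which is what happens with the missing Chinese characters above.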
However, in the tokenized results from paraphrase-multilingual-MiniLM-L12-v2, I did not find anything similar. I suspect it uses a different tokenization algorithm (not reported in the paper or model card). Since paraphrase-multilingual-MiniLM-L12-v2 is not at a leading position on the MTEB benchmark, tokenizing Chinese by words rather than single characters might not be necessary, at least from a performance point of view. Anyway, this is an interesting point; I will try to figure it out when I have time.
Pre-tokenizers come in many types; WordPiece is just one of them. See the tokenizer.json file in paraphrase-multilingual-MiniLM:
```json
{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "WhitespaceSplit"},
      {"type": "Metaspace", "replacement": "▁", ...}
    ]
  }
}
```
More details in the HF documentation: Pre-tokenizers.
The JS code may be clearer, since it is all in one file: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
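The `Sequence` of `WhitespaceSplit` followed by `Metaspace` shown in that tokenizer.json can be sketched like this. It is a simplified approximation based on the HF pre-tokenizer docs, not the actual library code; in particular, the prefix behavior is an assumption:

```python
# Hedged sketch of the pre_tokenizer "Sequence" from the tokenizer.json above:
# first split on whitespace, then Metaspace marks word boundaries with "▁".
# Prepending "▁" to every piece approximates the add_prefix_space behavior
# described in the HF docs; the real Metaspace options may differ.

def whitespace_split(text: str) -> list[str]:
    """Step 1: WhitespaceSplit."""
    return text.split()

def metaspace(pieces: list[str], replacement: str = "▁") -> list[str]:
    """Step 2: Metaspace — make word boundaries visible to the subword model."""
    return [replacement + p for p in pieces]

print(metaspace(whitespace_split("Hello world")))  # -> ['▁Hello', '▁world']
```

This explains the `▁`-prefixed pieces (like `▁Hello`) in the tokenizer test output above: the boundary marker survives into the final tokens instead of a `##` continuation prefix.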
Summary
This repo's tokenizer behaves consistently with huggingface's tokenizer in most cases; for the bge zh series models it behaves inconsistently, but possibly better.
Details
The bge-small-zh-v1.5 tokenizer handles two cases badly: 1) words with capital letters and 2) accented letters. This may be caused by its normalization settings. For example, on the input 大家好我是GPT, the hf tokenizer (left column) cannot recognize the uppercase GPT, but the tokenizer in this repo (right column) can. The accent case is similar.
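The normalization difference suggested above can be illustrated with a BERT-style normalizer that lowercases and strips accents before tokenization. Whether bge-small-zh-v1.5 actually enables these options is an assumption here; its tokenizer.json would confirm:

```python
# Sketch of BERT-style normalization: lowercasing and accent stripping,
# which would explain why an uppercase "GPT" or an accented word is altered
# before tokenization. The settings of bge-small-zh-v1.5 are an assumption;
# check its tokenizer.json for the real normalizer configuration.
import unicodedata

def normalize(text: str, lowercase: bool = True, strip_accents: bool = True) -> str:
    if strip_accents:
        # NFD splits accented chars into base + combining mark; drop the marks
        text = "".join(
            ch for ch in unicodedata.normalize("NFD", text)
            if unicodedata.category(ch) != "Mn"
        )
    if lowercase:
        text = text.lower()
    return text

print(normalize("大家好我是GPT"))  # -> 大家好我是gpt
print(normalize("café"))          # -> cafe
```

If one tokenizer lowercases "GPT" to "gpt" before the vocabulary lookup and the other does not, the two will pick different vocabulary entries for the same input, matching the discrepancy described above.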
If you find any more differences between the tokenizer in this repo and the huggingface one, please let me know and I will try to fix it.