xyzhang626 opened 10 months ago
Thanks for your kind words.
The problem is that the all-MiniLM model cannot tokenize all Chinese characters, e.g. '你好,世界':
```
100  <--> [UNK]
100  <--> [UNK]
1989 <--> ,
1745 <--> 世
100  <--> [UNK]
```
And the text segmentation of the bge zh series and MiniLM models splits by individual characters, not by words.
The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model splits by words, but it does not work with embeddings.cpp:
```sh
# the quantization step works fine
pushd models
git clone https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
cd paraphrase-multilingual-MiniLM-L12-v2
wget -O vocab.txt https://huggingface.co/michaelfeil/ct2fast-paraphrase-multilingual-MiniLM-L12-v2/resolve/main/vocabulary.txt?download=true
./run_conversions.sh paraphrase-multilingual-MiniLM-L12-v2
popd

python examples/test_hf_tokenizer.py paraphrase-multilingual-MiniLM-L12-v2
build/bin/test_tokenizer -m models/paraphrase-multilingual-MiniLM-L12-v2/ggml-model-q4_0.bin
```
```
tokenizer test failed: '你好,世界!'
[101, 994, 1322, 100, 6717, 11888, 100, 102, ]
0      -> <s>   : 101   -> ▁_
6      -> ▁     : 994   -> 你
124084 -> 你好  : 1322  -> 好
4      -> ,     : 100   -> ▁for
3221   -> 世界  : 6717  -> 世
38     -> !     : 11888 -> 界
2      -> </s>  : 100   -> ▁for
2      -> </s>  : 102   -> ta
```
Maybe the pre-tokenization step is missing?
Hey @snowyu, sorry for the late reply, and thanks for letting me know about this.
Pre-tokenization is not missing in this repo, but the strategy seems different from paraphrase-multilingual-MiniLM-L12-v2's. Actually, I handle Chinese characters the same way as the huggingface Rust version, where whitespace is inserted between Chinese characters; see the hf Rust implementation and this repo's implementation. That is exactly why Chinese words are split apart and tokenized character by character.
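The whitespace-insertion step described above can be sketched as follows. This is a simplified illustration, not the repo's actual code; the Unicode ranges below are the common CJK ideograph blocks and are an assumption on my part:

```python
# Sketch of the pre-tokenization step: insert whitespace around every CJK
# character so each one is later tokenized as its own "word". This mirrors
# the huggingface BERT-style behavior described above; the exact code-point
# ranges are an assumption, not copied from this repo's source.

def is_cjk_char(cp: int) -> bool:
    """Return True if the code point falls in a common CJK ideograph block."""
    return (
        0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF    # Extension A
        or 0x20000 <= cp <= 0x2A6DF  # Extension B
        or 0xF900 <= cp <= 0xFAFF    # Compatibility Ideographs
    )

def pad_cjk(text: str) -> str:
    """Surround each CJK character with spaces."""
    out = []
    for ch in text:
        if is_cjk_char(ord(ch)):
            out.append(f" {ch} ")
        else:
            out.append(ch)
    return "".join(out)

print(pad_cjk("hello你好world").split())  # -> ['hello', '你', '好', 'world']
```

After this step, a whitespace splitter sees each Chinese character as a separate token, which is why the character-level splits appear in the output above.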
I think the differences are caused by different tokenization algorithms. all-MiniLM and bge-small-zh-v1.5 use an algorithm called WordPiece, whose key feature is splitting words into subwords (each Chinese character is treated as a single word). For example, "tokenization" is tokenized into "token" and "##ization"; the special prefix "##" marks a subword that starts in the middle of a word.
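The subword lookup WordPiece uses is a greedy longest-match-first scan. Here is a minimal sketch with a tiny hypothetical vocabulary (real models ship a full vocab.txt):

```python
# Minimal sketch of WordPiece's greedy longest-match-first subword lookup.
# The toy vocabulary is hypothetical; real models use a large vocab.txt.

VOCAB = {"token", "##ization", "##ize", "[UNK]"}

def wordpiece(word: str, vocab=VOCAB) -> list[str]:
    """Split one word into subword pieces; '##' marks a continuation piece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

print(wordpiece("tokenization"))  # -> ['token', '##ization']
```

This also shows why unseen characters collapse to `[UNK]`: if no vocabulary entry matches at some position, the whole word falls back to the unknown token, which is what happens with the missing Chinese characters above.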
However, in the tokenized results from paraphrase-multilingual-MiniLM-L12-v2, I did not find anything similar. I suspect it uses a different tokenization algorithm (not reported in the paper or model card). Since paraphrase-multilingual-MiniLM-L12-v2 is not at a leading position on the MTEB benchmark, tokenizing Chinese by words rather than single characters might not be necessary, at least from a performance point of view. Anyway, this is an interesting point; I will try to figure it out when I have time.
Pre-tokenizers come in many types; WordPiece is just one of them. See the tokenizer.json file in paraphrase-multilingual-MiniLM:
```json
{
  ...,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {"type": "WhitespaceSplit"},
      {"type": "Metaspace", "replacement": "▁", ...}
    ]
  }
}
```
More details in the HF documentation: Pre-tokenizers.
The JS code may be clearer, since it is all in one file: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js
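The `Sequence` of `WhitespaceSplit` followed by `Metaspace` shown in that tokenizer.json can be sketched like this. It is a simplified approximation based on the HF pre-tokenizer docs, not the actual library code; in particular, the prefix behavior is an assumption:

```python
# Hedged sketch of the pre_tokenizer "Sequence" from the tokenizer.json above:
# first split on whitespace, then Metaspace marks word boundaries with "▁".
# Prepending "▁" to every piece approximates the add_prefix_space behavior
# described in the HF docs; the real Metaspace options may differ.

def whitespace_split(text: str) -> list[str]:
    """Step 1: WhitespaceSplit."""
    return text.split()

def metaspace(pieces: list[str], replacement: str = "▁") -> list[str]:
    """Step 2: Metaspace — make word boundaries visible to the subword model."""
    return [replacement + p for p in pieces]

print(metaspace(whitespace_split("Hello world")))  # -> ['▁Hello', '▁world']
```

This explains the `▁`-prefixed pieces (like `▁Hello`) in the tokenizer test output above: the boundary marker survives into the final tokens instead of a `##` continuation prefix.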
Summary
This repo's tokenizer behaves consistently with huggingface's tokenizer in most cases; for the bge zh series models it behaves inconsistently, but possibly better.
Details
The bge-small-zh-v1.5 tokenizer handles two cases badly: 1) words with capital letters and 2) accented letters. This may be caused by its normalization settings. For example, on the input 大家好我是GPT, the hf tokenizer (left column) cannot recognize the uppercase GPT, but the tokenizer in this repo (right column) can. The accent case is similar.
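The normalization difference suggested above can be illustrated with a BERT-style normalizer that lowercases and strips accents before tokenization. Whether bge-small-zh-v1.5 actually enables these options is an assumption here; its tokenizer.json would confirm:

```python
# Sketch of BERT-style normalization: lowercasing and accent stripping,
# which would explain why an uppercase "GPT" or an accented word is altered
# before tokenization. The settings of bge-small-zh-v1.5 are an assumption;
# check its tokenizer.json for the real normalizer configuration.
import unicodedata

def normalize(text: str, lowercase: bool = True, strip_accents: bool = True) -> str:
    if strip_accents:
        # NFD splits accented chars into base + combining mark; drop the marks
        text = "".join(
            ch for ch in unicodedata.normalize("NFD", text)
            if unicodedata.category(ch) != "Mn"
        )
    if lowercase:
        text = text.lower()
    return text

print(normalize("大家好我是GPT"))  # -> 大家好我是gpt
print(normalize("café"))          # -> cafe
```

If one tokenizer lowercases "GPT" to "gpt" before the vocabulary lookup and the other does not, the two will pick different vocabulary entries for the same input, matching the discrepancy described above.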
If you find any more differences between the tokenizer in this repo and the huggingface one, please let me know and I will try to fix it.