zcbenz opened 1 week ago
Please use Xenova/t5-small instead. The original t5 tokenizer config is missing a few values. :)
Can you share what is missing? I want to make my code compatible with the official model if it is not too hard.
I did an experiment trying to fix the tokenizer in the original T5 repo, google-t5/t5-small. This time the tokenizer in transformers.js can actually load the exported tokenizer; however, it generates wrong results.
For the prompt `translate English to German: As much as six inches of rain could fall in the New York City region through Monday morning, and officials warned of flooding along the coast.`, the tokenizer in Python gives this result:
```
[
  13959, 1566, 12, 2968, 10, 282,
  231, 38, 1296, 5075, 13, 3412,
  228, 1590, 16, 8, 368, 1060,
  896, 1719, 190, 2089, 1379, 6,
  11, 4298, 15240, 13, 18368, 590,
  8, 4939, 5, 1
]
```
This is the same as the result of transformers.js with the tokenizers from Xenova/t5-small.
But when using transformers.js with the exported tokenizers from google-t5/t5-small, the result becomes:
```
[
  7031, 5867, 26749, 235, 24518, 10, 188, 7,
  51, 2295, 9, 7, 7, 2407, 77, 2951,
  858, 6559, 509, 83, 26, 2857, 77, 532,
  6861, 476, 127, 157, 254, 485, 18145, 11258,
  9168, 1135, 2528, 29, 53, 6, 232, 20884,
  7, 2910, 29, 15, 26, 858, 89, 40,
  32, 32, 26, 53, 9, 2961, 532, 25500,
  5, 1
]
```
The problematic tokenizer files are tokenizer.json and tokenizer_config.json.
The diff between this tokenizer.json and the one in Xenova/t5-small is:
```diff
947,948c947
< "prepend_scheme": "always",
< "split": true
---
> "add_prefix_space": true
1009,1010c1008
< "prepend_scheme": "always",
< "split": true
---
> "add_prefix_space": true
129416,129417c129414
< ],
< "byte_fallback": false
---
> ]
```
The tokenizer files present in google-t5/t5-small were exported ~4 years ago and are out of date with respect to the current version of transformers.js (missing some values in tokenizer_config.json, I believe). I suppose we could add the default values that are probably missing from the tokenizer config, but this might not scale to all tokenizers.
As you mentioned above, when using Xenova/t5-small, I get the same results as the Python library (demo link). These are obtained by simply doing `AutoTokenizer.from_pretrained('google-t5/t5-small').save_pretrained('output')`.
I have used `AutoTokenizer.from_pretrained('google-t5/t5-small').save_pretrained('output')` to convert the tokenizers from the google-t5/t5-small repo, and the saved tokenizer.json file is slightly different from the one in Xenova/t5-small, which I believe is caused by changes in the Python transformers library. It is this newly saved tokenizer that transformers.js produces wrong results with.
The tokenizer.json file I got from google-t5/t5-small by using save_pretrained:
tokenizer.json
If you replace the tokenizer.json file in Xenova/t5-small with it, you can reproduce the wrong results in transformers.js. (Replacing tokenizer_config.json did not change the behavior in my testing.)
It seems that transformers.js can correctly handle this decoder:
"decoder": {
"type": "Metaspace",
"replacement": "▁",
"add_prefix_space": true
},
but fails with this one:
"decoder": {
"type": "Metaspace",
"replacement": "▁",
"prepend_scheme": "always",
"split": true
},
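For what it's worth, the two spellings are meant to describe the same behavior: `add_prefix_space: true` is the legacy form of `prepend_scheme: "always"`. Here is a minimal sketch of a Metaspace decode step under that assumption (illustrative only, not the transformers.js implementation):

```javascript
// Minimal sketch of a Metaspace decoder (NOT the transformers.js source).
// Assumes the legacy `add_prefix_space: true` and the newer
// `prepend_scheme: "always"` are equivalent.
function metaspaceDecode(tokens, { replacement = "\u2581", prependScheme = "always" } = {}) {
  // Join the tokens and turn every replacement character ("▁") back into a space.
  let text = tokens.join("").split(replacement).join(" ");
  // "always" means a prefix space was prepended at encode time; strip it on decode.
  if (prependScheme === "always" && text.startsWith(" ")) {
    text = text.slice(1);
  }
  return text;
}

console.log(metaspaceDecode(["\u2581Hello", "\u2581world"])); // "Hello world"
```

So a decoder that understands only one of the two spellings would still produce the same output for the other, as long as it maps them onto the same logic.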
I believe I have figured it out: the tokenizer.json in google-t5/t5-small is missing the `type` field for the tokenizer model (so it doesn't use the Unigram tokenizer model). I have added 32d8df40c184853c0697747e75bc12624530252c to support this legacy tokenizer (since I imagine it's a popular tokenizer to use).
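To illustrate the failure mode, here is a hypothetical sketch (the names are made up; this is not the actual transformers.js loader) of how dispatching on a missing `type` field can yield `undefined` instead of a model:

```javascript
// Hypothetical sketch of type-based model dispatch; MODEL_CLASSES and
// modelFromConfig are illustrative names, not transformers.js internals.
const MODEL_CLASSES = {
  Unigram: (config) => ({ kind: "Unigram", size: config.vocab.length }),
  BPE: (config) => ({ kind: "BPE", size: Object.keys(config.vocab).length }),
};

function modelFromConfig(config) {
  const factory = MODEL_CLASSES[config.type]; // undefined when "type" is absent
  return factory ? factory(config) : undefined;
}

// Legacy tokenizer.json files omit "type", so no model class is selected:
console.log(modelFromConfig({ vocab: [["<pad>", 0.0]] })); // undefined
console.log(modelFromConfig({ type: "Unigram", vocab: [["<pad>", 0.0]] })); // { kind: "Unigram", size: 1 }
```

The fix referenced above presumably falls back to a sensible default when the field is absent in legacy files.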
Note that you will need to install @huggingface/transformers (Transformers.js v3) when the next release comes out; I will close this issue once it is out 👍
Thanks a lot!
System Info
Node.js v22.9.0.
"@xenova/transformers": "2.17.2"
Environment/Platform
Description
Encoding text with the tokenizer of google-t5/t5-small returns undefined.
Reproduction
Run code with Node.js:
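(The original snippet was not preserved in this thread; a minimal reproduction along these lines, assuming the standard `AutoTokenizer` API of @xenova/transformers and that the failing call is `encode`, would be:)

```javascript
// Hypothetical reconstruction of the reproduction; the exact original code
// was not preserved here. Requires: npm install @xenova/transformers
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('google-t5/t5-small');
const ids = tokenizer.encode('translate English to German: Hello world.');
console.log(ids);
```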
Outputs: