xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Failed to encode text with T5's tokenizer #959

Open zcbenz opened 1 week ago

zcbenz commented 1 week ago

System Info

Node.js v22.9.0. "@xenova/transformers": "2.17.2"

Environment/Platform

Node.js (server-side)

Description

Encoding text with the tokenizer of google-t5/t5-small returns undefined.

Reproduction

Run this code with Node.js:

import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('google-t5/t5-small');
console.log(tokenizer.encode('test'));

Outputs:

[ undefined, 1 ]
xenova commented 1 week ago

Please use Xenova/t5-small instead. The original t5 tokenizer config is missing a few values. :)
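
For reference, the workaround is just the reproduction above with the repo name swapped (a minimal sketch):

import { AutoTokenizer } from '@xenova/transformers';

// Xenova/t5-small ships a tokenizer.json re-exported with the fields
// transformers.js expects, so encoding yields real token ids.
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/t5-small');
console.log(tokenizer.encode('test')); // no undefined entries; ends with the EOS id 1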

zcbenz commented 1 week ago

Can you share what is missing? I want to make my code compatible with the official model if it is not too hard.

zcbenz commented 1 week ago

I did an experiment to try to fix the tokenizer in the original T5 repo:

  1. Download the files of google-t5/t5-small.
  2. Run this script to export new JSON files for the malformed tokenizer.

This time the tokenizer in transformers.js can actually load the exported files; however, it generates wrong results.

For the prompt "translate English to German: As much as six inches of rain could fall in the New York City region through Monday morning, and officials warned of flooding along the coast.", the tokenizer in Python gives this result:

[
  13959, 1566,    12, 2968,    10,  282,
    231,   38,  1296, 5075,    13, 3412,
    228, 1590,    16,    8,   368, 1060,
    896, 1719,   190, 2089,  1379,    6,
     11, 4298, 15240,   13, 18368,  590,
      8, 4939,     5,    1
]

This is the same as the result from transformers.js with the tokenizer from Xenova/t5-small.

But when using transformers.js with the tokenizer exported from google-t5/t5-small, the result becomes:

[
  7031, 5867, 26749, 235, 24518,   10,   188,     7,
    51, 2295,     9,   7,     7, 2407,    77,  2951,
   858, 6559,   509,  83,    26, 2857,    77,   532,
  6861,  476,   127, 157,   254,  485, 18145, 11258,
  9168, 1135,  2528,  29,    53,    6,   232, 20884,
     7, 2910,    29,  15,    26,  858,    89,    40,
    32,   32,    26,  53,     9, 2961,   532, 25500,
     5,    1
]

The problematic tokenizer files are tokenizer.json and tokenizer_config.json.

The diff between tokenizer.json and the one in Xenova/t5-small is:

947,948c947
<         "prepend_scheme": "always",
<         "split": true
---
>         "add_prefix_space": true
1009,1010c1008
<     "prepend_scheme": "always",
<     "split": true
---
>     "add_prefix_space": true
129416,129417c129414
<     ],
<     "byte_fallback": false
---
>     ]
xenova commented 1 week ago

The tokenizer files in google-t5/t5-small were exported ~4 years ago and are out of date relative to the current version of transformers.js (missing some values in tokenizer_config.json, I believe). I suppose we could add the default values that are probably missing from the tokenizer config, but that might not scale to all tokenizers.

As you mentioned above, when using Xenova/t5-small, I get the same results as the Python library (demo link). Those files were obtained by simply doing AutoTokenizer.from_pretrained('google-t5/t5-small').save_pretrained('output')

zcbenz commented 1 week ago

I used AutoTokenizer.from_pretrained('google-t5/t5-small').save_pretrained('output') to convert the tokenizer from the google-t5/t5-small repo, and the saved tokenizer.json file is slightly different from the one in Xenova/t5-small, which I believe is caused by changes in Python transformers. It is this newly saved tokenizer that transformers.js produces wrong results with.

The tokenizer.json file I got from google-t5/t5-small by using save_pretrained: tokenizer.json

If you replace the tokenizer.json file in Xenova/t5-small with it, you can reproduce the wrong results in transformers.js. (In my testing, changing tokenizer_config.json did not affect the behavior.)
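
To reproduce without editing the Hub repo, you can also point transformers.js at a local copy of the files. A rough sketch, where t5-small-exported is a placeholder directory holding the re-exported tokenizer files:

import { AutoTokenizer, env } from '@xenova/transformers';

// Resolve models from the local filesystem instead of the Hugging Face Hub.
env.localModelPath = '.';
env.allowRemoteModels = false;

// Placeholder directory containing the re-exported tokenizer.json and
// tokenizer_config.json.
const tokenizer = await AutoTokenizer.from_pretrained('t5-small-exported');
console.log(tokenizer.encode('test'));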

It seems that transformers.js can correctly handle this decoder:

  "decoder": {
    "type": "Metaspace",
    "replacement": "▁",
    "add_prefix_space": true
  },

but fails with this one:

  "decoder": {
    "type": "Metaspace",
    "replacement": "▁",
    "prepend_scheme": "always",
    "split": true
  },
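
If I understand the schema change correctly, newer versions of the tokenizers library renamed Metaspace's add_prefix_space field to prepend_scheme, so the two configs above should describe the same behavior. Roughly, a Metaspace decode does something like this (an illustrative sketch, not transformers.js's actual implementation):

// Illustrative only: join the tokens, turn the '▁' (U+2581) markers back
// into spaces, and drop the leading space that encoding prepended
// (add_prefix_space: true and prepend_scheme: "always" both imply it).
function metaspaceDecode(tokens, replacement = '\u2581') {
  const text = tokens.join('').replaceAll(replacement, ' ');
  return text.startsWith(' ') ? text.slice(1) : text;
}

console.log(metaspaceDecode(['\u2581test', 'ing'])); // "testing"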
xenova commented 1 week ago

I believe I have figured it out: the tokenizer.json in google-t5/t5-small is missing the type field for the tokenizer model (so it doesn't use the Unigram tokenizer model). I have added 32d8df40c184853c0697747e75bc12624530252c to support this legacy tokenizer (since I imagine it's a popular one).
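
Roughly, the idea is to infer the model type when the field is absent. A hypothetical sketch of that fallback (not the actual commit):

// Hypothetical sketch; see the commit above for the real fix.
function resolveModelType(modelConfig) {
  if (modelConfig.type) return modelConfig.type;
  // Legacy exports like google-t5/t5-small omit `type`. A vocab made of
  // [token, score] pairs is characteristic of a Unigram/SentencePiece model.
  if (Array.isArray(modelConfig.vocab) && Array.isArray(modelConfig.vocab[0])) {
    return 'Unigram';
  }
  throw new Error('Cannot infer tokenizer model type');
}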

Note that you will need to install @huggingface/transformers (Transformers.js v3) when the next release comes out; I will close this issue once it is released 👍

zcbenz commented 1 week ago

Thanks a lot!