ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

mT5 Tokenizer #94

Open ybracke opened 3 months ago

ybracke commented 3 months ago

Running the subword tokenizer for mT5 (as opposed to ByT5) works, but it emits the following warning:

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565

Updating transformers from 4.31 to 4.38 (the current version) broke some dependencies, so I have rolled back to 4.31 for now.
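For reference, the pull request linked in the warning introduced a `legacy` flag on the T5 tokenizer, and passing `legacy=False` is the opt-in to the new behaviour. A minimal sketch of how that might look here, assuming the installed transformers version accepts the flag. The checkpoint name `google/mt5-small` and both helper names are placeholders of mine, not something taken from this repository:

```python
from typing import Any, Dict

# Hypothetical checkpoint; the issue does not say which mT5 model is used.
MT5_CHECKPOINT = "google/mt5-small"


def mt5_tokenizer_kwargs(legacy: bool = False) -> Dict[str, Any]:
    """Build the keyword arguments passed on to from_pretrained.

    legacy=False opts in to the behaviour described in the linked
    pull request; legacy=True keeps the old handling.
    """
    return {"legacy": legacy}


def load_mt5_tokenizer(checkpoint: str = MT5_CHECKPOINT, legacy: bool = False):
    """Load the slow (sentencepiece) T5 tokenizer for an mT5 checkpoint."""
    # Imported lazily so that building the kwargs does not require
    # transformers to be installed.
    from transformers import T5Tokenizer

    return T5Tokenizer.from_pretrained(checkpoint, **mt5_tokenizer_kwargs(legacy))
```

Calling `load_mt5_tokenizer()` downloads the checkpoint on first use. Note that `legacy=False` changes how text after special tokens is tokenized, so it would need to be checked against any model that was fine-tuned with the legacy tokenization.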

Previously I also got a different warning when calling from_pretrained. It can be avoided by passing use_fast=False (see branch dev-nofasttokenizer):

UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
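A rough sketch of the `use_fast=False` workaround, together with a small check for the problem this warning describes (unknown tokens in the encoded output). The checkpoint name and the helper names `load_slow_tokenizer` and `contains_unk` are illustrative assumptions of mine and are not taken from the dev-nofasttokenizer branch:

```python
def load_slow_tokenizer(checkpoint: str = "google/mt5-small"):
    """Load the sentencepiece-based (slow) tokenizer instead of the fast one.

    The checkpoint name is a placeholder; substitute the model actually used.
    """
    # Imported lazily so contains_unk can be used without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(checkpoint, use_fast=False)


def contains_unk(tokenizer, text: str) -> bool:
    """Return True if encoding `text` yields the tokenizer's unknown-token id."""
    unk_id = tokenizer.unk_token_id
    if unk_id is None:
        # Tokenizers without an unknown token cannot produce one.
        return False
    input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return unk_id in input_ids
```

Running `contains_unk` over a sample of historical text with both the fast and the slow tokenizer would show whether the missing byte fallback actually matters for the characters in the data.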