ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.

GNU General Public License v3.0

Tokenizer consistency #15

Closed ybracke closed 4 months ago

ybracke commented 1 year ago

We currently use two different tokenizers: one for historical language and one for modern language. However, using a single tokenizer seems to be the common approach (or perhaps the only possible one?) if you want to create a Hugging Face model (transformers.EncoderDecoderModel) that is bundled with its tokenizer.
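
To illustrate why the choice of tokenizer matters here, below is a minimal sketch in plain Python. It assumes two toy whitespace tokenizers with made-up vocabularies (`ToyTokenizer`, `historical`, and `modern` are hypothetical stand-ins, not the tokenizers used in this repo or any Hugging Face class). It only shows that token ids are meaningful solely with respect to the vocabulary that produced them, so a model shipped with one tokenizer would mis-decode ids produced under the other.

```python
# Toy stand-ins for two separately trained tokenizers.
# Both vocabularies are invented for illustration only.
class ToyTokenizer:
    def __init__(self, vocab):
        # Assign ids in list order: the first entry gets id 0, and so on.
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text):
        # Whitespace split; unknown tokens map to the "<unk>" id.
        unk = self.token_to_id["<unk>"]
        return [self.token_to_id.get(tok, unk) for tok in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


historical = ToyTokenizer(["<unk>", "vnd", "seyn", "thun"])
modern = ToyTokenizer(["<unk>", "tun", "und", "sein"])

# Ids as a decoder trained on the modern vocabulary would emit them.
target_ids = modern.encode("und sein")

right = modern.decode(target_ids)      # decoded with the matching vocabulary
wrong = historical.decode(target_ids)  # decoded with the other vocabulary

print(target_ids)
print(right)
print(wrong)
```

If only one tokenizer is saved alongside the model, it has to be valid for every place where ids are produced or consumed; otherwise the mapping breaks as in the last line above.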