ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Implement custom `tokenizers.decoders.Decoder` for output serialization #53

Open ybracke opened 1 year ago

ybracke commented 1 year ago

NB: This issue does not refer to the decoder part of the transformer model, but to the decoder belonging to the Hugging Face tokenizer. The tokenizer's decoder converts token ids back to text: it joins subword tokens and reassembles the textual tokens into a single string (sometimes called 'detokenization').
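
As a rough illustration of those steps, here is a minimal pure-Python sketch. The toy vocabulary, the `##` continuation prefix, and the function name `detokenize` are assumptions made up for this example; the code does not call the `tokenizers` library and is not its actual implementation.

```python
from typing import Dict, List

# Hypothetical toy vocabulary (id -> token); "##" marks a continuation piece.
ID_TO_TOKEN: Dict[int, str] = {0: "die", 1: "Kön", 2: "##ig", 3: "##in", 4: "."}


def detokenize(ids: List[int], prefix: str = "##") -> str:
    """Map ids to tokens, merge continuation pieces, join words with spaces."""
    # Step 1: convert token ids back to textual tokens.
    tokens = [ID_TO_TOKEN[i] for i in ids]

    # Step 2: glue each continuation piece onto the preceding word.
    words: List[str] = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]
        else:
            # A leading continuation piece with no predecessor is kept as-is.
            words.append(tok)

    # Step 3: put the words back together into a single string.
    return " ".join(words)


print(detokenize([0, 1, 2, 3, 4]))  # die Königin .
```

Note that this naive version leaves a space before the final period; how whitespace around punctuation is handled is exactly the kind of decision a decoder has to make.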

Using the built-in WordPiece decoder leads to some unwanted results, namely:

To solve these problems, we would need a custom decoder for the tokenizer. There is an issue on GitHub with a simple example.
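
A custom decoder could look roughly like the sketch below. Everything in it is an assumption for illustration: the class name `SimpleJoinDecoder` is hypothetical, and its policy (merge `##` pieces, join all other tokens with a single space, no further cleanup) is a placeholder rather than the behavior this issue settles on. The method names `decode` and `decode_chain` follow what the `tokenizers` custom-component mechanism is understood to expect; this should be checked against the installed version.

```python
from typing import List


class SimpleJoinDecoder:
    """Hypothetical decoder: merges '##' pieces and applies no other cleanup."""

    def __init__(self, prefix: str = "##") -> None:
        self.prefix = prefix

    def decode_chain(self, tokens: List[str]) -> List[str]:
        """Merge continuation pieces into the preceding token."""
        merged: List[str] = []
        for tok in tokens:
            if tok.startswith(self.prefix) and merged:
                merged[-1] += tok[len(self.prefix):]
            else:
                merged.append(tok)
        return merged

    def decode(self, tokens: List[str]) -> str:
        """Return the merged tokens as one space-separated string."""
        return " ".join(self.decode_chain(tokens))


# Assumption: with the `tokenizers` package installed, the object could be
# attached to an existing tokenizer roughly like this (not executed here):
#   from tokenizers.decoders import Decoder
#   tokenizer.decoder = Decoder.custom(SimpleJoinDecoder())

decoder = SimpleJoinDecoder()
print(decoder.decode(["Kön", "##ig", "##in", ","]))  # Königin ,
```

Keeping the merging logic in `decode_chain` and the joining in `decode` makes it easy to swap in a different whitespace policy later without touching the subword handling.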

Further material: