NB: This issue does not refer to the decoder part of the transformer model, but to the decoder belonging to the Hugging Face tokenizer. The tokenizer's decoder converts token ids back to text: it joins subword tokens and puts the textual tokens back together into a single string (sometimes called 'detokenization').
Using the built-in WordPiece decoder leads to some unwanted results, namely:
Spaces before forward slashes (which serve as comma-like punctuation symbols in older texts), e.g. der tötende Ausbruch / heißt nicht leben /
Spaces before colons, e.g. sondern auch sonsten :
Spaces around intra-word-hyphens, e.g. zieret seine Amts - Gaben
To solve these problems we would need a custom decoder for the tokenizer. There is an issue on GitHub with a simple example.
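Until a custom decoder exists, the three cases above could be approximated by post-processing the decoded string. The following is only a minimal sketch under stated assumptions: it is plain Python with `re`, it is not part of the tokenizers API, and the function name `clean_decoded_text` is hypothetical. It also assumes that every ` - ` between two word characters is an intra-word hyphen, which would wrongly join a dash used as punctuation.

```python
import re


def clean_decoded_text(text: str) -> str:
    """Post-process a string produced by the WordPiece decoder.

    Hypothetical helper, not part of the tokenizers library.
    """
    # Drop whitespace before forward slashes and colons:
    # "Ausbruch /" becomes "Ausbruch/", "sonsten :" becomes "sonsten:".
    text = re.sub(r"\s+([/:])", r"\1", text)
    # Join hyphens that sit between two word characters:
    # "Amts - Gaben" becomes "Amts-Gaben".
    # Assumption: such hyphens are always intra-word hyphens.
    text = re.sub(r"(?<=\w)\s+-\s+(?=\w)", "-", text)
    return text


if __name__ == "__main__":
    for sample in (
        "der tötende Ausbruch / heißt nicht leben /",
        "sondern auch sonsten :",
        "zieret seine Amts - Gaben",
    ):
        print(clean_decoded_text(sample))
```

Such a function would be applied to the output of the tokenizer's decode step. A proper custom decoder, as requested in this issue, would make this extra step unnecessary.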
Further material:
Code base for the original decoder implementation (in Rust)