Implement custom `tokenizers.decoders.Decoder` for output serialization

NB: This issue does not refer to the decoder part of the transformer model, but to the decoder that belongs to the Hugging Face tokenizer. The tokenizer's decoder converts token ids back to text: it joins subword tokens and reassembles the textual tokens into a single string (sometimes called 'detokenization').

Using the built-in WordPiece decoder leads to some unwanted results, namely:

- Spaces before forward slashes (which serve as comma-like punctuation symbols in older texts), e.g. der tötende Ausbruch / heißt nicht leben /
- Spaces before colons, e.g. sondern auch sonsten :
- Spaces around intra-word hyphens, e.g. zieret seine Amts - Gaben

To solve these problems, we would need a custom decoder for the tokenizer. There is an issue on GitHub with a simple example.
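
As a rough illustration of what such a decoder could do, here is a minimal pure-Python sketch. The class name `CleanWordPieceDecoder` and its regex rules are hypothetical and not part of transnormer or the tokenizers library; the sketch only assumes that a custom decoder receives a list of string tokens and returns a single string. How the object is registered with the tokenizer (the Python bindings appear to offer `decoders.Decoder.custom(...)`, and some versions may expect an additional `decode_chain` method) depends on the installed tokenizers version and should be checked against its documentation.

```python
import re
from typing import List


class CleanWordPieceDecoder:
    """Hypothetical decoder: joins WordPiece tokens, then repairs spacing."""

    def __init__(self, prefix: str = "##") -> None:
        # Assumption: continuation pieces carry the usual WordPiece prefix "##".
        self.prefix = prefix

    def decode(self, tokens: List[str]) -> str:
        # Step 1: WordPiece-style join. Continuation pieces are glued to the
        # preceding token, all other tokens are separated by a single space.
        text = ""
        for token in tokens:
            if token.startswith(self.prefix):
                text += token[len(self.prefix):]
            elif text:
                text += " " + token
            else:
                text = token
        # Step 2: drop the space before a forward slash or a colon.
        text = re.sub(r" ([/:])", r"\1", text)
        # Step 3: close up a hyphen that stands between two word characters.
        text = re.sub(r"(?<=\w) - (?=\w)", "-", text)
        return text


decoder = CleanWordPieceDecoder()
print(decoder.decode(["zieret", "seine", "Amts", "-", "Gaben"]))
print(decoder.decode(["sondern", "auch", "sonst", "##en", ":"]))
```

Note that the hyphen rule is deliberately simple: it would also close up a spaced dash that was meant as a separate punctuation mark, so real data may need a more careful rule.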

Further material:

Code base for the original decoder implementation (in Rust)

ybracke / transnormer

Implement custom `tokenizers.decoders.Decoder` for output serialization #53