Synthetic training data

ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.

GNU General Public License v3.0


Norma might be the easiest approach; see ~/code/norma/HOWTO.md

- potential drawback: the translation is deterministic and can't produce k-best output (unlike a transformer)
- it might be reasonable not to use the lexicon option (but I'm not sure, so just check both)

What we would need:

- Data to train Norma
  - format: token-aligned TSV, reversed (norm\torig)
  - available data: dta eval, ridges bollmann, perhaps GerManC
- Historic lexicon as "target" lexicon (create from DTA; does something exist already?)
- Modern text to convert into a historic variant
  - available data: Leipzig Zeitung corpora; books, newspapers, etc. from the DWDS
  - convert to the correct format, e.g. by using the WASTE tokenizer or any other tokenizer that gives you sentence boundaries and character offsets
- Serialization script to convert single-token text plus byte offsets back into sentences
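
A rough sketch of the two small helper scripts implied by the list above, under stated assumptions: the aligned training data is assumed to come as `orig\tnorm` lines with blank lines between sentences, and the tokenizer output is assumed to be available as `(text, byte offset, byte length)` triples per sentence. Both layouts and the names `reverse_pairs`, `sentence_from_tokens` and `Token` are hypothetical; the actual DTA/RIDGES files and the WASTE output may look different. The sketch uses byte offsets (as in the last list item), not character offsets.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical token layout: (token text, byte offset, byte length)
Token = Tuple[str, int, int]


def reverse_pairs(lines: Iterable[str]) -> Iterator[str]:
    """Turn 'orig<TAB>norm' lines into 'norm<TAB>orig' lines.

    Blank lines (assumed sentence separators) are passed through as ''.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            yield ""
            continue
        # Only the first two columns are used; extra columns are ignored.
        orig, norm = line.split("\t")[:2]
        yield f"{norm}\t{orig}"


def sentence_from_tokens(raw: bytes, tokens: List[Token]) -> str:
    """Recover one sentence by slicing the raw input bytes.

    Slices from the start of the first token to the end of the last token,
    so the original whitespace between tokens is preserved.
    """
    if not tokens:
        return ""
    start = tokens[0][1]
    end = tokens[-1][1] + tokens[-1][2]
    return raw[start:end].decode("utf-8")


if __name__ == "__main__":
    print(list(reverse_pairs(["Thür\tTür\n", "\n", "seyn\tsein\n"])))

    raw = "Die Tür ist zu. Gut.".encode("utf-8")
    sent1 = [("Die", 0, 3), ("Tür", 4, 4), ("ist", 9, 3), ("zu", 13, 2), (".", 15, 1)]
    sent2 = [("Gut", 17, 3), (".", 20, 1)]
    print(sentence_from_tokens(raw, sent1))
    print(sentence_from_tokens(raw, sent2))
```

Slicing the raw bytes instead of joining tokens with spaces avoids having to guess where whitespace belongs (e.g. before punctuation), but it requires keeping the original text around next to the tokenizer output.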
