Closed: ybracke closed this issue 9 months ago
We have applied almost all of this workflow also to the DTA EvalCorpus, because (1) most of the texts are also in old spelling and (2) the texts are token-aligned based on historical tokenization, so we also have the "aller Hand"/"allerhand" problem.
Done for DTA EvalCorpus, thus closing
The CAB-normalized versions of the DTA can be used as training data for an ML-based normalizer. While the CAB normalizations have proven themselves in practice, they are not perfect. Here are a couple of issues with the data:
* Erroneous underscores in the normalized layer (e.g. `will_es` instead of `will es`).
* Erroneous whitespace after hyphens (e.g. `Süd- Westen` instead of `Süd-Westen`).

**Workflow to tackle the issues**
Some of these issues can be dealt with automatically. This is done with the following scripts/workflows:
[1. `serialize-dta-ddctabs`]

* Fixes erroneous underscores (e.g. `will_es` -> `will es`) on the normalized text layer.
* Fixes erroneous whitespace after hyphens (e.g. `Süd- Westen` -> `Süd-Westen`) on the original and normalized text layers.
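Below is a minimal sketch of the kind of cleanup this step performs, assuming plain string/regex replacements; the actual `serialize-dta-ddctabs` implementation may differ.

```python
import re

def fix_underscores(norm: str) -> str:
    """Remove erroneous underscores on the normalized layer, e.g. 'will_es' -> 'will es'."""
    return norm.replace("_", " ")

def fix_hyphen_whitespace(text: str) -> str:
    """Remove erroneous whitespace after a hyphen, e.g. 'Süd- Westen' -> 'Süd-Westen'."""
    return re.sub(r"(\w)- (\w)", r"\1-\2", text)

assert fix_underscores("will_es") == "will es"
assert fix_hyphen_whitespace("Süd- Westen") == "Süd-Westen"
```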
[2. `langtool-correct`]

Uses LanguageTool, together with the `oldorth-list`, to fix compounds that were normalized as two tokens (rules: `COMPOUNDING`, `EMPFOHLENE_RECHTSCHREIBUNG`) and to catch some further old spellings (rule: `OLD_SPELLING`).
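Here is a sketch of applying only these LanguageTool rules, assuming the `language_tool_python` package; the real `langtool-correct` script may invoke LanguageTool differently.

```python
import language_tool_python

SELECTED_RULES = {"COMPOUNDING", "EMPFOHLENE_RECHTSCHREIBUNG", "OLD_SPELLING"}

def correct_with_selected_rules(text: str, tool: language_tool_python.LanguageTool) -> str:
    """Apply only the suggestions of the selected rules to `text`."""
    matches = [m for m in tool.check(text) if m.ruleId in SELECTED_RULES and m.replacements]
    # Apply replacements right-to-left so earlier offsets stay valid.
    for m in sorted(matches, key=lambda m: m.offset, reverse=True):
        text = text[: m.offset] + m.replacements[0] + text[m.offset + m.errorLength :]
    return text

tool = language_tool_python.LanguageTool("de-DE")
# Prints the corrected sentence (unchanged if no selected rule fires).
print(correct_with_selected_rules("Das geschieht auf alle Fälle.", tool))
```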
[3. Optional manual post-correction]
Note: The additional processing of the data (with `dta2jsonl`) does not apply further text modifications. It only adds metadata, converts the data into a new format (JSONL), and splits the data sets based on metadata.

TODO: Create a few manually post-corrected normalizations for comparison and problem analysis (DONE for 17th cent., see c3po: /home/bracke/data/dta/dtak/dtak-1600-1699-train-head100-anno.jsonl). Compare the automatic and manual post-corrections (diff) to see how well LanguageTool does.
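For the comparison step, a minimal sketch using `difflib`, assuming the automatic and manual post-corrections are parallel JSONL files with a `norm` field (the field and file names are hypothetical; adapt to the actual schema):

```python
import difflib
import json

def load_norms(path: str) -> list[str]:
    """Read the normalized text of every record in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["norm"] for line in f]

auto = load_norms("auto-postcorrected.jsonl")      # hypothetical file names
manual = load_norms("manual-postcorrected.jsonl")

# Print a unified diff for every record where the two versions disagree.
for i, (a, m) in enumerate(zip(auto, manual)):
    if a != m:
        print(f"record {i}:")
        print("\n".join(difflib.unified_diff([a], [m], lineterm="")))
```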