ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Improve training data: post-correct CAB-normalized DTA texts #62

Closed ybracke closed 9 months ago

ybracke commented 1 year ago

The CAB-normalized versions of the DTA can be used as training data for an ML-based normalizer. While the CAB-normalizations have proven to work well, they are not perfect. Here are a couple of issues with the data:

Workflow to tackle the issues

Some of these issues can be dealt with automatically. This is done with the following scripts/workflows:

  1. serialize-dta-ddctabs

    1. Replacement of unwanted underscores by spaces (will_es -> will es) on the normalized text layer.
    2. Removal of (generally) unwanted spaces between two capitalized word parts connected by a hyphen (Süd- Westen -> Süd-Westen) on original and normalized text layer.
    3. Replacement of tokens on the normalized text layer to follow modern German orthography
      • This requires the compilation of a mapping from old to new orthographic forms. I compiled such a mapping with the help of SMOR; the project can be found here: oldorth-list. A simplified sketch of these three replacement steps is given below the list.
  2. langtool-correct

    • Apply selected LanguageTool rules and categories to a collection of plain text files. In the context of this project, it mainly serves to fix words that are wrongly written separately or together (e.g. aller Hand -> allerhand; categories COMPOUNDING, EMPFOHLENE_RECHTSCHREIBUNG) and to catch some further old spellings (rule: OLD_SPELLING). A minimal sketch of the rule filtering is given below the list.
    • It might be sensible or even necessary to run multiple iterations and to create several versions of the data that are progressively more modified.
    • Includes a notebook to inspect the differences between the modified versions

  3. Optional manual post-correction
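The following is a minimal sketch of the three replacement steps in serialize-dta-ddctabs, not the actual implementation: the regular expressions, the function names, and the assumption that the orthographic mapping is a plain token-to-token dictionary (illustrated with a made-up entry) are illustrative only.

```python
import re


def fix_underscores(norm: str) -> str:
    """Replace unwanted underscores on the normalized layer with spaces,
    e.g. "will_es" -> "will es"."""
    return re.sub(r"(?<=\w)_(?=\w)", " ", norm)


def fix_hyphen_spaces(text: str) -> str:
    """Remove the unwanted space between two capitalized word parts joined
    by a hyphen, e.g. "Süd- Westen" -> "Süd-Westen"."""
    return re.sub(r"\b([A-ZÄÖÜ]\w*-) ([A-ZÄÖÜ]\w*)", r"\1\2", text)


def apply_orth_mapping(tokens: list[str], mapping: dict[str, str]) -> list[str]:
    """Replace old orthographic forms with their modern counterparts, using a
    precompiled mapping such as the one from the oldorth-list project."""
    return [mapping.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    print(fix_underscores("will_es"))                      # will es
    print(fix_hyphen_spaces("Süd- Westen"))                # Süd-Westen
    print(apply_orth_mapping(["Thür"], {"Thür": "Tür"}))   # ['Tür']
```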
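For the langtool-correct step, a sketch along these lines could restrict corrections to the selected rules and categories. It assumes the language_tool_python package; the Match attribute names (ruleId, category), the helper correct() from its utils module, and the example sentence are assumptions, and the actual script may drive LanguageTool differently.

```python
import language_tool_python
from language_tool_python.utils import correct

# Categories and rules selected for this correction step (see above)
TARGET_CATEGORIES = {"COMPOUNDING", "EMPFOHLENE_RECHTSCHREIBUNG"}
TARGET_RULES = {"OLD_SPELLING"}


def correct_text(text: str, tool: language_tool_python.LanguageTool) -> str:
    """Apply only the selected LanguageTool rules/categories to a text."""
    matches = tool.check(text)
    selected = [
        m for m in matches
        if m.ruleId in TARGET_RULES or m.category in TARGET_CATEGORIES
    ]
    # correct() applies the first suggested replacement of each match
    return correct(text, selected)


if __name__ == "__main__":
    tool = language_tool_python.LanguageTool("de-DE")
    # made-up example where "aller Hand" should be merged to "allerhand"
    print(correct_text("Das ist aller Hand Arbeit.", tool))
    tool.close()
```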

Note: The additional processing of the data (with dta2jsonl) does not apply any further text modifications. It only adds metadata, converts the data to a new format (JSONL), and splits the data sets based on metadata.

TODO: Create a few manually post-corrected normalizations for comparison and problem analysis (DONE for 17th cent., see c3po: /home/bracke/data/dta/dtak/dtak-1600-1699-train-head100-anno.jsonl). Compare the automatic and manual post-corrections (diff) to see how well LanguageTool performs.
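For the planned diff between the automatic and the manual post-corrections, something as simple as difflib should do; the JSONL field name "norm" is an assumption here.

```python
import difflib
import json


def diff_normalizations(auto_path: str, manual_path: str) -> str:
    """Return a unified diff of the normalized text layer between the
    automatically and the manually post-corrected JSONL files."""

    def norm_lines(path: str) -> list[str]:
        with open(path, encoding="utf-8") as f:
            # assumes one JSON object per line with a "norm" field
            return [json.loads(line).get("norm", "") + "\n" for line in f]

    return "".join(
        difflib.unified_diff(
            norm_lines(auto_path),
            norm_lines(manual_path),
            fromfile="automatic",
            tofile="manual",
        )
    )
```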

ybracke commented 1 year ago

We have applied almost all of this workflow to the DTA EvalCorpus as well, because (1) most of the texts are also in old spelling and (2) the texts are token-aligned based on the historical tokenization, so we also have the "aller Hand"/"allerhand" problem.

ybracke commented 9 months ago

Done for DTA EvalCorpus, thus closing