ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Improve training data: post-correct CAB-normalized DTA texts #62

Closed ybracke closed 9 months ago

ybracke commented 1 year ago

The CAB-normalized versions of the DTA can be used as training data for an ML-based normalizer. While the CAB-normalizations have proven to work well, they are not perfect. Here are a couple of issues with the data:

Workflow to tackle the issues

Some of these issues can be dealt with automatically. This is done with the following scripts/workflows:

  1. serialize-dta-ddctabs

    1. Replacement of unwanted underscores by spaces (will_es -> will es) on the normalized text layer.
    2. Removal of (generally) unwanted spaces between two capitalized word parts connected by a hyphen (Süd- Westen -> Süd-Westen) on original and normalized text layer.
    3. Replacement of tokens on the normalized text layer to follow modern German orthography
      • This requires the compilation of a mapping from old to new orthographic forms. I compiled such a mapping with the help of SMOR; the project can be found here: oldorth-list. A simplified sketch of these three replacement steps is given below the list.
  2. langtool-correct

    • Apply selected LanguageTool rules and categories to a collection of plain text files. In the context of this project, it mainly serves to fix words that are wrongly written separately or together (e.g. aller Hand -> allerhand; categories COMPOUNDING, EMPFOHLENE_RECHTSCHREIBUNG) and to catch some further old spellings (rule: OLD_SPELLING). A minimal sketch of the rule filtering is given below the list.
    • It might be sensible or even necessary to run multiple iterations and to create several versions of the data that are progressively more modified.
    • Includes a notebook to inspect the differences between the modified versions

  3. Optional manual post-correction
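The following is a minimal sketch of the three replacement steps in serialize-dta-ddctabs, not the actual implementation: the regular expressions, the function names, and the assumption that the orthographic mapping is a plain token-to-token dictionary (illustrated with a made-up entry) are illustrative only.

```python
import re


def fix_underscores(norm: str) -> str:
    """Replace unwanted underscores on the normalized layer with spaces,
    e.g. "will_es" -> "will es"."""
    return re.sub(r"(?<=\w)_(?=\w)", " ", norm)


def fix_hyphen_spaces(text: str) -> str:
    """Remove the unwanted space between two capitalized word parts joined
    by a hyphen, e.g. "Süd- Westen" -> "Süd-Westen"."""
    return re.sub(r"\b([A-ZÄÖÜ]\w*-) ([A-ZÄÖÜ]\w*)", r"\1\2", text)


def apply_orth_mapping(tokens: list[str], mapping: dict[str, str]) -> list[str]:
    """Replace old orthographic forms with their modern counterparts, using a
    precompiled mapping such as the one from the oldorth-list project."""
    return [mapping.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    print(fix_underscores("will_es"))                      # will es
    print(fix_hyphen_spaces("Süd- Westen"))                # Süd-Westen
    print(apply_orth_mapping(["Thür"], {"Thür": "Tür"}))   # ['Tür']
```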
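For the langtool-correct step, a sketch along these lines could restrict corrections to the selected rules and categories. It assumes the language_tool_python package; the Match attribute names (ruleId, category), the helper correct() from its utils module, and the example sentence are assumptions, and the actual script may drive LanguageTool differently.

```python
import language_tool_python
from language_tool_python.utils import correct

# Categories and rules selected for this correction step (see above)
TARGET_CATEGORIES = {"COMPOUNDING", "EMPFOHLENE_RECHTSCHREIBUNG"}
TARGET_RULES = {"OLD_SPELLING"}


def correct_text(text: str, tool: language_tool_python.LanguageTool) -> str:
    """Apply only the selected LanguageTool rules/categories to a text."""
    matches = tool.check(text)
    selected = [
        m for m in matches
        if m.ruleId in TARGET_RULES or m.category in TARGET_CATEGORIES
    ]
    # correct() applies the first suggested replacement of each match
    return correct(text, selected)


if __name__ == "__main__":
    tool = language_tool_python.LanguageTool("de-DE")
    # made-up example where "aller Hand" should be merged to "allerhand"
    print(correct_text("Das ist aller Hand Arbeit.", tool))
    tool.close()
```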

Note: The additional processing of the data (with dta2jsonl) does not apply any further text modifications. It only adds metadata, converts the data to a new format (JSONL), and splits the data sets based on metadata.

TODO: Create a few manually post-corrected normalizations for comparison and problem analysis (DONE for 17th cent., see c3po: /home/bracke/data/dta/dtak/dtak-1600-1699-train-head100-anno.jsonl). Compare the automatic and manual post-corrections (diff) to see how well LanguageTool performs.
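For the planned diff between the automatic and the manual post-corrections, something as simple as difflib should do; the JSONL field name "norm" is an assumption here.

```python
import difflib
import json


def diff_normalizations(auto_path: str, manual_path: str) -> str:
    """Return a unified diff of the normalized text layer between the
    automatically and the manually post-corrected JSONL files."""

    def norm_lines(path: str) -> list[str]:
        with open(path, encoding="utf-8") as f:
            # assumes one JSON object per line with a "norm" field
            return [json.loads(line).get("norm", "") + "\n" for line in f]

    return "".join(
        difflib.unified_diff(
            norm_lines(auto_path),
            norm_lines(manual_path),
            fromfile="automatic",
            tofile="manual",
        )
    )
```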

ybracke commented 1 year ago

We have applied almost all of this workflow to the DTA EvalCorpus as well, because (1) most of the texts are also in old spelling and (2) the texts are token-aligned based on the historical tokenization, so we also have the "aller Hand"/"allerhand" problem.

ybracke commented 9 months ago

Done for DTA EvalCorpus, thus closing