ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Load training data from XML version DTA EvalCorpus and keep joined tokens #17

Closed ybracke closed 1 year ago

ybracke commented 1 year ago

In the DTA EvalCorpus, the old and modern versions of a publication are aligned at the token level.
If the modern version contains a single token where the old version contains two or more tokens, e.g. zurechtwies vs. zurecht wieß, this is represented in the data as follows:

<w class="JOIN" new="zurechtwies" old="zurecht wieß" pok="1" seen="1" wok="1">
  <w class="JOIN" new="zurecht" old="zurecht" pok="1" seen="1" wok="1"/>
  <w class="JOIN" new="wies" old="wieß" pok="1" seen="1" wok="1"/>
</w>

The inner <w> elements are used for a token-aligned version where the old version is the base layer. Here, the new token is split as well, leading to a normalization that is intuitively not ideal (zurecht wieß --> zurecht wies). Currently, this is the gold normalization I use for training, because I use a plain text format that was created from the inner <w> elements in these cases. Thus, during training the model never sees two or more input tokens merged into a single output token and hence does not learn to merge them.

I should change my training data loader to use the XML version of the corpus. Then I can adjust how the loader deals with these "JOIN" cases, i.e. have it use the outer <w> instead of the inner ones.
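To make the difference between the two layers concrete, here is a minimal sketch that reads the JOIN element shown above with Python's standard `xml.etree.ElementTree` and extracts both alignments. The variable names are only illustrative and are not taken from the transnormer code base.

```python
import xml.etree.ElementTree as ET

# The JOIN element from the example above, copied verbatim.
JOIN_XML = """
<w class="JOIN" new="zurechtwies" old="zurecht wieß" pok="1" seen="1" wok="1">
  <w class="JOIN" new="zurecht" old="zurecht" pok="1" seen="1" wok="1"/>
  <w class="JOIN" new="wies" old="wieß" pok="1" seen="1" wok="1"/>
</w>
"""

outer = ET.fromstring(JOIN_XML.strip())

# Inner layer: one (old, new) pair per old token; the modern token is split.
inner_pairs = [(w.get("old"), w.get("new")) for w in outer.findall("w")]

# Outer layer: a single pair that keeps the modern token joined.
outer_pair = (outer.get("old"), outer.get("new"))

# Target string that results from concatenating the inner layer.
inner_target = " ".join(new for _, new in inner_pairs)

print(inner_pairs)   # [('zurecht', 'zurecht'), ('wieß', 'wies')]
print(outer_pair)    # ('zurecht wieß', 'zurechtwies')
print(inner_target)  # zurecht wies
```

Only the outer pair gives the model a training example in which two source tokens map to one target token.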

ybracke commented 1 year ago

The new XML loader iterates over <w> elements that are immediate children of <s>. This excludes nested <w> elements (which are used in case of a tokenization mismatch between old and new, i.e. when class="JOIN").
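The iteration described here could look roughly like the following sketch. It is not the actual loader: the function name `load_sentence_pairs`, the `<doc>` root element, and the extra token `er` in the sample are assumptions made for illustration, and attributes other than `old`, `new` and `class` are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; only the JOIN element mirrors the corpus example.
SAMPLE = """
<doc>
  <s>
    <w new="er" old="er"/>
    <w class="JOIN" new="zurechtwies" old="zurecht wieß">
      <w class="JOIN" new="zurecht" old="zurecht"/>
      <w class="JOIN" new="wies" old="wieß"/>
    </w>
  </s>
</doc>
"""


def load_sentence_pairs(xml_text):
    """Return one (old, new) string pair per <s>, built from top-level <w> only."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for s in root.iter("s"):
        # findall("w") matches direct children only, so nested <w> are skipped.
        # s.iter("w") would also descend into the JOIN element and yield the inner tokens.
        top_level = s.findall("w")
        old = " ".join(w.get("old") for w in top_level)
        new = " ".join(w.get("new") for w in top_level)
        pairs.append((old, new))
    return pairs


pairs = load_sentence_pairs(SAMPLE)
print(pairs)  # [('er zurecht wieß', 'er zurechtwies')]
```

Relying on `findall("w")` instead of a recursive search means the JOIN case needs no special handling: the outer element already carries the joined `new` value.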