In the DTA EvalCorpus, the old and modern versions of a publication are aligned at the token level.
If the modern version contains a single token where the old version contains two or more tokens, e.g. zurechtwies vs. zurecht wieß, this is represented in the data as follows:
The inner <w> elements are used for a token-aligned version in which the old version is the base layer. Here, the new token is split as well, leading to a normalization that is intuitively not ideal (zurecht wieß --> zurecht wies). Currently, this is the gold normalization I use for training, because I use a plain-text format that was created from the inner <w> elements in these cases. Thus, during training the model never sees two or more input tokens merged into a single output token and hence does not learn to merge them.
I should change my training data loader to use the XML version of the corpus. Then, I can adjust how the loader deals with these "JOIN" cases, i.e. have it use the outer <w> instead of the inner.
The new XML loader iterates over <w> elements that are immediate children of <s>. This excludes nested <w> elements (which are used in case of a tokenization mismatch between old and new, i.e. when class="JOIN").
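
A minimal sketch of this iteration, using Python's standard `xml.etree.ElementTree`. The attribute names `old` and `new`, the `<doc>` wrapper, the sample data, and the function name `load_pairs` are assumptions for illustration only and may not match the actual corpus schema or loader.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the attribute names "old"/"new" and the <doc> wrapper
# are assumed here, not taken from the corpus documentation.
SAMPLE = """<doc><s>
  <w old="Er" new="Er"/>
  <w class="JOIN" old="zurecht wieß" new="zurechtwies">
    <w old="zurecht" new="zurecht"/>
    <w old="wieß" new="wies"/>
  </w>
</s></doc>"""


def load_pairs(xml_text):
    """Return one list of (old, new) pairs per <s> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # findall("w") matches only immediate children of <s>, so the
        # nested <w> elements inside a JOIN case are skipped.
        pairs = [(w.get("old"), w.get("new")) for w in s.findall("w")]
        sentences.append(pairs)
    return sentences


if __name__ == "__main__":
    for sentence in load_pairs(SAMPLE):
        print(sentence)
```

With this sample, the JOIN case yields the single pair ("zurecht wieß", "zurechtwies") instead of two separately normalized tokens, so a merge example would be present in the training data.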