Additional gold training data

ybracke commented 1 year ago

See ~/data/Overview.ods or here

ybracke commented 1 year ago

Kloster Chronik Münster

from ~1730
"normalized" with ChatGPT
See e-mail conversation from March 2023 with D. Voltz and here
Ask again

ybracke commented 1 year ago

Simplicissimus

Original is part of DTA
There is a modern German version from 2009, see here
The modern version seems to contain word-by-word translations, but also parts with heavier changes, see this article

ybracke commented 1 year ago

DiBiLit/zeno

https://deutschestextarchiv.de/dibilit/
Susanne and Christian remember historical and modern versions, but they cannot be found (see here)

ybracke commented 1 year ago

Abitur-Klausuren (from 1917 onward)

Project website
ANNIS (Username and PW: see e-mail 2023-04-17)
1917-1944 will be added in May 2023
- Update: Still not added on 2023-05-23
TODO: Check again in early June 2023 and ask again

ybracke commented 1 year ago

DTA Erweiterungskorpus (not really "gold")

Use dta2jsonl. Needs to be adapted because metadata is not available for all DTAE files.
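
A minimal sketch of one way to tolerate missing metadata when writing JSONL records. This is not dta2jsonl's actual interface: the function name `make_record`, the field names in `METADATA_FIELDS`, and the record layout are all assumptions made for illustration.

```python
import json
from typing import Dict, List, Optional

# Hypothetical metadata fields; the real DTA/DTAE field names may differ.
METADATA_FIELDS = ("author", "title", "date")


def make_record(
    basename: str,
    orig: List[str],
    norm: List[str],
    metadata: Optional[Dict[str, object]] = None,
) -> str:
    """Serialize one document as a JSON line, filling absent metadata with null."""
    meta = metadata or {}
    record: Dict[str, object] = {"basename": basename, "orig": orig, "norm": norm}
    for field in METADATA_FIELDS:
        # Files without a metadata entry still get every key, set to None.
        record[field] = meta.get(field)
    return json.dumps(record, ensure_ascii=False)
```

Keeping every key present (with `null` where unknown) means downstream readers do not need a separate code path for DTAE files that lack metadata.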

ybracke commented 1 year ago

Wikisource

"A collection of texts and sources that are either in the public domain or available under a free license. Wikisource is a quality project that makes its texts comparable with the scans of the source."
No normalizations available -> Could still be used for fine-tuning the historic encoder and/or during active learning
Some transliteration is applied, e.g. ē => en/em ; see here
Publication year is available as metadata
Data can be downloaded as plaintext, HTML, epub
- HTML contains typographic markup (<tt>), which can be helpful to identify Latin sequences
Would need a crawler script
Example: Collection "30-jähriger Krieg" (mentioned by M. Boenig)
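
The `<tt>` markup mentioned above could be pulled out of a downloaded HTML page roughly as follows. This is only a sketch using the standard-library `html.parser`; the names `TTSpanExtractor` and `extract_tt_spans` are made up here, and it assumes the page has already been fetched (no crawling is shown).

```python
from html.parser import HTMLParser
from typing import List


class TTSpanExtractor(HTMLParser):
    """Collect the text content of every <tt>...</tt> element."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0               # > 0 while inside a <tt> element
        self._buffer: List[str] = []  # text pieces of the current span
        self.spans: List[str] = []    # finished spans, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "tt":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "tt" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.spans.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_tt_spans(html: str) -> List[str]:
    """Return the text of all <tt> spans found in an HTML string."""
    parser = TTSpanExtractor()
    parser.feed(html)
    parser.close()
    return parser.spans
```

The extracted spans are only candidates for Latin sequences; whether `<tt>` is used consistently across Wikisource texts would still need to be checked.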

Self-description: "Typically, a printed source or a reliable e-text is chosen as the textual basis, and this choice should be made with particular care. The aim of reproducing the text as faithfully as possible, unimpaired by questionable normalizations, is that not only laypeople but also scholars can use the text."

ybracke commented 1 year ago

Anselm

https://ep.liu.se/ecp/087/003/ecp1387003.pdf
14th to 16th century
Use the original data or the NoSta-D data, not the Bollmann split because in the Bollmann split individual documents are not separated - everything is in one big file
Source: https://github.com/coastalcph/histnorm/tree/master/datasets/historical/german
Normalization guidelines
The orig and normalized layers are both fully lowercased
The normalized layer contains some errors; is it really manually annotated?
Not split into sentences

ybracke commented 1 year ago

Historische Korpora IDS

Link

ybracke commented 1 year ago

Referenzkorpus Frühneuhochdeutsch

Link, ANNIS
1300-1650
Annotations: diplomatic transcription, lemma, morphological info (no normalization of overt form, but perhaps this can be generated from lemma+morph info?)
Documentation

ybracke commented 1 year ago

Historisches Vorlesungskorpus

Link
Zürich, started 2018, cannot find more info
Contact: Michael Prinz

ybracke commented 1 year ago

Klosterfrauenkorpus

normalized with CAB
17th century, 8 texts
weird XML or ANNIS, perhaps use Pepper for converting ANNIS to TEI or CoNLL

ybracke commented 1 year ago

"Entwicklung der satzinternen Großschreibung im Deutschen"

Link Münster, Link Hamburg, Link Bamberg
"from 56 handwritten witch-trial interrogation records produced between 1570 and 1665 (approx. 62,000 word forms)"
Project lead: Renata Szczepaniak (now Leipzig)
R. Szczepaniak on 2023-05-23: Will be posted this week on Laudatio

ybracke commented 1 year ago

Gesellschaftliche Wissensproduktion in der Aufklärung

Text- und netzwerkanalytische Diskursrekonstruktion. Die Halleschen Zeitungen und Zeitschriften 1688–1815; Link
"The publication of the complete bibliography of all 356 Halle newspapers and journals is in preparation." -> Unclear whether the texts themselves will be published. But CAB is mentioned in this paper.
Contact: Anne Purschwitz

ybracke commented 1 year ago

Further hints

Ask if anyone is aware of projects that used CAB for normalizations:
- in IT+CL channel (when Frank is back)
- via a mailing list?
- The existing workflow to improve the CAB-normalizations could be applied.
Search in: Historisches Datenzentrum Sachsen-Anhalt and similar projects

ybracke commented 1 year ago

GerManC-GS

Prerequisite: tool to split the tokenized text into sentences - or could I use the POS annotations ($.) for that?
Check whether to use GerManC_GS_XML and exclude headings, stage directions, etc. See:
https://git.zdl.org/ybracke/GerManC-GS
https://github.com/zentrum-lexikographie/eval-de-lemma/blob/b893edcb744adee2ce661d71d915208465950939/src/reader.py#L91
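
The idea of using the POS annotations for sentence splitting could look roughly like this. It is a sketch that assumes the tokens are already available as `(form, pos)` pairs and that sentence-final punctuation is tagged `$.`. The function name `split_sentences` is made up here, and reading the GerManC-GS files is not shown.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word form, POS tag)


def split_sentences(tokens: Iterable[Token], end_tag: str = "$.") -> List[List[str]]:
    """Group word forms into sentences, closing a sentence after each end_tag token."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for form, pos in tokens:
        current.append(form)
        if pos == end_tag:
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens without final punctuation form their own sentence.
        sentences.append(current)
    return sentences
```

Usage example:

```python
tokens = [("Das", "PDS"), ("ist", "VAFIN"), ("gut", "ADJD"), (".", "$."), ("Ja", "ITJ")]
print(split_sentences(tokens))
```

Headings and stage directions often lack final punctuation, so they would merge with the following sentence unless they are excluded beforehand.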
