Add intermediate files for curated and segmented data

ninpnin commented 3 years ago

A corpus is generated by taking the original data, curating it, segmenting it, and then outputting it in some format. The data thus passes through three versions before the final output. I suggest saving these per protocol as original.xml, curated.xml, and segmented.xml, using the following schema:

<protocol id="prot--1975-1">
    <contentBlock page="4" index="3">
        <paragraph who="Helloworld Person" segment="speech">Hello world!</paragraph>
        <paragraph segment="notes">Some more text goes here.</paragraph>
    </contentBlock>
</protocol>
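
To illustrate how one reader could serve all three files, here is a minimal sketch that parses the example above with Python's standard library. The function name `read_paragraphs` and the flat dict layout are my own assumptions for illustration, not existing code in the repository.

```python
import xml.etree.ElementTree as ET

# The example protocol from this issue, inlined so the sketch is self-contained.
EXAMPLE = """<protocol id="prot--1975-1">
    <contentBlock page="4" index="3">
        <paragraph who="Helloworld Person" segment="speech">Hello world!</paragraph>
        <paragraph segment="notes">Some more text goes here.</paragraph>
    </contentBlock>
</protocol>"""


def read_paragraphs(xml_text):
    """Return one dict per <paragraph>, tagged with its contentBlock's page and index.

    Hypothetical helper: the name and the output layout are assumptions.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for block in root.iter("contentBlock"):
        for par in block.iter("paragraph"):
            rows.append({
                "protocol": root.get("id"),
                "page": int(block.get("page")),
                "index": int(block.get("index")),
                "who": par.get("who"),          # None when the attribute is absent
                "segment": par.get("segment"),
                "text": par.text or "",
            })
    return rows


rows = read_paragraphs(EXAMPLE)
for row in rows:
    print(row["segment"], row["who"], row["text"])
```

Because the three files would share the schema, the same function could read original.xml, curated.xml, or segmented.xml without changes.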

Curation and segmentation instances would still be logged to a file, so that each one can be traced back to the pattern that produced it.

Currently, an equivalent of original.xml is stored in data/raw but is not committed to git. Moreover, the curated version of the data is obtained by reading original.xml and then applying the curation instances, which yields essentially the same content that curated.xml would hold.
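
A rough sketch of that "read original, apply instances" step, assuming the schema above. The instance format used here (a dict keyed by page and index, holding text substitutions) is purely hypothetical and is not how the curation database is actually laid out; `apply_curations` is likewise an invented name.

```python
import copy
import xml.etree.ElementTree as ET


def apply_curations(original_root, instances):
    """Return a curated copy of the tree, leaving the original untouched.

    `instances` maps (page, index) -> list of (old, new) text substitutions.
    This format is an assumption made for the sketch.
    """
    curated = copy.deepcopy(original_root)
    for block in curated.iter("contentBlock"):
        key = (block.get("page"), block.get("index"))
        for old, new in instances.get(key, []):
            for par in block.iter("paragraph"):
                if par.text:
                    par.text = par.text.replace(old, new)
    return curated


original = ET.fromstring(
    '<protocol id="prot--1975-1">'
    '<contentBlock page="4" index="3">'
    "<paragraph>Helo world!</paragraph>"
    "</contentBlock>"
    "</protocol>"
)
instances = {("4", "3"): [("Helo", "Hello")]}

curated = apply_curations(original, instances)
curated_xml = ET.tostring(curated, encoding="unicode")
print(curated_xml)
```

Writing `curated_xml` to disk would produce the proposed curated.xml; keeping it in memory corresponds to the current behavior.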

A consistent format through different phases of the pipeline would harmonize and simplify the code.

ninpnin commented 3 years ago

The protocols are now processed in this schema in the data/protocols folder. Here's an example: https://github.com/welfare-state-analytics/riksdagen-corpus/blob/dev/data/protocols/prot-1921--ak--39/original.xml

The intermediate steps are generated from these original files, not saved on disk, and the process still relies on the curation and segmentation databases. Now that the format is harmonized, though, this decision can be changed with relatively little effort if we want to do that at some point.

ninpnin commented 3 years ago

Closing for now. I don't think saving the intermediate files on disk is necessary at this point.

welfare-state-analytics / riksdagen-corpus-old

Add intermediate files for curated and segmented data #28