welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Discussion: Annotate layout and ordering information in document #101

Closed MansMeg closed 1 year ago

MansMeg commented 2 years ago

The protocols are currently digital representations of the physical documents. This means that we need information on how to handle hyphenation ("avstavning"), and that text which belongs together is now split across separate text blocks. As an example, see protocol 14, year 1921 (ak), at line 149.

This actually boils down to what we want to keep in the corpus: a representation of the actual physical documents/protocols, or the text content of the protocols? I previously thought that the main thing we would be interested in was the connection to the physical documents, but after reading Sinikallio et al. (2021) I'm not so sure. They focus on the actual text rather than the physical copy, and combine the text from different text blocks. In a way, this makes sense for a corpus such as ours, especially as more and more of the protocols become available in digital rather than analogue format. Still, historians want to know exactly which page a piece of text comes from.

I think to solve this we need to take three things into account:

Suggestion: I think we should go in the direction of Sinikallio et al. (2021) and treat the content of the protocols as the relevant part. We then treat the connections to the physical copy as additional metadata/annotations that can be used to link back to it directly. We also remove the page data that is not part of the body text of the protocols (i.e. marginal notes such as the page number and date on each page). An example of this is "(Forts.)" below. It is simply an artifact that tells the reader how the body text is structured in the actual document.

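Example. Note: this is a made-up example. The first block (two segments with a note and a page break between them) shows the current representation; the final single segment shows the suggested one.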

<seg n="bd253ed3">
interpellation. för frågans behandling av vederbörande stadsmyndigheter i Göte-
</seg>
<note n="52dd7536">
(Forts.)
</note>
<pb n="1" facs="https://betalab.kb.se/prot-1921--ak--14/prot_1921__ak__14-001.jp2/_view"/>
<seg n="bd253ed3">
borg är av vikt, att besked snarast lämnas uti av interpellanten
berörda avseenden, har jag ansett mig böra redan nu lämna ett
svar.
</seg>
<seg n="bd253ed3">
interpellation. för frågans behandling av vederbörande stadsmyndigheter i Göte<pb n="1" facs="https://betalab.kb.se/prot-1921--ak--14/prot_1921__ak__14-001.jp2/_view"/>borg är av vikt, att besked snarast lämnas uti av interpellanten berörda avseenden, har jag ansett mig böra redan nu lämna ett svar.
</seg>
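
The merge shown in the example can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `merge_across_pagebreak` and the wrapping `<div>` are hypothetical, the input is the made-up example, and a trailing hyphen is always treated as a line-break hyphen, which would be wrong for a word that genuinely ends in a hyphen. The "(Forts.)" note is simply left out of the merged segment.

```python
import xml.etree.ElementTree as ET

# The made-up example, wrapped in a hypothetical <div> so that it parses.
SAMPLE = """<div>
<seg n="bd253ed3">
interpellation. för frågans behandling av vederbörande stadsmyndigheter i Göte-
</seg>
<note n="52dd7536">
(Forts.)
</note>
<pb n="1" facs="https://betalab.kb.se/prot-1921--ak--14/prot_1921__ak__14-001.jp2/_view"/>
<seg n="bd253ed3">
borg är av vikt, att besked snarast lämnas uti av interpellanten
berörda avseenden, har jag ansett mig böra redan nu lämna ett
svar.
</seg>
</div>"""


def merge_across_pagebreak(first_seg, pb, second_seg):
    """Return one <seg> holding both texts, with the <pb> kept inline."""
    # Collapse line breaks and runs of whitespace inside each block.
    head = " ".join((first_seg.text or "").split())
    tail = " ".join((second_seg.text or "").split())
    if head.endswith("-"):
        # Assumption: a trailing hyphen marks a word split across the break.
        head = head[:-1]
    else:
        head += " "
    merged = ET.Element("seg", dict(first_seg.attrib))
    merged.text = head
    # The page break becomes a child element; the second text is its tail.
    inline_pb = ET.SubElement(merged, "pb", dict(pb.attrib))
    inline_pb.tail = tail
    return merged


root = ET.fromstring(SAMPLE)
first_seg, second_seg = root.findall("seg")
merged = merge_across_pagebreak(first_seg, root.find("pb"), second_seg)
print(ET.tostring(merged, encoding="unicode"))
```

Keeping the `<pb>` inside the segment preserves the page reference that historians need, while the running text reads "Göteborg" when the markup is ignored.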
MansMeg commented 2 years ago

It would be really interesting to hear the opinions of you all on this: @ljo, @Stubbendorff, @ninpnin, @rbbby, @fredrik1984

Fredrik, I actually think I changed my mind since the last time we spoke. =)

Stubbendorff commented 2 years ago

Sounds good, as long as page numbers are kept at some level. Those switching between close and distant reading need page numbers to cite examples.

fredrik1984 commented 2 years ago

I agree, sounds like a good idea. And I actually thought that this was our idea from the beginning!

ninpnin commented 2 years ago

I am still hesitant about deleting large parts of the data. It is easy to get this wrong and not even notice.

Additionally, I actually thought of going in the opposite direction: we could test that the data has exactly the same words per page as the OCR'd ALTO files.
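
A rough sketch of what such a test could look like, in Python. Everything here is an assumption for illustration: the helper names are hypothetical, the data is made up, I assume one ALTO file per page with words stored in `String` elements' `CONTENT` attribute, and I assume the `n` attribute of `<pb>` identifies the page. Words are compared as whitespace-separated tokens, so words hyphenated across lines would need extra handling.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip an XML namespace, e.g. '{ns}String' -> 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_words(alto_root):
    """Count the words on one ALTO page, read from String/@CONTENT."""
    return Counter(
        el.attrib["CONTENT"]
        for el in alto_root.iter()
        if local(el.tag) == "String" and "CONTENT" in el.attrib
    )


def corpus_words_per_page(root):
    """Walk the corpus file in document order and count words per <pb>."""
    pages = {}
    state = {"page": None}  # text before the first <pb> is ignored

    def count(text):
        if text and state["page"] is not None:
            pages[state["page"]].update(text.split())

    def walk(el):
        if local(el.tag) == "pb":
            state["page"] = el.attrib.get("n")
            pages.setdefault(state["page"], Counter())
        else:
            count(el.text)
        for child in el:
            walk(child)
            count(child.tail)  # text that follows the child element

    walk(root)
    return pages


# Made-up data: a two-page corpus fragment and the ALTO file for page 1.
corpus = ET.fromstring(
    '<div><pb n="1"/><seg>ett svar.</seg><note>(Forts.)</note>'
    '<pb n="2"/><seg>Herr talman!</seg></div>'
)
alto_page_1 = ET.fromstring(
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"><Layout><Page>'
    '<PrintSpace><TextBlock><TextLine>'
    '<String CONTENT="ett"/><String CONTENT="svar."/><String CONTENT="(Forts.)"/>'
    '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>'
)

pages = corpus_words_per_page(corpus)
print(pages["1"] == alto_words(alto_page_1))
```

Comparing `Counter` objects checks that the same words occur the same number of times on a page while ignoring their order, so reordering text blocks would not fail the test but dropping a note such as "(Forts.)" would.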

MansMeg commented 2 years ago

That's a good point, @ninpnin. I interpret this as: we still "care" about the general body text in the documents, but we keep the layout information as metadata. The problem is just where to put this layout metadata, since it is not actually part of the text but rather part of the page. The page break makes sense to put inside a segment, but what would we do with this additional metadata? Right now it looks like it is part of, or a note in, the body text:

<seg n="bd253ed3">
interpellation. för frågans behandling av vederbörande stadsmyndigheter i Göte-
</seg>
<note n="52dd7536">
(Forts.)
</note>
<pb n="1" facs="https://betalab.kb.se/prot-1921--ak--14/prot_1921__ak__14-001.jp2/_view"/>
<seg n="bd253ed3">
borg är av vikt, att besked snarast lämnas uti av interpellanten
berörda avseenden, har jag ansett mig böra redan nu lämna ett
svar.
</seg>
ninpnin commented 1 year ago

Resolved: we keep everything.