welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Decide on an XML schema for motions #348

Closed MansMeg closed 10 months ago

MansMeg commented 12 months ago

We currently have two different solutions for storing motions, and it is not clear which approach is better. The two candidates are described below.

The TEI ParlaClarin format

TEI ParlaClarin is a TEI XML format for structuring and annotating parliamentary protocols. https://github.com/clarin-eric/parla-clarin/tree/master

Pro:

Con:

The Riksdagen Motion format with additions

The Swedish parliament already has a format for storing motions, which is used by Riksdagens öppna data. https://www.riksdagen.se/sv/dokument-och-lagar/riksdagens-oppna-data/dokument/ Motions can be downloaded there as XML in zip files.

Pro:

Con:

BobBorges commented 12 months ago

The Riksdagen Motion format with additions ... we would need to add additional tags to the current XML schema.

AFAIC this is a significant Con. I guess we'd have to spend a significant amount of time deciding what gets included and exactly how.

The TEI ParlaClarin format

Can we use this for motions without adding or otherwise fiddling with the specification? It seems to me like it's ideally suited, so +1 for TEI ParlaClarin and -1 for the Riksdagen Motion format.

MansMeg commented 12 months ago

I'm not sure whether TEI has this in its specification; @ninpnin might know. My guess is that both schemas require fiddling, just in different ways.

BobBorges commented 12 months ago

TEI by itself is probably too generic, but ParlaClarin TEI has a lot of what we need -- example. Could you post the photo of the whiteboard after the project meeting where this was discussed? Maybe we could look item by item through the info that we want to encode and see how it would look in ParlaClarin.

fredrik1984 commented 12 months ago

This is roughly the order of priority for annotating content in motions:

  1. Year, chamber, motion number (these are found in the title of each motion), motion body text, MP signers of the motion (at the end of the motion, if possible in the right order)
  2. Motion intro, MP in the motion intro (i.e. the main responsible MP for the motion, either one or two MPs), decision points (att-satser), place and date (e.g. Stockholm den 3 oktober 2007)
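To make the list above concrete, here is a minimal Python sketch of how those items might be laid out in a TEI-style document. Everything in it is an assumption for illustration: the elements used (`div`, `note`, `name`, `p`, `list`, `item`, `closer`, `dateline`, `date`, `signed`) exist in TEI, but whether ParlaClarin would use them this way for motions has not been checked, and the MP names, motion number and chamber value are placeholders.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return a namespace-qualified TEI tag name."""
    return f"{{{TEI_NS}}}{tag}"


def build_motion(m):
    """Build a TEI-style tree for one motion (hypothetical layout, not a decided schema)."""
    root = ET.Element(tei("TEI"))
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    div = ET.SubElement(body, tei("div"), {"type": "motion"})

    # Priority 1: year, chamber and motion number from the motion title.
    for key in ("year", "chamber", "number"):
        ET.SubElement(div, tei("note"), {"type": key}).text = m[key]

    # Priority 2: motion intro with the main responsible MP(s).
    intro = ET.SubElement(div, tei("div"), {"type": "intro"})
    for mp in m["intro_mps"]:
        ET.SubElement(intro, tei("name"), {"type": "mp"}).text = mp

    # Priority 1: motion body text.
    for paragraph in m["body"]:
        ET.SubElement(div, tei("p")).text = paragraph

    # Priority 2: decision points (att-satser).
    proposals = ET.SubElement(div, tei("list"), {"type": "proposals"})
    for proposal in m["proposals"]:
        ET.SubElement(proposals, tei("item")).text = proposal

    # Priority 2: place and date; priority 1: signers, kept in document order.
    closer = ET.SubElement(div, tei("closer"))
    dateline = ET.SubElement(closer, tei("dateline"))
    dateline.text = m["place"] + " "
    ET.SubElement(dateline, tei("date"), {"when": m["date_iso"]}).text = m["date_text"]
    for mp in m["signers"]:
        ET.SubElement(closer, tei("signed")).text = mp
    return root


# Placeholder values, apart from the place and date taken from the example in the list.
example = {
    "year": "2007",
    "chamber": "placeholder-chamber",
    "number": "1",
    "intro_mps": ["MP A"],
    "body": ["Body text of the motion."],
    "proposals": ["att ..."],
    "place": "Stockholm",
    "date_text": "den 3 oktober 2007",
    "date_iso": "2007-10-03",
    "signers": ["MP A", "MP B"],
}
root = build_motion(example)
print(ET.tostring(root, encoding="unicode"))
```

Going through the whiteboard items one by one against the actual ParlaClarin guidelines would show which of these placeholder choices hold up.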

Here is a photo from the motion workshop in June:

[Image: photo from the motion workshop]

MansMeg commented 12 months ago

The example you showed was for protocols, not for motions?

BobBorges commented 12 months ago

Sure. The whole ParlaClarin is about protocols -- from their landing page:

a TEI customisation for annotating parliamentary debates

I have to compare point for point with the photo, but my point is that most, if not all, of what we want to annotate seems to be already implemented in that schema.

ninpnin commented 11 months ago

Not only do we need to add elements, we need to do everything ourselves. There are no schemas, there is no format to store the actual content, no tests that we get for free, no knowledge from working with TEI for years, no documentation, no community to rely on.

AFAIK, ParlaClarin doesn't add any elements of its own to TEI; it just defines what each element type means in the context of parliamentary debates. For parliamentary debates, that is only one part of what we would need to do for motions if we decide to extend the Riksdagen Motion format. The elements, the schemas, etc. come with the TEI package.

BobBorges commented 11 months ago

Are we bound to XML?

MansMeg commented 11 months ago

I would say yes. We should keep to as few formats as possible.

BobBorges commented 11 months ago

Another (last) point from me, which is pro-ParlaClarin / con-Riksdagen XML, is about consistency in our corpus: to me it would look and feel very strange to have such wildly different formats (schema/no schema, tagsets, attributes, document organization) within the same corpus.

I'm interested in what the tech-advisory board says.

MansMeg commented 10 months ago

Decision made: We go with the TEI ParlaClarin format, but we will test that we can transform the TEI format to the Riksdagen open data XML format, so that we don't lose any data.
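One way such a no-data-loss test could look is sketched below in Python. It is only an illustration under stated assumptions: `to_flat` is a hypothetical stand-in for the real TEI-to-Riksdagen converter, the `<dokument>` target layout is invented for the example, and the check simply compares the text tokens (element text, tails and attribute values) found before and after conversion.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_tokens(root):
    """Count whitespace-separated tokens in all element text, tails and attribute values."""
    counts = Counter()
    for el in root.iter():
        for chunk in (el.text, el.tail, *el.attrib.values()):
            if chunk:
                counts.update(chunk.split())
    return counts


def to_flat(source_root):
    """Hypothetical converter: copy the text of each leaf element into a flat record."""
    doc = ET.Element("dokument")
    for el in source_root.iter():
        if len(el) == 0 and el.text and el.text.strip():
            ET.SubElement(doc, el.tag).text = el.text.strip()
    return doc


def lost_tokens(source_root, target_root):
    """Return the tokens present in the source but missing from the target."""
    return text_tokens(source_root) - text_tokens(target_root)


# Tiny placeholder document; the content is invented for the example.
source = ET.fromstring(
    "<div type='motion'><head>Motion 1</head>"
    "<p>Body text of the motion.</p>"
    "<signed>MP A</signed></div>"
)
target = to_flat(source)
missing = lost_tokens(source, target)
print(missing)  # Counter({'motion': 1}): the type attribute was dropped
```

Here the check flags that the naive converter dropped the `type="motion"` attribute value, which is the kind of silent loss the test is meant to catch. A real test would run over the whole motion corpus and would probably also compare structure, not just tokens.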