welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Missing record id #485

Closed MansMeg closed 8 months ago

MansMeg commented 8 months ago

I need to extract the record ids from the files. In some files the id exists, as it should, as an xml:id attribute on the TEI node: <TEI xml:id="prot-1896--ak--42">

But this is not the case in all records. We should add this to all records and also add a unit test to ensure that it exists throughout.
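
A minimal sketch of what such a check could look like, assuming the records are plain XML files whose root element is the TEI node. The helper name `records_missing_id` and the file names are hypothetical and not taken from the repository; only the standard library is used.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def records_missing_id(paths):
    """Return the paths whose root element has no (or an empty) xml:id."""
    missing = []
    for path in paths:
        root = ET.parse(path).getroot()
        if not root.get(XML_ID):
            missing.append(path)
    return missing


# Demo on two throwaway files; the file names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.xml"
    bad = Path(tmp) / "bad.xml"
    good.write_text('<TEI xml:id="prot-1896--ak--42"><text/></TEI>', encoding="utf-8")
    bad.write_text("<TEI><text/></TEI>", encoding="utf-8")
    missing_names = [p.name for p in records_missing_id([good, bad])]
    print(missing_names)
```

A unit test could then assert that the returned list is empty for every record file in the corpus.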

BobBorges commented 8 months ago

It seems like most of the protocols don't have this ID.

MansMeg commented 8 months ago

No. Can you add them? I think it is quite easy, now that you have a test.

BobBorges commented 8 months ago

Can you use the head element instead?

This is in every protocol already.

MansMeg commented 8 months ago

Hmm. That's good. I could use that, but then we should rename head to id. The point is that we need a record id, and it must be crystal clear that this is the id for the record.

Maybe we should move head to id, or rename head to id?

BobBorges commented 8 months ago

head is part of the TEI namespace. It could be in both places, I guess.

MansMeg commented 8 months ago

Yes. I think head is not crystal clear for a record id.

MansMeg commented 8 months ago

Or maybe make it even clearer. Can we add a record_id under head in the preface?

ninpnin commented 8 months ago

I think we should use the xml:id attribute of the TEI element for this. See the official Parla-CLARIN example:

<TEI xml:id="document.id" xml:lang="en">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <!-- There are no rules on how these titles should be written -->
            <title>The parliament of the Republic of Slovenia</title>
            <title>Continuation of the second session</title>
            <title>30th January 2011</title>

The head element should be a header for the protocol anyway, but I don't think we need to change it now:

<text>
   <front>
      <div type="preface">
         <!-- text before speeches started -->
         <head>THE PARLIAMENT OF THE REPUBLIC OF SLOVENIA</head>
         <head>Continuation of the second session</head>
         <docDate when="2011-01-30">30th January 2011</docDate>
      </div>
   </front>
   <body>

MansMeg commented 8 months ago

I think this makes sense. It is clear that we should use the xml:id attribute of the TEI element. Good catch there @ninpnin.
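
For reference, reading the id back out could look roughly like this. It is only a sketch: the function name `get_record_id` is hypothetical, and it assumes the root element is TEI, with or without the TEI namespace declared.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def get_record_id(xml_text):
    """Return the xml:id of the root TEI element, or None if it is absent."""
    root = ET.fromstring(xml_text)
    # Accept a root named TEI with or without the TEI namespace.
    if root.tag not in ("TEI", f"{{{TEI_NS}}}TEI"):
        return None
    return root.get(f"{{{XML_NS}}}id")


with_id = '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="prot-1896--ak--42"><text/></TEI>'
without_id = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text/></TEI>'

print(get_record_id(with_id))
print(get_record_id(without_id))
```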

ninpnin commented 8 months ago

https://github.com/swerik-project/riksdagen-records/pull/4

MansMeg commented 8 months ago

Excellent!