Here I use existing code (scripts/split_into_sections.py) to divide up the unicameral period protocols into sections (based on the § character), delimited by <div> elements.
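As a rough illustration of the idea (not the actual contents of scripts/split_into_sections.py, which are not shown in this thread), a §-based split could look like the sketch below. It assumes non-namespaced <note> elements directly under the body, and the regular expression for section heads is a guess based on the samples in the diff ("§ 16", "18 §", "3§").

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: a note opens a section if it begins with "§ 12" or "12 §".
SECTION_HEAD = re.compile(r"^\s*(?:§\s*\d+|\d+\s*§)")


def split_into_sections(body):
    """Regroup the children of `body` into <div> elements.

    A new <div> is opened at every <note> whose text starts with a section
    marker; everything before the first marker lands in a leading <div>.
    """
    children = list(body)
    for child in children:
        body.remove(child)

    current = None
    for child in children:
        starts_section = child.tag == "note" and SECTION_HEAD.match(child.text or "")
        if current is None or starts_section:
            current = ET.SubElement(body, "div")
        current.append(child)
    return body
```

Real protocol files use the TEI namespace, so the tag comparison would need the namespaced name there.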
I also sneak in a script (scripts/git-add_diff-sample.py) that should work in tandem with @ninpnin's sample-git-diffs to quickly git add the files that were sampled from the diff.
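A minimal sketch of what such a helper might do, assuming the sample is a text listing in the form used in this thread (a file path line followed by a "Diff starting from line N" line). The function names here are hypothetical, and this is not the actual scripts/git-add_diff-sample.py.

```python
import subprocess


def sampled_paths(sample_text):
    """Collect the unique .xml paths from a sample listing, in order."""
    paths = []
    for line in sample_text.splitlines():
        line = line.strip()
        if line.endswith(".xml") and line not in paths:
            paths.append(line)
    return paths


def git_add(paths, dry_run=True):
    """Build the `git add` command; only run it when dry_run is False."""
    cmd = ["git", "add", "--", *paths]
    if not dry_run and paths:
        subprocess.run(cmd, check=True)
    return cmd
```

Keeping dry_run=True by default lets you inspect the command before anything is staged.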
@@ -8459,6 +8485,8 @@
<note xml:id="i-TXZbZyBcyHLKuS4xePmiYu">
Punkterna C-E Kammaren biföll vad utskottet i dessa punkter hemställt.
</note>
+ </div>
+ <div type="debateSection">
<note xml:id="i-TrSoYcWDyDqUdbqbHzSBf1">
§& 14 En analys av alkoholoch narkotikamissbrukets utveckling
på längre sikt, m. m.
@@ -11359,12 +11389,16 @@
<note xml:id="i-KshmEUYZVqJvnqouza5jW7">
Överläggningen var härmed slutad.
</note>
+ </div>
+ <div type="commentSection">
<note xml:id="i-A7baHPogjv4bSra7v7PAuj">
§ 16 Fru tredje vice talmannen meddelade att på föredragningslistan
för kammarens nästkommande sammanträde skulle näringsutskottets
betänkanden nr 73, 53 och 54 i nu angiven ordning uppföras främst
bland två gånger bordlagda ärenden.
</note>
+ </div>
+ <div type="commentSection">
<note xml:id="i-F9zYVTGxwmMetRGdf8CaXf">
§ 17 Anmäldes och bordlades Utbildningsutskottets betänkande
1975/76:33 med anledning av propositionen 1975/76:118 om hemspråksundervisning
@@ -348,6 +382,8 @@
från valet till dess nytt val förrättats under början av nästa
valperiod:
</note>
+ </div>
+ <div type="commentSection">
<note xml:id="i-GqbUmgnrDRkBfMQQob3JdE">
18 § Justering av protokoll
</note>
@@ -1927,6 +1941,8 @@
för tisdagen den 11 oktober i ärende om subsidiaritetsprövning
av EU-förslag inkommit från konstitutionsutskottet.
</note>
+ </div>
+ <div type="commentSection">
<note xml:id="i-YDQoWL6d7NKrJ5rg8sjvFv">
8 § Hänvisning av ärenden till utskott
</note>
@@ -131,6 +145,8 @@
2018/19:FPM9 Gemensamt meddelande om förbindelserna mellan Europa
och Asien – byggstenar för en EU-strategi JOIN(2018) 31 till trafikutskottet
</note>
+ </div>
+ <div type="commentSection">
<note xml:id="i-GFTtu2wnzSNkbP7VPWxUYa">
§ 8 Anmälan om granskningsrapport
</note>
@@ -7241,6 +7313,8 @@
<note xml:id="i-9yzgVCyLV5Md3um8MSVQjW">
(Beslut skulle fattas den 23 februari.)
</note>
+ </div>
+ <div type="debateSection">
<note xml:id="i-K3Fid1jwQi92pU28ysPUev">
§ 12 Svar på interpellation 2021/22:321 om inrättandet av en
kriskommission om LSS
Also, commentSection does not really make sense semantically. I would go with debateSection and otherSection for now.
Edit: I saw this is the standard in ParlaMint, but it hurts my eyes. I would create our own sections here anyway, simply because I think we will want more elaborate sectioning further down the line.
I think we should decide on a preliminary idea of how to adjust the divs now, and I can implement it before we commit changes to the whole unicameral period. My thoughts:
- the first <div> element under <body> should probably not be tagged as a debate section
- debate section divs have a type attrib with debate_ as a general value, and we can specify further as we go, e.g., debate_interpellationDebate and debate_interpellationQuestion
- commentSection should probably be other or something generic for the time being to signal !debate
I just talked it over with @ninpnin -- we'll leave the commentSection/debateSection for now. It's easy enough to change later. ParlaClarin specifies a subtype attribute, so that solves my main issue about classifying types of debates.
I see one check mark on an incorrect <div> -- who should check the rest so we can get on with this?
Fair enough. Long term we probably want this information in tables anyways. Hence we should add IDs to the div tags just as we have for the notes and utterances.
That's reasonable -- do you want to check the divs are correct enough first? I think it's a short script to add an id to the div tags -- we have a uuid generator function in the pyriksdagen module.
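For illustration, a short script of that kind might look like the sketch below. The uuid generator in pyriksdagen is not shown in this thread, so make_id() here is a hypothetical stand-in that base58-encodes a UUID and adds the "i-" prefix seen on existing ids. Tags are assumed to be non-namespaced.

```python
import uuid
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def make_id():
    """Hypothetical stand-in for the project's own id generator."""
    number = uuid.uuid4().int
    chars = []
    while number:
        number, remainder = divmod(number, 58)
        chars.append(BASE58[remainder])
    return "i-" + "".join(reversed(chars))


def add_div_ids(root):
    """Give every <div> without an xml:id a fresh one; return how many were added."""
    added = 0
    for div in root.iter("div"):
        if XML_ID not in div.attrib:
            div.set(XML_ID, make_id())
            added += 1
    return added
```

Existing ids are left untouched, so rerunning the script does not churn the diff.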
Having thought about this a little longer: if we remove type from the tags later, we would actually be changing the API. So we should try to avoid that and fix this right away. I also think MetaSolutions was quite clear that the data should just include IDs to simplify linking and adding metadata.
Hence, we should do this right away. I don't think it's much work. This would mean:
- Create a CSV file (called record_divisors.csv?) with the columns div_id and type. I'm not sure in which folder we should store this.
I think this is a fundamentally different approach than what we have done so far.
So far, we have had a lot of annotations in the XML files. That's what ParlaClarin is for. Otherwise we would use tabular data, eg. CSVs for text too.
My current gut feeling is that our current approach works better with git.
Either way, I don't think we should add a new CSV now. Either we continue with our current approach, or change to a tabular structure later after more planning.
That is true. I think we get some conflicting best practices here: ParlaClarin as a format, and MetaSolutions' recommendations regarding IDs and linked data.
I agree with MetaSolutions long term, but you are right. Let's keep this as small as possible. Although we need to add an ID to all elements anyway, since we are going to need to take samples of sections.
@@ -353,6 +353,8 @@
<note xml:id="i-PnUNJn84bxmRD9K6GUAbZf">
suppleant i utbildningsutskottet Sonia Thomasson (vpk)
</note>
+ </div>
+ <div type="commentSection" xml:id="i-S1pessFPXLM6rWYzEH1QjS">
<note xml:id="i-PBCqXtv8qLCcE4P18gCuVF">
3§ Talmannen meddelade att Ingemar Konradsson (s) denna dag återtagit
sin plats i riksdagen, varigenom Ulla-Britt Carlssons uppdrag
@@ -3175,15 +3183,21 @@
<note xml:id="i-9YZ5JzboDwZ4z63771xjK3">
Överläggningen var härmed avslutad.
</note>
+ </div>
+ <div type="commentSection" xml:id="i-BsCQM62oikNZ3ioKPCXuVk">
<note xml:id="i-DbjQrUu8GsGVNnAbuZjLbi">
11 § På förslag av talmannen beslöt kammaren kl. 11.10 att ajournera
sina förhandlingar till kl. 14.00, då de till dagens bordläggning
anmälda utskottsbetänkandena väntades föreligga.
</note>
+ </div>
+ <div type="commentSection" xml:id="i-BdgMumfDWQE9NA242JctvF">
<note xml:id="i-BGwYLbyW36NMRKdpzngTAC">
12 § Förhandlingarna återupptogs kl. 14.00 under ledning av förste
vice talmannen.
</note>
+ </div>
+ <div type="commentSection" xml:id="i-NDrQvVL2ZQjE8AcX4cttKZ">
<note xml:id="i-7upkPfaSsBkcRFxuFV6S8a">
13 § Anmäldes och bordlades Proposition 1983/84:128 Förslag till
lag om företagshypotek m. m.
@@ -320,6 +336,8 @@
AU1 samt näringsutskottets betänkanden NU1, NU2 och NU3 skulle
avgöras i ett sammanhang efter avslutad debatt.
</note>
+ </div>
+ <div type="debateSection" xml:id="i-3F8a7BPWWUveovCAFDHV9G">
<note xml:id="i-VURidbF1UszbSTjSQCsGf6">
9 § Ekonomisk trygghet vid arbetslöshet samt arbetsmarknad och
arbetsliv
There is still the problem that <pb> tags become a section. This should be easy to fix?
I don't follow.
Also, an innehållsförteckning (table of contents) seems to incorrectly end up in a large number of sections. Is this easy to fix?
After merging this it's what I wanted to do first after taking a first crack at identifying the interpellation debates. I don't think it would be too difficult, but you never know until you actually start doing it.
At this stage -- given it's the first kind of attempt at creating sections -- unless there is something really bad, i.e. that worsens the quality of the data/work we've already done (which I don't see in the sample or in other edits), then we should accept this round of div additions.
I see many things that could be better, but I don't think we will get it all right at once. Some incorrect section delimitation is an improvement over no section delimitation.
- moving solo <pb> elems (that's what I didn't get before) into adjacent divs
- joining stray solo sections under a unified table of contents
- finding additional section delimitation
  - by missing nrs in the sequence, and / or
  - finding the end of a real section before the end of the div, e.g.:
...can all be done in steps (minimal PR!), but if we sit on this for too long it blocks me from categorizing the debates
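To make the first item in the list above concrete, here is a minimal sketch of moving lone page-break elements into an adjacent div. It assumes the stray elements are <pb> (the thread refers to "the pb thing"), that tags are non-namespaced, and that the right target is the following div. None of this is confirmed by the actual fix.

```python
import xml.etree.ElementTree as ET


def merge_pb_only_divs(body):
    """Move the contents of divs that hold only <pb> into the next div.

    Returns the number of divs removed. A pb-only div at the very end of
    the body has no following div and is left alone.
    """
    divs = [element for element in body if element.tag == "div"]
    removed = 0
    for div, following in zip(divs, divs[1:]):
        if len(div) > 0 and all(child.tag == "pb" for child in div):
            # Insert in reverse so the page breaks keep their original order.
            for pb in reversed(list(div)):
                following.insert(0, pb)
            body.remove(div)
            removed += 1
    return removed
```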
I fully agree.
1) I fully agree that we should do minimal PRs. That said, fixing the <pb> tag seems so small that it is just a quick fix (a couple of lines of code), so we might as well just fix it, right? The other issues seem to need some additional work.
2) The revision control: we need to check that these divs are correct, which includes the debateSection and commentSection. I guess we only check that the debateSection contains a real debate (or a section of a debate), right?
No, it's not that much work to fix stray <pb> elems in a section, but...
2.
2.1. We wanted to do this quality control before committing edits to the whole set of protocols for reasons of economy. So either we approve what's here and I can commit it, then fix the pb thing with another commit (before merging the PR), or I can fix it now in the already modified files, but then we conflate 'types' of edits in one piece of the revision history.
2.2. Debate sections have intros, comment sections don't -- it seems like a reasonable criterion for evaluation. Should I check that? in the sample? I'd like to be able to take this a step or two forward today.
I was thinking of fixing all <pb>, not just the stray ones. An estimate is that roughly 2% of all edits are due to this problem?
2.a. I'm not sure I followed. I just checked for obvious errors and found those. If we fix those, we can get a new sample to assess. That should not conflate anything or be problematic?
2.b. Great. I just wanted to know. Then it seems good to check the debates based on this definition, and to check that the commentSections are not incorrect and that no incorrect divs are introduced.
But this raises an issue: we need to start defining divs in a better way, because this sits somewhere between an analytic decision and a data-authentic one, and we want to be as close to the latter as possible.
I've gone through them now: mostly they're ok. Marked correct if:
- div elem has id
- schematically correct
- debates have intros / comment sections have no intros
It looks like 6 are incorrect by those criteria. The incorrect ones are due to lone <pb> elems in a div, or to the content of the table of contents section getting tagged as section head and intros. I'll commit the rest of the protocols, then let's merge and I'll open issues for these two problems.
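The intro criterion could be checked mechanically along these lines. This is only a sketch: it assumes intros are marked as <note type="speaker"> and that tags are non-namespaced, neither of which is stated in this thread.

```python
import xml.etree.ElementTree as ET


def intro_rule_ok(div):
    """Check 'debates have intros, comment sections have none' for one div."""
    # Assumption: an intro is a <note> whose type attribute is "speaker".
    has_intro = any(note.get("type") == "speaker" for note in div.iter("note"))
    kind = div.get("type")
    if kind == "debateSection":
        return has_intro
    if kind == "commentSection":
        return not has_intro
    return False  # unknown or missing type
```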
Sample for quality assessment to follow.
Sampled changes
corpus/protocols/1972/prot-1972--87.xml
Diff starting from line 8485
corpus/protocols/1973/prot-1973--139.xml
Diff starting from line 1519
corpus/protocols/1973/prot-1973--142.xml
Diff starting from line 7874
corpus/protocols/1973/prot-1973--92.xml
Diff starting from line 62
corpus/protocols/1974/prot-1974--116.xml
Diff starting from line 8059
corpus/protocols/1974/prot-1974--67.xml
Diff starting from line 6677
corpus/protocols/1975/prot-1975--99.xml
Diff starting from line 1502
corpus/protocols/197576/prot-197576--138.xml
Diff starting from line 11389
corpus/protocols/197879/prot-197879--82.xml
Diff starting from line 64
corpus/protocols/198384/prot-198384--102.xml
Diff starting from line 65
corpus/protocols/198384/prot-198384--132.xml
Diff starting from line 4692
corpus/protocols/198586/prot-198586--104.xml
Diff starting from line 4141
corpus/protocols/198788/prot-198788--64.xml
Diff starting from line 511
corpus/protocols/198990/prot-198990--18.xml
Diff starting from line 62
corpus/protocols/199091/prot-199091--115.xml
Diff starting from line 6619
corpus/protocols/199192/prot-199192--28.xml
Diff starting from line 6025
corpus/protocols/199394/prot-199394--24.xml
Diff starting from line 446
corpus/protocols/199394/prot-199394--54.xml
Diff starting from line 6308
corpus/protocols/199394/prot-199394--71.xml
Diff starting from line 80
corpus/protocols/199495/prot-199495--35.xml
Diff starting from line 784
corpus/protocols/199495/prot-199495--9.xml
Diff starting from line 97
corpus/protocols/199697/prot-199697--97.xml
Diff starting from line 4602
corpus/protocols/199899/prot-199899--30.xml
Diff starting from line 5572
corpus/protocols/199899/prot-199899--42.xml
Diff starting from line 5188
corpus/protocols/199899/prot-199899--5.xml
Diff starting from line 2186
corpus/protocols/19992000/prot-19992000--126.xml
Diff starting from line 1329
corpus/protocols/200102/prot-200102--110.xml
Diff starting from line 195
corpus/protocols/200304/prot-200304--42.xml
Diff starting from line 10604
corpus/protocols/200506/prot-200506--106.xml
Diff starting from line 1847
corpus/protocols/200506/prot-200506--114.xml
Diff starting from line 4562
corpus/protocols/200607/prot-200607--79.xml
Diff starting from line 112
corpus/protocols/200708/prot-200708--9.xml
Diff starting from line 94
corpus/protocols/200809/prot-200809--101.xml
Diff starting from line 6718
corpus/protocols/201011/prot-201011--127.xml
Diff starting from line 1881
corpus/protocols/201011/prot-201011--5.xml
Diff starting from line 382
corpus/protocols/201112/prot-201112--126.xml
Diff starting from line 3394
corpus/protocols/201112/prot-201112--16.xml
Diff starting from line 1941
corpus/protocols/201213/prot-201213--73.xml
Diff starting from line 18943
corpus/protocols/201314/prot-201314--21.xml
Diff starting from line 2642
corpus/protocols/201314/prot-201314--90.xml
Diff starting from line 8232
corpus/protocols/201415/prot-201415--57.xml
Diff starting from line 11625
corpus/protocols/201516/prot-201516--86.xml
Diff starting from line 88
corpus/protocols/201617/prot-201617--42.xml
Diff starting from line 10092
corpus/protocols/201617/prot-201617--53.xml
Diff starting from line 59
corpus/protocols/201617/prot-201617--6.xml
Diff starting from line 322
corpus/protocols/201718/prot-201718--112.xml
Diff starting from line 11597
corpus/protocols/201819/prot-201819--12.xml
Diff starting from line 145
corpus/protocols/201819/prot-201819--37.xml
Diff starting from line 251
corpus/protocols/201819/prot-201819--81.xml
Diff starting from line 11653
corpus/protocols/202122/prot-202122--70.xml
Diff starting from line 7313
The unit tests are failing?
It's the schema test. Some of the 202122 protocols are empty. I found it just before I went home, so I'm not really sure what the cause is yet.
Seems like it captures page divs: corpus/protocols/201617/prot-201617--53.xml This should be easy to fix, I think.
Also. ParlaMint states that the first note after should be a header, so maybe add that as well?
ParlaMint is the more restrictive version of the two, a strict subset of ParlaClarin. I think we should use it as a suggestion.
In practice: sometimes the header is not available in our data, so I think we shouldn't put too much effort into following that rule.
I suggest we just use UUIDs for the div IDs as well.
The unit test fails because of a couple of protocols in 2021/22 with no body. They're on the riksdag open data; I will fix this in a separate PR.
Does this make sense?
I'm hesitant to merge a PR that doesn't pass the tests, so we should try to fix that ASAP.
Here comes a new sample with id attribs in the divs and the 'empty' protocols from 202122 curated. Let's hope the unit tests pass :D
Sampled changes
corpus/protocols/1972/prot-1972--24.xml
Diff starting from line 3172
corpus/protocols/1973/prot-1973--120.xml
Diff starting from line 65
corpus/protocols/197879/prot-197879--79.xml
Diff starting from line 62
corpus/protocols/197879/prot-197879--90.xml
Diff starting from line 3430
corpus/protocols/197980/prot-197980--41.xml
Diff starting from line 5419
corpus/protocols/197980/prot-197980--56.xml
Diff starting from line 8043
corpus/protocols/198182/prot-198182--31.xml
Diff starting from line 64
corpus/protocols/198283/prot-198283--111.xml
Diff starting from line 353
corpus/protocols/198384/prot-198384--100.xml
Diff starting from line 3183
corpus/protocols/198384/prot-198384--155.xml
Diff starting from line 3523
corpus/protocols/198586/prot-198586--110.xml
Diff starting from line 819
corpus/protocols/198687/prot-198687--73.xml
Diff starting from line 366
corpus/protocols/199091/prot-199091--78.xml
Diff starting from line 59
corpus/protocols/199192/prot-199192--121.xml
Diff starting from line 10376
corpus/protocols/199293/prot-199293--71.xml
Diff starting from line 6983
corpus/protocols/199394/prot-199394--124.xml
Diff starting from line 10470
corpus/protocols/199495/prot-199495--40.xml
Diff starting from line 518
corpus/protocols/199495/prot-199495--76.xml
Diff starting from line 69
corpus/protocols/199899/prot-199899--17.xml
Diff starting from line 4845
corpus/protocols/199899/prot-199899--38.xml
Diff starting from line 336
corpus/protocols/19992000/prot-19992000--112.xml
Diff starting from line 3395
corpus/protocols/200001/prot-200001--35.xml
Diff starting from line 188
corpus/protocols/200001/prot-200001--56.xml
Diff starting from line 15540
corpus/protocols/200001/prot-200001--64.xml
Diff starting from line 59
corpus/protocols/200102/prot-200102--65.xml
Diff starting from line 70
corpus/protocols/200102/prot-200102--79.xml
Diff starting from line 4816
corpus/protocols/200304/prot-200304--25.xml
Diff starting from line 9060
corpus/protocols/200405/prot-200405--101.xml
Diff starting from line 1840
corpus/protocols/200405/prot-200405--49.xml
Diff starting from line 12097
corpus/protocols/200607/prot-200607--105.xml
Diff starting from line 6734
corpus/protocols/200607/prot-200607--111.xml
Diff starting from line 8544
corpus/protocols/200708/prot-200708--112.xml
Diff starting from line 11085
corpus/protocols/200708/prot-200708--138.xml
Diff starting from line 538
corpus/protocols/200809/prot-200809--46.xml
Diff starting from line 561
corpus/protocols/200910/prot-200910--11.xml
Diff starting from line 239
corpus/protocols/200910/prot-200910--145.xml
Diff starting from line 14647
corpus/protocols/201213/prot-201213--110.xml
Diff starting from line 2565
corpus/protocols/201314/prot-201314--106.xml
Diff starting from line 9972
corpus/protocols/201314/prot-201314--92.xml
Diff starting from line 1773
corpus/protocols/201415/prot-201415--121.xml
Diff starting from line 6277
corpus/protocols/201516/prot-201516--102.xml
Diff starting from line 8679
corpus/protocols/201516/prot-201516--118.xml
Diff starting from line 7422
corpus/protocols/201617/prot-201617--132.xml
Diff starting from line 5507
corpus/protocols/201617/prot-201617--26.xml
Diff starting from line 2246
corpus/protocols/201617/prot-201617--29.xml
Diff starting from line 6744
corpus/protocols/201617/prot-201617--71.xml
Diff starting from line 59
corpus/protocols/201718/prot-201718--16.xml
Diff starting from line 2097
corpus/protocols/201819/prot-201819--29.xml
Diff starting from line 896
corpus/protocols/201819/prot-201819--81.xml
Diff starting from line 12005
corpus/protocols/202021/prot-202021--12.xml
Diff starting from line 930
Any ideas how we formally know if it is correct or not?
I guess if the div is not empty, doesn't contain multiple sections, and has the type+id attribs.
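Those three conditions are simple enough to check mechanically. A minimal sketch, assuming non-namespaced tags and the same guessed § pattern for section heads (the helper name and the pattern are not from the repository):

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Hypothetical pattern for a section head such as "§ 12" or "12 §".
SECTION_HEAD = re.compile(r"^\s*(?:§\s*\d+|\d+\s*§)")


def div_problems(div):
    """Return a list of problems found in one <div>; empty means it passes."""
    problems = []
    if len(div) == 0:
        problems.append("empty")
    heads = [n for n in div.iter("note") if SECTION_HEAD.match(n.text or "")]
    if len(heads) > 1:
        problems.append("multiple sections")
    if "type" not in div.attrib:
        problems.append("missing type")
    if XML_ID not in div.attrib:
        problems.append("missing xml:id")
    return problems
```

This only covers the formal side; whether a debateSection really contains a debate still needs a human look at the sample.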
Great! Do you open an issue?
405
@MansMeg will you merge when the tests pass?