Closed MansMeg closed 6 months ago
Splitting the protocols in sections by the §-symbols seems to work pretty well (apart from Första Kammaren, which is a low priority anyway).
The code for it is here:
https://github.com/welfare-state-analytics/riksdagen-corpus/blob/main/scripts/split_into_sections.py
We currently don't use it, at least not for the 1920-2022 era.
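For readers who don't want to open the script, here is a minimal sketch of the general idea of splitting at §-symbols. It is not the code from `split_into_sections.py`; the function name, the regex, and the example paragraphs are all assumptions made for illustration.

```python
import re

# Assumption: a section header is a paragraph that starts with "§" and a number.
SECTION_START = re.compile(r"^\s*§\s*\d+")


def split_into_sections(paragraphs):
    """Group paragraphs into sections, opening a new section at each § header."""
    sections = []
    current = []
    for paragraph in paragraphs:
        # Close the running section when a new header shows up.
        if SECTION_START.match(paragraph) and current:
            sections.append(current)
            current = []
        current.append(paragraph)
    if current:
        sections.append(current)
    return sections


# Made-up example paragraphs, not taken from the corpus.
example = [
    "Protokoll 1975:12",
    "§ 1 Justering av protokoll",
    "Protokollet justerades.",
    "§ 2 Svar på interpellationer",
    "Herr talman!",
]
sections = split_into_sections(example)
```

Anything before the first § header ends up in its own leading section, so front matter is kept rather than dropped.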
In the unicameral era, the exact phrase "Svar på interpellationer" seems to yield decent results too:
corpus_year | Number of "var på inter" observed |
---|---|
corpus/protocols/1970 | 738 |
corpus/protocols/1971 | 141 |
corpus/protocols/1972 | 132 |
corpus/protocols/1973 | 125 |
corpus/protocols/1974 | 120 |
corpus/protocols/1975 | 76 |
corpus/protocols/197576 | 172 |
corpus/protocols/197677 | 131 |
corpus/protocols/197778 | 183 |
corpus/protocols/197879 | 232 |
corpus/protocols/197980 | 310 |
corpus/protocols/1980 | 0 |
corpus/protocols/198081 | 183 |
corpus/protocols/198182 | 391 |
corpus/protocols/198283 | 328 |
corpus/protocols/198384 | 388 |
corpus/protocols/198485 | 366 |
corpus/protocols/198586 | 377 |
corpus/protocols/198687 | 378 |
corpus/protocols/198788 | 437 |
corpus/protocols/198889 | 388 |
corpus/protocols/198990 | 1474 |
corpus/protocols/199091 | 181 |
corpus/protocols/199192 | 183 |
corpus/protocols/199293 | 154 |
corpus/protocols/199394 | 134 |
corpus/protocols/199495 | 137 |
corpus/protocols/199596 | 235 |
corpus/protocols/199697 | 333 |
corpus/protocols/199798 | 272 |
corpus/protocols/199899 | 359 |
corpus/protocols/19992000 | 379 |
corpus/protocols/200001 | 418 |
corpus/protocols/200102 | 426 |
corpus/protocols/200203 | 380 |
corpus/protocols/200304 | 468 |
corpus/protocols/200405 | 584 |
corpus/protocols/200506 | 442 |
corpus/protocols/200607 | 598 |
corpus/protocols/200708 | 691 |
corpus/protocols/200809 | 501 |
corpus/protocols/200910 | 388 |
corpus/protocols/201011 | 374 |
corpus/protocols/201112 | 385 |
corpus/protocols/201213 | 433 |
corpus/protocols/201314 | 448 |
corpus/protocols/201415 | 1236 |
corpus/protocols/201516 | 1295 |
corpus/protocols/201617 | 964 |
corpus/protocols/201718 | 1001 |
corpus/protocols/201819 | 438 |
corpus/protocols/201920 | 732 |
corpus/protocols/202021 | 1436 |
corpus/protocols/202122 | 699 |
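A rough sketch of how counts like these could be reproduced, assuming one directory per corpus year with XML files inside. The function names and the `*.xml` pattern are assumptions, and this counts raw substring occurrences, which may not be exactly how the numbers above were produced.

```python
from pathlib import Path

# Substring from the table header; it matches both "Svar" and "svar".
NEEDLE = "var på inter"


def count_in_texts(texts, needle=NEEDLE):
    """Sum the non-overlapping occurrences of `needle` over an iterable of strings."""
    return sum(text.count(needle) for text in texts)


def count_by_year(protocols_root, pattern="*.xml"):
    """Map each year directory under `protocols_root` to its occurrence count."""
    counts = {}
    for year_dir in sorted(Path(protocols_root).iterdir()):
        if not year_dir.is_dir():
            continue
        # Read files lazily so only one protocol is held in memory at a time.
        texts = (
            path.read_text(encoding="utf-8")
            for path in sorted(year_dir.glob(pattern))
        )
        counts[year_dir.name] = count_in_texts(texts)
    return counts
```

Calling `count_by_year("corpus/protocols")` from the repository root would then give one count per year directory, assuming the corpus is checked out at that path.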
Thanks for posting these! My initial idea was just some kind of combination of these two strategies.
BTW, do we have longer-term ambitions to chunk up the protocols into categorized \
Yes, in the long term, we want to chunk the records into debates, such as @ninpnin showed during the meeting. I'm not sure about using a div element or attributes "header" on the notes for the headers. So maybe keep that in mind. How do we get this structure (the headers) into the corpus? Ideally, for the whole corpus.
I talked to @ninpnin yesterday about this and planned to open a PR of chunked unicameral protocols (with div) today as a kind of pre-step to identifying interpellation debates.
My feeling right now is that delimiting sections with \
@MansMeg are there arguments against the div strategy?
For the unicameral period, the code is ready to try making divs. It might also work for Andra kammaren, but not for Första kammaren (they don't use "§").
ParlaClarin / ParlaMint suggest using divs: https://clarin-eric.github.io/ParlaMint/#chp-div
I don't think there are any downsides to this approach. We should check the quality of the divs, though. But that's a second step that could easily be done, I think. This is what I meant might be a quicker way. So I'm all for this. Maybe start with the unicameral era to keep the PRs manageable?
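To make the div idea concrete, here is a small sketch that wraps the children of a body element into `div` elements, opening a new one at each §-header. The element names (`body`, `note`, `u`), the header test, and the function name are assumptions for illustration, and namespaces are ignored; this is not the code from the PR.

```python
import xml.etree.ElementTree as ET


def wrap_in_divs(body, is_section_start):
    """Regroup the children of `body` into <div> elements, one per section."""
    divs = []
    current_div = None
    for child in list(body):
        body.remove(child)
        # Open a new div at every header, and for any leading non-header content.
        if current_div is None or is_section_start(child):
            current_div = ET.Element("div")
            divs.append(current_div)
        current_div.append(child)
    body.extend(divs)
    return body


def starts_with_paragraph_sign(element):
    """Assumed header test: a <note> whose text begins with the § symbol."""
    return element.tag == "note" and (element.text or "").lstrip().startswith("§")


# Made-up miniature protocol body.
body = ET.fromstring(
    "<body>"
    "<note>§ 1 Justering av protokoll</note><u>Protokollet justerades.</u>"
    "<note>§ 2 Svar på interpellationer</note><u>Herr talman!</u>"
    "</body>"
)
wrap_in_divs(body, starts_with_paragraph_sign)
```

Because the header test is passed in as a function, the same wrapper could be reused with a different rule for chambers that don't use "§".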
There's an open PR with sectioning --> \
Potentially sufficient to start labeling sections, but I wait for some of you to look at the sample.
Some small win on this issue: I grabbed a list of all the IP questions from riksdag open data (1998-- ), and checked the question numbers of the ones that have the status as answered (besvarad) against the debate sections that I identified as IP debates (1998--2021/22) and only 2 of 4025 aren't captured by the sectioning. It's less clear to me how to interpret some of the other values (korrekturläst, Skickad), but if we include them, catching the interpellation debates by searching section headers in our new divs remains above 95%.
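The check described here boils down to a set comparison. Below is a minimal sketch under assumptions: the function name, the ID format, and the toy data are made up, and only "besvarad" and "skickad" are status values taken from this thread.

```python
def coverage(ids_by_status, captured_ids, statuses=("besvarad",)):
    """Return (share captured, missed ids) for questions with the given statuses."""
    wanted = set()
    for status in statuses:
        wanted.update(ids_by_status.get(status, ()))
    missed = wanted - set(captured_ids)
    share = 1 - len(missed) / len(wanted) if wanted else 0.0
    return share, missed


# Made-up toy data; the ID format is an assumption.
ids_by_status = {
    "besvarad": {"1998/99:1", "1998/99:2", "1998/99:3", "1998/99:4"},
    "skickad": {"1998/99:5"},
    "withdrawn": {"1998/99:6"},  # placeholder label, not an actual status value
}
# Question numbers found in the section headers tagged as IP debates.
captured = {"1998/99:1", "1998/99:2", "1998/99:3", "1998/99:5"}

share_answered, missed_answered = coverage(ids_by_status, captured)
share_wide, missed_wide = coverage(ids_by_status, captured, ("besvarad", "skickad"))
```

Returning the missed IDs alongside the share makes it easy to look up the uncaptured debates by hand.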
This is promising. Maybe we can use this as quality control as well?
I'm not sure what the different types mean. Maybe @DrJosefsson knows which part is the interesting one? I guess "besvarad" is the important one?
Great! Just to make sure I understand this correctly: you find 96 % of the interpellation debates connected to the interpellations that are categorized as besvarad OR korrekturläst OR skickad. That sounds really good to me. But I don't get why not all of those three categories are just categorized as "besvarade" - I don't get the difference between the categories. I would say that as long as they are debated in the chamber we are interested in all three categories.
The ones that are withdrawn by the legislator who initially wrote the question should not be debated.
> you find 96 % of the interpellation debates connected to the interpellations that are categorized as besvarad OR korrekturläst OR skickad

Exactly, for the period 1998 -- 2022, for which riksdag open data has published records. For some of those years the status column is empty, hence the 3k NaN (== empty) values. But if these three categories are the ones that indicate a question was debated, then we're on the right track.
Sounds good to me!
@DrJosefsson @joeri450 -- I checked your manual annotations against the automated extraction, and 5 of 165 (3.03%) are incorrect, if we consider that formally "fråga" is not an IP debate. The good news is those 5 are all false negatives, meaning I didn't tag an intro as part of an IP debate when it actually is part of one, and those 5 are all in the 1970s, so we knew about that and I will fix the 1970s soon. No instances of calling an intro part of an IP debate when it's something else.
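A comparison like this can be sketched as a small confusion count. The function names, the intro IDs, and the toy labels below are made up for illustration and are not the annotation data.

```python
def compare(manual, automated):
    """Count agreement between manual labels and automated IP-debate tags."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for intro_id, is_ip_debate in manual.items():
        # An intro missing from the automated output counts as "not tagged".
        predicted = automated.get(intro_id, False)
        if is_ip_debate and predicted:
            counts["true_pos"] += 1
        elif is_ip_debate:
            counts["false_neg"] += 1
        elif predicted:
            counts["false_pos"] += 1
        else:
            counts["true_neg"] += 1
    return counts


def error_rate(counts):
    """Share of annotated intros where the automated tag disagrees."""
    total = sum(counts.values())
    return (counts["false_pos"] + counts["false_neg"]) / total if total else 0.0


# Made-up toy labels: True means "this intro belongs to an IP debate".
manual = {"a": True, "b": True, "c": False, "d": False}
automated = {"a": True, "b": False, "c": False, "d": False}
counts = compare(manual, automated)
rate = error_rate(counts)
```

Keeping false negatives and false positives apart matters here, since the thread reports only the former.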
Great, thanks @BobBorges !
Ok. So, we are close to 100% accuracy now. I guess it is up to you @DrJosefsson, if this is good enough. I guess @BobBorges also needs to do some final fixes on the 1970s. I guess we can discuss the next steps tomorrow?
We want to extract the interpellation debates from the corpus. We believe these interpellation debates exist in the corpus during the whole period. So this task would include:
- Extract interpellation debates
- Quality control: we need to monitor how well we extract interpellation debates and simple questions.