Extract interpellation debates from the corpus

We want to extract the interpellation debates from the corpus. We believe these interpellation debates exist in the corpus throughout the whole period. The task includes:

Extract interpellation debates

[x] Identify the start and end of the section containing interpellation debates using a quick approach (regexp?).

Quality Control

We need to monitor how well we extract interpellation debates and simple questions.

[x] Generate a stratified random sample of 2 introductions/speeches per year, each with a link to a page where Josefin and @DrJosefsson can check the source and annotate whether it is part of an interpellation debate.
[x] #415
[x] We cross-reference the data with the open data from the parliament, which has had this information since around 1994. We will verify that we are capturing roughly the same information.
[x] Josefin and @DrJosefsson extract the true number of interpellation debates during the unicameral era per year to be used for comparison.
[ ] Visualize the number of speeches in interpellation debates by the minister's gender and parliamentary year.
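
A minimal sketch of the stratified sampling step in the checklist above, assuming the introductions/speeches are available as (year, id) pairs. The function name, the id format, and the toy data are hypothetical and not taken from the repository.

```python
import random
from collections import defaultdict


def stratified_sample(items, per_year=2, seed=0):
    """Draw up to `per_year` ids for each year from (year, item_id) pairs."""
    by_year = defaultdict(list)
    for year, item_id in items:
        by_year[year].append(item_id)

    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    sample = {}
    for year in sorted(by_year):
        ids = sorted(by_year[year])
        k = min(per_year, len(ids))  # a year may have fewer than `per_year` items
        sample[year] = rng.sample(ids, k)
    return sample


# Hypothetical toy data; real ids would come from the corpus.
items = [("1975", "i-1"), ("1975", "i-2"), ("1975", "i-3"), ("1976", "i-4")]
sample = stratified_sample(items)
```

Fixing the seed keeps the annotation sheet stable if the sample has to be rebuilt later.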

Splitting the protocols in sections by the §-symbols seems to work pretty well (apart from Första Kammaren, which is a low priority anyway).

The code for it is here:

https://github.com/welfare-state-analytics/riksdagen-corpus/blob/main/scripts/split_into_sections.py

We currently don't use it, at least not for the 1920-2022 era.
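
To illustrate the idea of splitting on §-symbols, here is a rough sketch. It is not the code in split_into_sections.py; the regular expression and the assumption that a section starts on a line beginning with "§ N" or "N §" are guesses made for this example.

```python
import re

# Assumption: a new section starts on a line that begins with "§ 2" or "2 §".
SECTION_START = re.compile(r"^\s*(?:§\s*\d+|\d+\s*§)")


def split_into_sections(lines):
    """Group lines into sections, starting a new one at every §-header line."""
    sections = []
    current = []
    for line in lines:
        if SECTION_START.match(line) and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)
    return sections


# Hypothetical protocol excerpt.
lines = [
    "Protokoll preamble",
    "§ 1 Justering av protokoll",
    "Some text",
    "§ 2 Svar på interpellationer",
    "Speech text",
]
sections = split_into_sections(lines)
```

Text before the first § ends up in its own leading section, which is convenient for preambles.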

In the unicameral era, the exact phrase "Svar på interpellationer" seems to yield decent results too:

corpus_year	Number of "var på inter" observed
corpus/protocols/1970	738
corpus/protocols/1971	141
corpus/protocols/1972	132
corpus/protocols/1973	125
corpus/protocols/1974	120
corpus/protocols/1975	76
corpus/protocols/197576	172
corpus/protocols/197677	131
corpus/protocols/197778	183
corpus/protocols/197879	232
corpus/protocols/197980	310
corpus/protocols/1980	0
corpus/protocols/198081	183
corpus/protocols/198182	391
corpus/protocols/198283	328
corpus/protocols/198384	388
corpus/protocols/198485	366
corpus/protocols/198586	377
corpus/protocols/198687	378
corpus/protocols/198788	437
corpus/protocols/198889	388
corpus/protocols/198990	1474
corpus/protocols/199091	181
corpus/protocols/199192	183
corpus/protocols/199293	154
corpus/protocols/199394	134
corpus/protocols/199495	137
corpus/protocols/199596	235
corpus/protocols/199697	333
corpus/protocols/199798	272
corpus/protocols/199899	359
corpus/protocols/19992000	379
corpus/protocols/200001	418
corpus/protocols/200102	426
corpus/protocols/200203	380
corpus/protocols/200304	468
corpus/protocols/200405	584
corpus/protocols/200506	442
corpus/protocols/200607	598
corpus/protocols/200708	691
corpus/protocols/200809	501
corpus/protocols/200910	388
corpus/protocols/201011	374
corpus/protocols/201112	385
corpus/protocols/201213	433
corpus/protocols/201314	448
corpus/protocols/201415	1236
corpus/protocols/201516	1295
corpus/protocols/201617	964
corpus/protocols/201718	1001
corpus/protocols/201819	438
corpus/protocols/201920	732
corpus/protocols/202021	1436
corpus/protocols/202122	699
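
A count like the one in the table could be produced roughly as follows. This is a sketch under the assumption that each year directory contains protocol files ending in .xml; the function name is hypothetical, and the demo runs on a temporary directory rather than the real corpus.

```python
import tempfile
from pathlib import Path


def count_phrase_by_year(root, phrase="var på inter"):
    """Count occurrences of `phrase` in the *.xml files of each subdirectory of `root`."""
    counts = {}
    for year_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        total = 0
        for path in year_dir.glob("*.xml"):
            total += path.read_text(encoding="utf-8").count(phrase)
        counts[year_dir.name] = total
    return counts


# Demo on a throwaway directory with one made-up protocol file.
with tempfile.TemporaryDirectory() as tmp:
    year_dir = Path(tmp) / "1980"
    year_dir.mkdir()
    (year_dir / "a.xml").write_text(
        "<note>Svar på interpellationer</note>", encoding="utf-8"
    )
    demo_counts = count_phrase_by_year(tmp)
```

Searching for the substring without its first letter matches both "Svar" and "svar" without needing a case-insensitive search.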

Thanks for posting these! My initial idea was just some kind of combination of these two strategies.

BTW, do we have longer-term ambitions to chunk up the protocols into categorized <div> elements?

Yes, in the long term, we want to chunk the records into debates, such as @ninpnin showed during the meeting. I'm not sure about using a div element or attributes "header" on the notes for the headers. So maybe keep that in mind. How do we get this structure (the headers) into the corpus? Ideally, for the whole corpus.

I talked to @ninpnin yesterday about this and planned to open a PR of chunked unicameral protocols (with div) today as a kind of pre-step to identifying interpellation debates.

My feeling right now is that delimiting sections with <div> elems would make the data easier to work with: parse the tree, get the div elems with attrib type="interpellationDebate" (or whatever), and you're ready. If you just labeled the notes as headers, you'd need to find notes with type="header" and treat everything from there until the next header as a section, which is not as clean to work with.
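
The difference between the two strategies can be sketched with the standard library. The XML below is a simplified, namespace-free stand-in for the real protocols, and the element names and the type value are only the ones floated in this thread, not a settled schema.

```python
import xml.etree.ElementTree as ET

# Strategy 1: sections wrapped in div elements with a type attribute.
nested = ET.fromstring("""<body>
  <div type="interpellationDebate">
    <note type="header">§ 2 Svar på interpellationer</note>
    <u>Speech A</u>
  </div>
  <div type="other">
    <note type="header">§ 3 Something else</note>
    <u>Speech B</u>
  </div>
</body>""")
debates = nested.findall(".//div[@type='interpellationDebate']")
div_speeches = [u.text for d in debates for u in d.iter("u")]

# Strategy 2: flat notes labeled as headers; sections must be rebuilt by scanning.
flat = ET.fromstring("""<body>
  <note type="header">§ 2 Svar på interpellationer</note>
  <u>Speech A</u>
  <note type="header">§ 3 Something else</note>
  <u>Speech B</u>
</body>""")
flat_speeches = []
inside = False
for elem in flat:
    if elem.tag == "note" and elem.get("type") == "header":
        # Every header toggles whether we are inside an interpellation section.
        inside = "interpellation" in (elem.text or "").lower()
    elif inside and elem.tag == "u":
        flat_speeches.append(elem.text)
```

Both give the same speeches here, but the div version is a single query, while the flat version needs a stateful scan that every consumer of the data would have to repeat.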

@MansMeg are there arguments against the div strategy?

For the unicameral period, the code is ready to try making divs. It might also work in andrakammaren, but not in forstakammaren (they don't use "§").

ParlaClarin / ParlaMint suggests using div's https://clarin-eric.github.io/ParlaMint/#chp-div

I don't think there are any downsides to this approach. We should check the quality of the divs, though, but that's a second step that could easily be done, I think. This is what I meant might be a quicker way, so I'm all for this. Maybe start with the unicameral era to keep the PRs manageable?

There's an open PR with sectioning --> \

Potentially sufficient to start labeling sections, but I wait for some of you to look at the sample.

Some small win on this issue: I grabbed a list of all the IP questions from riksdag open data (1998-- ), and checked the question numbers of the ones that have the status as answered (besvarad) against the debate sections that I identified as IP debates (1998--2021/22) and only 2 of 4025 aren't captured by the sectioning. It's less clear to me how to interpret some of the other values (korrekturläst, Skickad), but if we include them, catching the interpellation debates by searching section headers in our new divs remains above 95%.
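
The check described above boils down to a set comparison. A small sketch, assuming question numbers can be matched as plain strings; the id format and the function name are hypothetical.

```python
def coverage(answered_ids, captured_ids):
    """Return the share of answered questions found in captured sections, plus the misses."""
    answered = set(answered_ids)
    missing = answered - set(captured_ids)
    share = 1 - len(missing) / len(answered) if answered else 0.0
    return share, sorted(missing)


# Hypothetical question numbers, not real data.
answered = ["1998/99:1", "1998/99:2", "1998/99:3", "1998/99:4"]
captured = ["1998/99:1", "1998/99:2", "1998/99:4", "1998/99:9"]
share, missing = coverage(answered, captured)
```

With the figures reported above (2 misses out of 4025), the same formula gives a coverage of about 99.95%.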

This is promising. Maybe we can use this as quality control as well?

I'm not sure what the different types mean. Maybe @DrJosefsson knows which ones are the interesting part? I guess ”besvarad” is the important one?

Great! Just to make sure I understand this correctly: you find 96 % of the interpellation debates connected to the interpellations that are categorized as besvarad OR korrekturläst OR skickad. That sounds really good to me. But I don't get why not all of those three categories are just categorized as "besvarade" - I don't get the difference between the categories. I would say that as long as they are debated in the chamber we are interested in all three categories.

The ones that are withdrawn by the legislator who initially wrote the question should not be debated.

you find 96 % of the interpellation debates connected to the interpellations that are categorized as besvarad OR korrekturläst OR skickad

Exactly, for the period 1998 -- 2022, which is what riksdag open data has published. For some of those years the status column is empty, hence the 3k NaN (== empty) values. But if these three categories are the ones that indicate a question was debated, then we're on the right track.

@DrJosefsson @joeri450 -- I checked your manual annotations against the automated extraction, and 5 of 165 (3.03%) are incorrect, if we consider that formally "fråga" is not an ip debate. The good news is those 5 are all false negatives, meaning I didn't tag an intro as part of IP debate that is actually part of one, and those 5 are all in the 1970s, so we knew about that and I will fix the 1970s soon. No instances of calling an intro part of an ip debate when it's something else.
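
For reference, a comparison like this can be tallied as below. It is a sketch on made-up labels, assuming each sampled intro has a manual and an automatic True/False label for "part of an IP debate"; the names are hypothetical.

```python
def compare_annotations(pairs):
    """Tally (manual, automatic) boolean label pairs into confusion counts."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for manual, automatic in pairs:
        if manual and automatic:
            counts["tp"] += 1
        elif automatic:
            counts["fp"] += 1  # tagged as IP debate, but annotators say it is not
        elif manual:
            counts["fn"] += 1  # annotators say IP debate, but it was not tagged
        else:
            counts["tn"] += 1
    return counts


# Made-up labels, not the real annotation sheet.
pairs = [(True, True), (True, False), (False, False), (True, True)]
counts = compare_annotations(pairs)
error_rate = (counts["fp"] + counts["fn"]) / len(pairs)
```

In the terms of the comment above, the 5 errors would all land in "fn", "fp" would be 0, and the error rate would be 5 / 165, about 3.03%.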

Great, thanks @BobBorges !

Ok. So, we are close to 100% accuracy now. I guess it is up to you @DrJosefsson, if this is good enough. I guess @BobBorges also needs to do some final fixes on the 1970s. I guess we can discuss the next steps tomorrow?

welfare-state-analytics / riksdagen-corpus

Extract interpellation debates from the corpus #386