welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Extract interpellation debates from the corpus #386

Closed MansMeg closed 6 months ago

MansMeg commented 11 months ago

We want to extract the interpellation debates from the corpus. We believe these interpellation debates exist in the corpus throughout the whole period. So this task would include:

Extract interpellation debates

Quality control: we need to monitor how well we extract interpellation debates and simple questions.

ninpnin commented 11 months ago

Splitting the protocols in sections by the §-symbols seems to work pretty well (apart from Första Kammaren, which is a low priority anyway).

The code for it is here:

https://github.com/welfare-state-analytics/riksdagen-corpus/blob/main/scripts/split_into_sections.py

We currently don't use it, at least not for the 1920-2022 era.
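
For illustration only, the idea of splitting on §-symbols can be sketched as below. This is not the code in `scripts/split_into_sections.py`; the function name, the regular expression, and the assumption that a section header is a line beginning with "§" followed by a number are all hypothetical.

```python
import re

# Assumption: a new section starts on a line that begins with "§" and a number,
# e.g. "§ 3 Svar på interpellationer". Real protocols may deviate from this.
SECTION_START = re.compile(r"^[ \t]*§[ \t]*\d+", re.MULTILINE)


def split_into_sections(text: str) -> list[str]:
    """Split protocol text into a preamble (if any) plus one chunk per § header."""
    starts = [m.start() for m in SECTION_START.finditer(text)]
    if not starts:
        return [text]
    sections = []
    # Keep whatever precedes the first § header as its own chunk.
    if text[: starts[0]].strip():
        sections.append(text[: starts[0]])
    # Each section runs from its header to the next header (or end of text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        sections.append(text[begin:end])
    return sections
```

A chamber that does not use "§" in its headers would come back as a single chunk, which is consistent with the Första Kammaren caveat above.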

ninpnin commented 11 months ago

In the unicameral era, the exact phrase "Svar på interpellationer" seems to yield decent results too:

corpus_year Number of "var på inter" observed
corpus/protocols/1970 738
corpus/protocols/1971 141
corpus/protocols/1972 132
corpus/protocols/1973 125
corpus/protocols/1974 120
corpus/protocols/1975 76
corpus/protocols/197576 172
corpus/protocols/197677 131
corpus/protocols/197778 183
corpus/protocols/197879 232
corpus/protocols/197980 310
corpus/protocols/1980 0
corpus/protocols/198081 183
corpus/protocols/198182 391
corpus/protocols/198283 328
corpus/protocols/198384 388
corpus/protocols/198485 366
corpus/protocols/198586 377
corpus/protocols/198687 378
corpus/protocols/198788 437
corpus/protocols/198889 388
corpus/protocols/198990 1474
corpus/protocols/199091 181
corpus/protocols/199192 183
corpus/protocols/199293 154
corpus/protocols/199394 134
corpus/protocols/199495 137
corpus/protocols/199596 235
corpus/protocols/199697 333
corpus/protocols/199798 272
corpus/protocols/199899 359
corpus/protocols/19992000 379
corpus/protocols/200001 418
corpus/protocols/200102 426
corpus/protocols/200203 380
corpus/protocols/200304 468
corpus/protocols/200405 584
corpus/protocols/200506 442
corpus/protocols/200607 598
corpus/protocols/200708 691
corpus/protocols/200809 501
corpus/protocols/200910 388
corpus/protocols/201011 374
corpus/protocols/201112 385
corpus/protocols/201213 433
corpus/protocols/201314 448
corpus/protocols/201415 1236
corpus/protocols/201516 1295
corpus/protocols/201617 964
corpus/protocols/201718 1001
corpus/protocols/201819 438
corpus/protocols/201920 732
corpus/protocols/202021 1436
corpus/protocols/202122 699
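
A count like the one in the table could be produced roughly as follows. This is a minimal sketch, assuming one directory per parliamentary year containing `.xml` files; the function name and the file extension are assumptions, and the substring "var på inter" is taken from the table header (it matches both "Svar" and "svar").

```python
from collections import Counter
from pathlib import Path


def count_phrase_by_year(root: str, phrase: str = "var på inter") -> Counter:
    """Count occurrences of `phrase` per year directory under `root`."""
    counts = Counter()
    # Assumption: layout is <root>/<year>/<protocol>.xml
    for path in Path(root).glob("*/*.xml"):
        text = path.read_text(encoding="utf-8")
        # Adding zero still registers the year, so empty years show up as 0.
        counts[path.parent.name] += text.count(phrase)
    return counts
```

Usage would be along the lines of `count_phrase_by_year("corpus/protocols")`, followed by printing the sorted items.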
BobBorges commented 11 months ago

Thanks for posting these! My initial idea was just some kind of combination of these two strategies.

BTW, do we have longer-term ambitions to chunk up the protocols into categorized <div> elements?

MansMeg commented 11 months ago

Yes, in the long term, we want to chunk the records into debates, as @ninpnin showed during the meeting. I'm not sure whether to use a div element or a "header" attribute on the notes for the headers, so maybe keep that in mind. How do we get this structure (the headers) into the corpus? Ideally, for the whole corpus.

BobBorges commented 11 months ago

I talked to @ninpnin yesterday about this and planned to open a PR of chunked unicameral protocols (with div) today as a kind of pre-step to identifying interpellation debates.

My feeling right now is that delimiting sections with <div> elems would make the data easier to work with: parse the tree, get the div elems with attrib type="interpellationDebate" (or whatever), and you're ready. If you just label the notes as headers, you'd need to find notes with type="header" and treat everything from there until the next header as a section, which is not as clean to work with.

@MansMeg are there arguments against the div strategy?

For the unicameral period, the code is ready to try making divs. It might also work for Andra kammaren, but not for Första kammaren (they don't use "§").
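
To make the comparison concrete, here is a small sketch of both retrieval styles using Python's standard `xml.etree.ElementTree`. The sample XML, the helper names, and the type value "interpellationDebate" are illustrative assumptions, not the corpus's actual markup.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical nested layout: each section is wrapped in a typed <div>.
NESTED = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div type="interpellationDebate">
    <note type="header">§ 3 Svar på interpellation</note>
    <u who="a">Herr talman!</u>
  </div>
  <div type="other"><note type="header">§ 4 Bordläggning</note></div>
</body></text></TEI>"""

# Hypothetical flat layout: headers are only labelled notes among siblings.
FLAT = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <note type="header">§ 3 Svar på interpellation</note>
  <u who="a">Herr talman!</u>
  <note type="header">§ 4 Bordläggning</note>
</body></text></TEI>"""


def debates_from_divs(root):
    """Nested layout: one filter over the <div> elements is enough."""
    return [d for d in root.iter(f"{TEI}div") if d.get("type") == "interpellationDebate"]


def sections_from_headers(body):
    """Flat layout: group siblings from each header note up to the next one."""
    sections, current = [], None
    for elem in body:
        if elem.tag == f"{TEI}note" and elem.get("type") == "header":
            current = [elem]
            sections.append(current)
        elif current is not None:
            current.append(elem)
    return sections
```

With the flat layout, the caller still has to classify each group afterwards, which is the extra bookkeeping described above.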

ninpnin commented 11 months ago

Parla-CLARIN / ParlaMint suggests using divs: https://clarin-eric.github.io/ParlaMint/#chp-div

MansMeg commented 11 months ago

I don't think there are any downsides to this approach. We should check the quality of the divs, though, but that's a second step that could easily be done, I think. This is what I meant might be a quicker way, so I'm all for this. Maybe start with the unicameral era to keep the PRs manageable?

BobBorges commented 11 months ago

There's an open PR with sectioning. It is potentially sufficient to start labeling sections, but I'll wait for some of you to look at the sample.

BobBorges commented 10 months ago

A small win on this issue: I grabbed a list of all the IP questions from Riksdag open data (1998 onwards) and checked the question numbers of the ones with the status answered (besvarad) against the debate sections that I identified as IP debates (1998 to 2021/22). Only 2 of 4025 aren't captured by the sectioning. It's less clear to me how to interpret some of the other status values (korrekturläst, Skickad), but if we include them, the share of interpellation debates caught by searching section headers in our new divs remains above 95%.

image
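
The check described above amounts to a set comparison. A minimal sketch, assuming question numbers of the form "2019/20:123" appear verbatim in the section headers (that format, the regular expression, and the function name are assumptions for illustration):

```python
import re

# Assumption: question numbers look like "2019/20:123" or "1999/2000:45".
ID_PATTERN = re.compile(r"\d{4}/\d{2,4}:\d+")


def coverage(answered_ids: set[str], section_headers: list[str]) -> tuple[float, set[str]]:
    """Return the share of answered questions found in headers, plus the missing IDs."""
    found = {qid for header in section_headers for qid in ID_PATTERN.findall(header)}
    missing = set(answered_ids) - found
    share = 1 - len(missing) / len(answered_ids) if answered_ids else 0.0
    return share, missing
```

With the figures reported above (2 of 4025 missed), the share would be 4023/4025, roughly 99.95%.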

MansMeg commented 10 months ago

This is promising. Maybe we can use this as quality control as well?

I'm not sure what the different status types mean. Maybe @DrJosefsson knows which ones are of interest? I guess "besvarad" is the important one?

DrJosefsson commented 10 months ago

Great! Just to make sure I understand this correctly: you find 96 % of the interpellation debates connected to the interpellations that are categorized as besvarad OR korrekturläst OR skickad. That sounds really good to me. But I don't understand why those three categories aren't all simply categorized as "besvarade"; I don't see the difference between them. I would say that as long as they are debated in the chamber, we are interested in all three categories.

The ones that are withdrawn by the legislator who initially wrote the question should not be debated.

BobBorges commented 10 months ago

> you find 96 % of the interpellation debates connected to the interpellations that are categorized as besvarad OR korrekturläst OR skickad

Exactly, for the period 1998-2022, for which Riksdag open data has been published. For some of those years the status column is empty, hence the 3k NaN (== empty) values. But if these three categories are the ones that indicate a question was debated, then we're on the right track.

DrJosefsson commented 10 months ago

Sounds good to me!

BobBorges commented 10 months ago

@DrJosefsson @joeri450 -- I checked your manual annotations against the automated extraction, and 5 of 165 (3.03%) are incorrect, if we consider that formally "fråga" is not an IP debate. The good news is that those 5 are all false negatives, meaning I didn't tag an intro as part of an IP debate when it actually is part of one. Those 5 are all in the 1970s, so we knew about that, and I will fix the 1970s soon. There are no instances of calling an intro part of an IP debate when it's something else.

image
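
A comparison of this kind can be sketched as follows, assuming each annotated intro is reduced to a boolean ("is part of an IP debate") in both the manual and the automated labelling. The function name and the data layout are hypothetical.

```python
def compare(gold: list[bool], predicted: list[bool]) -> dict:
    """Count false positives and false negatives of `predicted` against `gold`."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    false_positives = sum(1 for g, p in pairs if p and not g)
    false_negatives = sum(1 for g, p in pairs if g and not p)
    errors = false_positives + false_negatives
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "error_rate": errors / len(pairs) if pairs else 0.0,
    }
```

With 5 false negatives, no false positives, and 165 annotations, the error rate is 5/165, about 3.03%, matching the figure above.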

DrJosefsson commented 10 months ago

Great, thanks @BobBorges !

MansMeg commented 10 months ago

Ok. So, we are close to 100% accuracy now. I guess it is up to you @DrJosefsson, if this is good enough. I guess @BobBorges also needs to do some final fixes on the 1970s. I guess we can discuss the next steps tomorrow?