welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

MP unit test coverage #371

Closed BobBorges closed 4 months ago

BobBorges commented 11 months ago

I built unit tests for MPs on the known_mp_catalog. However, after looking at the issue posted by @TomasSkotare #203, I've come to realize there are MPs in our metadata that aren't listed in the bio books (which we must have realized at some point because it's listed in wikidata for some of these entries).

This means, unfortunately, that the list of partyless MPs generated under #349 is incomplete, and other unit tests that rely on this catalog are potentially not comprehensive. Some of the entries in the #203 list now have a party on wikidata, but the ones that don't also seem to be missing from the #349 list. To fix:

fredrik1984 commented 11 months ago

Hm. Are these "MPs" really MPs? We know, for example, that some persons who spoke in the riksdag were only ministers and were never elected as MPs, but still addressed the parliament sometimes. For example, Ulf Dinkelspiel (https://www.wikidata.org/wiki/Q5622212) and Christian Günther (https://www.wikidata.org/wiki/Q1777531). Looking at the top of Tomas's list in #203, several seem to be ministers who were never MPs. The same goes for the monarch, who I suppose is part of our list of speakers in the parliament.

But yes, it would be good to have a source for these speaking non-MPs.

BobBorges commented 11 months ago

Bad news: 1,695 wiki_ids are not in the MP catalog.

Not-so-bad news: only 154 of them have no party info.
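
For reference, the comparison behind counts like these can be sketched as a plain set difference. This is only an illustration under assumptions, not the project's actual test code: the function name `ids_missing_from_catalog`, the input shapes, and the toy IDs are all made up for the example.

```python
def ids_missing_from_catalog(metadata_ids, catalog_ids, party_by_id):
    """Return (missing, missing_without_party) as sorted lists.

    metadata_ids: wiki IDs found in the metadata (hypothetical input)
    catalog_ids:  wiki IDs found in the MP catalog (hypothetical input)
    party_by_id:  mapping wiki ID -> party string, absent or empty if unknown
    """
    # IDs that appear in the metadata but not in the catalog
    missing = set(metadata_ids) - set(catalog_ids)
    # Of those, the ones with no party information at all
    no_party = {wiki_id for wiki_id in missing if not party_by_id.get(wiki_id)}
    return sorted(missing), sorted(no_party)


# Toy data, not real wikidata IDs
metadata_ids = ["QA", "QB", "QC", "QD"]
catalog_ids = ["QA", "QB"]
party_by_id = {"QC": "some party"}

missing, no_party = ids_missing_from_catalog(metadata_ids, catalog_ids, party_by_id)
print(len(missing), "IDs not in the catalog;", len(no_party), "of them without party info")
```

Keeping the two sets separate makes it easy to triage: the IDs without party info are the ones that affect the partyless-MP list directly.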

fredrik1984 commented 11 months ago

> Bad news: 1,695 wiki_ids are not in the MP catalog

I don't get this. Why are there so many wiki-ids that are not in the MP catalog? Are these MPs from after the bio books? @BobBorges please explain this, and what it means, if you can.

BobBorges commented 11 months ago

I don't have a good explanation for this just yet. It means our tests are not testing everything we think they are, so it's top of my list to sort out. But here's the first person I looked up from the list of IDs not in the catalog: we've got Gustaf with his brother's DOB and wiki ID. This will account for at least part of the problem.

BobBorges commented 11 months ago

Second person: only one entry in the MP catalog. The bio book reference appears to be for the guy who's not in the catalog. The guy in our catalog is on wikidata as andrakammarledamot, starting his mandate 56 years before he was born.

BobBorges commented 11 months ago

Third guy (Q5948965): Not in our metadata or catalog, despite two roles in FK. I don't understand why our query doesn't catch this guy.

BobBorges commented 11 months ago

Number 4: duplicate on wikidata -- only one in the mp catalog

fredrik1984 commented 11 months ago

OK, strange. Maybe @ninpnin and @MansMeg have some idea about why this is happening?

MansMeg commented 11 months ago

It's great that you spotted this, @BobBorges! This is an additional argument for an iterative approach.

So I can see three reasons:

1) People are missing in the biobooks person registry. We built the checks on the biobooks, so we only have guarantees that people included in the registers of the biobooks are included. We assumed this was everyone.
2) Duplicates (as one example indicates). The check only verifies that one of them is included, but these are not real misses.
3) There are errors in Emil's file or in the unit test.

It seems like at least 1) and 2) can be reasons. I think 1) is the most severe if it is true. That could also explain the difference in quality during the 19th century. But where do we find these people if not in the biobooks?

Also, the chairs data then becomes even more important, since it will check this type of consistency.

We should probably add a unit test on birth date, death date and mandate periods once we get the mandate period dates up to date.

Although, the quality plot (i.e. the number of persons per year) does not seem to indicate such a large number of missing persons.
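
A minimal sketch of what such a date check could look like, assuming each person is available as a dict with `born`, `dead` and a list of `mandates` as (start, end) date pairs. That record layout and the function name `find_date_problems` are assumptions for illustration, not the corpus's actual schema.

```python
from datetime import date


def find_date_problems(person):
    """Return a list of human-readable date inconsistencies for one person.

    person: dict with optional 'born' and 'dead' (datetime.date or None)
    and 'mandates', a list of (start, end) date tuples. Hypothetical layout.
    """
    problems = []
    born = person.get("born")
    dead = person.get("dead")
    if born and dead and dead < born:
        problems.append("death before birth")
    for start, end in person.get("mandates", []):
        if end < start:
            problems.append("mandate ends before it starts")
        if born and start < born:
            problems.append("mandate starts before birth")
        if dead and start > dead:
            problems.append("mandate starts after death")
    return problems


# Toy record resembling the case of a mandate starting decades before birth
example = {
    "born": date(1900, 1, 1),
    "dead": None,
    "mandates": [(date(1844, 1, 1), date(1850, 12, 31))],
}
print(find_date_problems(example))
```

In a unit test, the assertion would simply be that this list is empty for every person in the metadata.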

BobBorges commented 11 months ago

I have some ideas about partially automating solutions to some of these problems. I will check a few more by hand and try to implement some fixes tomorrow.

I don't understand the example who has a role, but wasn't returned by the query.

MansMeg commented 11 months ago

We should also think of good data integrity checks to try to capture this type of error in the future.
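
One such check could flag likely duplicates, i.e. several wiki IDs sharing the same name and birth date. The sketch below is only an assumption about how that might be done: the record fields (`name`, `born`, `wiki_id`) and the function name `find_possible_duplicates` are hypothetical, and exact matching on name and birth date is a deliberately crude criterion.

```python
from collections import defaultdict


def find_possible_duplicates(records):
    """Group records by (normalized name, birth date) and return the groups
    that contain more than one distinct wiki ID.

    records: iterable of dicts with 'name', 'born' and 'wiki_id' keys
    (hypothetical layout).
    """
    groups = defaultdict(set)
    for rec in records:
        key = (rec["name"].strip().lower(), rec["born"])
        groups[key].add(rec["wiki_id"])
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


# Toy data, not real persons or wikidata IDs
records = [
    {"name": "Anna Exempel", "born": "1850-03-01", "wiki_id": "QA"},
    {"name": "anna exempel ", "born": "1850-03-01", "wiki_id": "QB"},
    {"name": "Bertil Exempel", "born": "1861-07-12", "wiki_id": "QC"},
]
duplicates = find_possible_duplicates(records)
print(duplicates)
```

Anything this returns would still need a manual look, since two different people can share a name and a birth date.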

fredrik1984 commented 11 months ago

> It seems like at least 1) and 2) can be reasons. I think 1) is the most severe if it is true. That could also explain the difference in quality during the 19th century. But where do we find these people if not in the biobooks?

My hunch is that there can't be that many missing from the biobook register. Other sources to look for MPs are the matriklar in the riksdag registers, and maybe the annual Statskalendern.

BobBorges commented 11 months ago

This issue now seems to be less severe than I thought yesterday! Let me whittle some of the problem away

I started with a list of ca 1700 wiki IDs that aren't in our catalog.

Verifying / fixing a couple hundred potential mistakes will be much better than 1,700!

MansMeg commented 11 months ago

So if I understand correctly, only 175 could be potential errors with respect to our test set on the biobook registry?

So are the rest from after 1993/94? I think that is equally bad? It's just that we don't have a unit test for this?