Closed BobBorges closed 8 months ago
In order to best decide where to start improving the chairs/party coverage, I'm attaching a summary stats json file where you can look at coverage of each year/chamber (delete .txt from the extension -- GitHub doesn't allow json :| ).
Edit: new eval file -- metafile_eval.json.txt
Start and end dates are still an issue here. The chairs data came with "parliament_year" as a column, and like our other metadata, the actual start & end dates often need to be inferred. It would be good, I think, if we use this as an impetus to create a metadata file with something like parliament_year, start, end, which will be useful for the chairs and other parts of our work.
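A minimal sketch of how such a file could be read and queried, assuming the three columns proposed here (parliament_year, start, end). The sample rows are invented placeholders, not real session dates, and the helper names `load_parliament_years` and `date_range_for` are hypothetical.

```python
import csv
import io
from datetime import date

# Placeholder rows only: these dates are invented for illustration.
SAMPLE = """parliament_year,start,end
1901,1901-01-15,1901-05-30
1902,1902-01-15,1902-05-30
"""

def load_parliament_years(handle):
    """Map each parliament_year to its (start, end) dates."""
    years = {}
    for row in csv.DictReader(handle):
        years[row["parliament_year"]] = (
            date.fromisoformat(row["start"]),
            date.fromisoformat(row["end"]),
        )
    return years

def date_range_for(years, parliament_year):
    """Return (start, end) for a parliament year, or None if it is not listed."""
    return years.get(parliament_year)

years = load_parliament_years(io.StringIO(SAMPLE))
print(date_range_for(years, "1901"))
```

Rows in the chairs data that only carry a parliament_year could then inherit start and end from this lookup instead of each file inferring the dates separately.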
For those names that we now have a corresponding wiki_id, I can try to pull start/end dates from the metadata if it is deemed necessary.
There's some junk in the chairs data party columns. I could fairly easily remove/change it, which may also improve the matching, but I don't feel completely competent to decide what's complete junk and which junk might be recognizable as not-junk (e.g. in the case of an ocr error).
This looks great Bob!
Some quick comments. We should add a test that we have all chairs and no incorrect chairs in chairs.csv. I saw a value of 1300ish as a chair, which is obviously incorrect, so I suggest we remove such entries.
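Such a test could look roughly like the sketch below. It assumes rows with chair_nr and chamber columns; the chamber names and upper limits are hypothetical stand-ins, since the real number of chairs per chamber would have to come from the project.

```python
def out_of_range_chairs(rows, max_chair_nr):
    """Return rows whose chair_nr is not between 1 and the chamber's limit."""
    bad = []
    for row in rows:
        limit = max_chair_nr.get(row["chamber"])
        try:
            nr = int(row["chair_nr"])
        except ValueError:
            bad.append(row)  # non-numeric chair numbers are flagged too
            continue
        # Unknown chambers are flagged as well, since no limit can be checked.
        if limit is None or not 1 <= nr <= limit:
            bad.append(row)
    return bad

# Hypothetical limits and rows, for illustration only.
limits = {"chamber_a": 200, "chamber_b": 300}
rows = [
    {"chair_nr": "12", "chamber": "chamber_a"},
    {"chair_nr": "1300", "chamber": "chamber_b"},
]
flagged = out_of_range_chairs(rows, limits)
print(flagged)
```

A unit test would then simply assert that the flagged list is empty for the real file.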
I think your idea of creating a parliament_year.csv metadata file makes a lot of sense. Should we do it in this PR or separately? I'm not sure if this data exists on Wikidata.
Also, this is a lot of info. Maybe you could do a presentation about this on Friday?
My general thought is that we can try to see how far we can get with computational means. Then the vote protocol project can fill in the gaps when the general structure is set.
I think your idea of creating a parliament_year.csv metadata file makes a lot of sense. Should we do it in this PR or separately? I'm not sure if this data exists on Wikidata.
@MansMeg @BobBorges of course there is such a list!: https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_svenska_riksdagar
I also think it would be good with a short chair status presentation on Friday when Jan Teorell is there as well. The Friday meeting will most likely circle around chair and party
Of course there is such a list. :)
I've got the list and am making a metadata file out of it. @MansMeg I didn't want to delete data, but indeed out-of-range chair nrs should not be allowed in the file.
Should this PR include the code used for data processing?
Should this PR include the code used for data processing?
I'm happy to add whatever code if the rest of you want to have it. I didn't already put it in the PR b/c I guess once the metadata files are in the corpus, we probably won't reuse that code again.
we probably won't reuse that code again
That sounds like famous last words..
But if Måns agrees, let's just have the PR without the code.
famous last words
Code is not deleted, just not committed to the repo :D
Long term we don't want the code in this repo anyway (data is the state, not the code).
Today I sent the chairs project
I can add/edit data that will be returned to me and try again to match unmatched names to wiki-ids.
With matched name-wiki_ids, we can triangulate wikidata MP metadata with their project data to both verify data and potentially fill in gaps in either project's data set.
Part of the discussion in old duplicate issues was about helping the chairs project with extracting data. It fell through the cracks a little, but I've been looking at it today. OCR made a mess of the tables in some of the documents, so there's no easy way to do it - I don't know what @JoeNoonan already tried, but there does seem to be some pattern to that mess. So depending on how long Lirre (someone please tag him) thinks manually generating 1800+ rows would take, maybe we could try mixing a bit of code and manual work to wrangle the missing data faster.
I've now formatted the chairs.csv and chair_mp.csv files according to our ideas about metadata and added 6 unit tests about the metadata integrity. One will still fail because there are 97 wiki IDs in two chairs at once. These obviously need to get sorted out somehow; I think I will just add it to my list of things to do manually. I looked at a handful of them and the reasons may be:
There's no unit test for coverage, because we aren't up to speed with that yet. And I could add some other triangulation tests, like we discussed elsewhere -- the person is associated with a chair in the same general time frame (year) as their mandate.
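For reference, the "two chairs at once" check can be sketched as below. This is not the test from the PR, just an illustration assuming rows with wiki_id, chair_id, start, and end in ISO date format; the IDs and dates are made up. It treats a chair that ends on the same day another begins as a handover rather than an overlap, which is an assumption.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations

def people_in_two_chairs(rows):
    """Return wiki_ids that hold two different chairs on overlapping dates."""
    by_person = defaultdict(list)
    for row in rows:
        span = (
            date.fromisoformat(row["start"]),
            date.fromisoformat(row["end"]),
            row["chair_id"],
        )
        by_person[row["wiki_id"]].append(span)

    flagged = set()
    for wiki_id, spans in by_person.items():
        # Compare every pair of spans for this person.
        for (s1, e1, c1), (s2, e2, c2) in combinations(spans, 2):
            # Strict comparison: a same-day handover is not counted as overlap.
            if c1 != c2 and s1 < e2 and s2 < e1:
                flagged.add(wiki_id)
    return flagged

# Made-up rows for illustration.
rows = [
    {"wiki_id": "Q_a", "chair_id": "chair-1", "start": "1950-01-10", "end": "1950-12-20"},
    {"wiki_id": "Q_a", "chair_id": "chair-2", "start": "1950-06-01", "end": "1951-05-30"},
    {"wiki_id": "Q_b", "chair_id": "chair-3", "start": "1950-01-10", "end": "1950-12-20"},
    {"wiki_id": "Q_b", "chair_id": "chair-4", "start": "1951-01-10", "end": "1951-12-20"},
]
overlapping = people_in_two_chairs(rows)
print(sorted(overlapping))
```

Dumping the flagged IDs to a sheet would give a worklist for sorting out the cases manually.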
@Lirre @JoeNoonan I still have the aggregated chair/chair_mp file if you need that.
If you have a lot to do, you can also just create a google sheet that @fredrik1984 or @salgo60 could work with re the 97. I have the impression that they both like this type of bug hunting. =)
I checked the names in the original chairs data against the wiki_id's Lirre filled in, and there are a dozen that I think need double checking.
As an example:
The top row is the wiki_id + name from Lirre's work and the table immediately below are all the name variants for that wiki_id in our metadata.
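One way to shortlist such cases automatically is to compare the name from the chairs data with every known variant for the same wiki_id and flag it when none is close. The sketch below uses the standard library's difflib; the 0.8 threshold and the example names are arbitrary assumptions, not values from the project.

```python
from difflib import SequenceMatcher

def best_similarity(name, variants):
    """Highest case-insensitive similarity between name and any known variant."""
    return max(
        (SequenceMatcher(None, name.casefold(), v.casefold()).ratio() for v in variants),
        default=0.0,
    )

def needs_double_check(name, variants, threshold=0.8):
    """True if no variant is at least `threshold` similar to the name."""
    return best_similarity(name, variants) < threshold

# Invented names for illustration.
print(needs_double_check("Carl Anderson", ["Karl Andersson", "K. Andersson"]))
print(needs_double_check("Anna Berg", ["Olof Lindqvist"]))
```

A spelling variant stays under the radar while a completely different name gets flagged for manual review.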
@SimonHallen
I guess it would add value if all those spellings of a person's name could be added as aliases in WD.
In WD we have chairs, see below. I am not sure a chair in "Riksdagen" is worth tracking for WD....
These chairs probably have a different meaning than a seat in the Riksdag (the Swedish Parliament), but it's still a chair 😄
Well spotted! @SimonHallen, I think you need to double-check the wiki IDs in @BobBorges' file.
I will take a look at it!
@MansMeg I can't see where to mark that your requested changes have been implemented, but when tests pass here, this will be ready.
Great! As long as the tests pass I'm happy.
Just a question: I can't see exactly which files have been changed because there are so many. I would not expect changes in this many files. Why is this the case?
I needed the stuff I did in the SWERIK-ID branch, so I pulled from there.
Ah. Of course. Also this still fails?
sorry :|
Here comes some metadata related to chairs that I've been working on. What's here:
files
chairs.csv: this is a kind of map file with columns chair_id (arbitrary uuid), chair_nr, and chamber
chair_mp.csv: this is a file where I tried to match the name in the data we got from the Teorell project ("chairs data" henceforth) to the wiki_id of an individual MP.
seat_id, seat_nr, chamber, wiki_id, parliament_year, start, end, name, iort, chairs_party, wikidata_party, birth_year --- mostly they are self explanatory, but:
seat_id and seat_nr should change to match chair_id, chair_nr (or vice versa)
chairs_party and wikidata_party are party info from the chairs data/wikidata respectively
chair_id, wiki_id, start, end
summary stats
20,993 (previously 17,407) of those names with a wiki_id (75.94%, previously 62.99%, of names)
party coverage
19,094 (previously 8,483) rows with non-null party assignment (69.07%, previously 30.7%, of rows w/ a name, assuming rows w/ no name also don't have a party)
17,432 (previously 14,220) (60.06%, previously 51.45%, of rows w/name)
12,565 (previously 12,161) rows (45.45%, previously 44%) have null in either (but not both) chairs_party or wikidata_party -- this is good because we can improve coverage of both data sets with this info.
8,495 (previously 6,856) rows (30.73%, previously 25.17%) have an exact match (case insensitive) in the chairs_party and wikidata_party
chairs_party and wikidata_party because I'm not sure if the two data sets use the same abbreviations
Edit: updated numbers after bug fix.
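For anyone wanting to reproduce counts of this kind, here is a rough sketch of how the "null in either (but not both)" and "exact match (case insensitive)" figures could be computed. It is not the script behind the numbers above; it assumes rows with name, chairs_party, and wikidata_party columns, and the sample rows are invented.

```python
def has_value(row, column):
    """True if the column holds a non-empty, non-whitespace value."""
    return bool((row.get(column) or "").strip())

def party_stats(rows):
    """Count party coverage among rows that have a name."""
    named = [r for r in rows if has_value(r, "name")]
    one_null = 0
    exact = 0
    for r in named:
        chairs = has_value(r, "chairs_party")
        wikidata = has_value(r, "wikidata_party")
        if chairs != wikidata:
            one_null += 1  # exactly one of the two party columns is filled
        elif chairs and wikidata:
            a = r["chairs_party"].strip().casefold()
            b = r["wikidata_party"].strip().casefold()
            if a == b:
                exact += 1  # case-insensitive exact match
    return {"named": len(named), "one_null": one_null, "exact_match": exact}

# Invented rows for illustration.
rows = [
    {"name": "A", "chairs_party": "S", "wikidata_party": "s"},
    {"name": "B", "chairs_party": "", "wikidata_party": "M"},
    {"name": "C", "chairs_party": "L", "wikidata_party": ""},
    {"name": "D", "chairs_party": "C", "wikidata_party": "Centerpartiet"},
    {"name": "", "chairs_party": "", "wikidata_party": ""},
]
stats = party_stats(rows)
print(stats)
```

Row D shows why exact matching undercounts agreement: an abbreviation and a full party name refer to the same party but do not compare equal without a mapping between the two data sets' conventions.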