adding metadata related to chairs

BobBorges commented 1 year ago

Here comes some metadata related to chairs that I've been working on. What's here:

files

chairs.csv: this is a kind of map file with columns chair_id (arbitrary uuid), chair_nr, and chamber
chair_mp.csv: this is a file where I tried to match the name in the data we got from the Teorell project ("chairs data" henceforth) to the wiki_id of an individual MP.
--- right now, I left a bunch of columns in this file that should be removed later, for the purpose of easily evaluating quality of the matching and adding additional information
--- current columns: seat_id, seat_nr, chamber, wiki_id, parliament_year, start, end, name, iort, chairs_party, wikidata_party, birth_year --- mostly they are self explanatory, but:
--- --- seat_id and seat_nr should change to match chair_id, chair_nr (or vice versa)
--- --- chairs_party and wikidata_party are party info from the chairs data/wikidata respectively
--- planned cols for this file: chair_id, wiki_id, start, end
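A minimal sketch of how chair_mp.csv could be trimmed to the planned columns once evaluation is done. It assumes seat_id is renamed to chair_id (rather than the reverse); the function name `trim_chair_mp` and the sample row are hypothetical and are not real corpus data.

```python
import csv
import io

# Planned final columns for chair_mp.csv, as listed above.
PLANNED_COLS = ["chair_id", "wiki_id", "start", "end"]


def trim_chair_mp(rows):
    """Rename seat_id to chair_id and keep only the planned columns."""
    trimmed = []
    for row in rows:
        row = dict(row)
        # Assumption: seat_id is renamed to match chairs.csv, not vice versa.
        if "seat_id" in row:
            row["chair_id"] = row.pop("seat_id")
        trimmed.append({col: row.get(col, "") for col in PLANNED_COLS})
    return trimmed


# Made-up example row; the values are illustrative only.
sample = io.StringIO(
    "seat_id,seat_nr,chamber,wiki_id,parliament_year,start,end,name\n"
    "abc-123,12,fk,Q0000,1920,1920-01-10,1920-06-20,Example Name\n"
)
trimmed_rows = trim_chair_mp(csv.DictReader(sample))
```

The evaluation-only columns (name, iort, the two party columns, birth_year) are simply dropped here, so this step would only run after the matching has been checked.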

summary stats

merging the chairs data files amounted to 29,501 lines, of which 27,644 had a name string
using the same algorithms we use to match intros in the protocols, we matched ~~17,407~~ 20,993 of those names with a wiki_id (~~62,99~~ 75.94% of names)

party coverage

the chairs data came with ~~8,483~~ 19,094 rows with non-null party assignment (~~30.7~~ 69.07% of rows w/ a name, assuming rows w/ no name also don't have a party)
the name-matched wiki IDs with party coverage are ~~14,220~~ 17,432 (~~51.45~~ 63.06% of rows w/name)
skimming the csv file, one can see many rows where one data set has party coverage and the other doesn't; ~~12,161~~ 12,565 rows (~~44~~ 45.45%) have null in either (but not both) chairs_party or wikidata_party -- this is good because we can improve coverage of both data sets with this info.
~~6,856~~ 8,495 rows (~~25.17~~ 30.73%) have an exact match (case insensitive) in the chairs_party and wikidata_party
I didn't calculate mismatches between chairs_party and wikidata_party because I'm not sure if the two data sets use the same abbreviations
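The party coverage figures above could be computed along these lines. This is a sketch rather than the code actually used: it assumes rows are dicts with the name, chairs_party and wikidata_party columns, and that empty strings count as null. The helper names (`is_null`, `party_coverage`, `pct`) and the sample rows are hypothetical.

```python
def is_null(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def party_coverage(rows):
    """Count party coverage over rows that have a name string."""
    named = [r for r in rows if not is_null(r.get("name"))]
    stats = {"named": len(named), "chairs_party": 0,
             "wikidata_party": 0, "xor": 0, "exact_match": 0}
    for r in named:
        chairs, wikidata = r.get("chairs_party"), r.get("wikidata_party")
        has_c, has_w = not is_null(chairs), not is_null(wikidata)
        stats["chairs_party"] += has_c
        stats["wikidata_party"] += has_w
        # Exactly one of the two columns has a value.
        stats["xor"] += has_c != has_w
        # Case-insensitive exact match, only when both are present.
        if has_c and has_w and str(chairs).strip().lower() == str(wikidata).strip().lower():
            stats["exact_match"] += 1
    return stats


def pct(count, total):
    """Percentage rounded to two decimals; 0.0 when total is zero."""
    return round(100 * count / total, 2) if total else 0.0


# Made-up rows, not real corpus data.
sample_rows = [
    {"name": "A", "chairs_party": "S", "wikidata_party": "s"},
    {"name": "B", "chairs_party": "", "wikidata_party": "L"},
    {"name": "C", "chairs_party": "M", "wikidata_party": None},
    {"name": "", "chairs_party": "S", "wikidata_party": "S"},
    {"name": "D", "chairs_party": "S", "wikidata_party": "M"},
]
coverage = party_coverage(sample_rows)
```

Rows where both columns are filled but differ (like row D) are neither xor nor exact matches; those are the potential mismatches that would need a shared abbreviation table before they can be judged.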

Edit: updated numbers after bug fix.

BobBorges commented 1 year ago

To help decide where to start improving the chairs/party coverage, I'm attaching a summary stats json file where you can look at the coverage of each year/chamber (delete .txt from the extension -- GitHub doesn't allow json :| ).

N_rows: number of rows for the chamber/year combo
missing names: number of rows without a name string
missing_wiki_id: rows where the name didn't match to a wiki_id
baseline_N_chairs: should-be N chairs
out_of_range: if a chair number isn't within the expected limits of N_chairs given the year/chamber
duplicate_chairs: those which occur more than once
missing_chairs: chair numbers not present in the sequence of 1:baseline_N_chairs
chairs_party_coverage: non-null party vals from the chairs data
wikidata_party_coverage: non-null party vals from wikidata after name matching
party_coverage_xor: one party column has null value the other doesn't
party_coverage_exact_match: case insensitive match of both party columns
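As an illustration of the three chair-sequence fields (out_of_range, duplicate_chairs, missing_chairs), here is a minimal sketch for a single chamber/year combination. The function name `chair_sequence_report` is hypothetical, and it assumes chair numbers are integers and that the expected sequence is 1 through baseline_N_chairs inclusive.

```python
from collections import Counter


def chair_sequence_report(chair_nrs, baseline_n_chairs):
    """Classify chair numbers for one chamber/year combination."""
    counts = Counter(chair_nrs)
    expected = range(1, baseline_n_chairs + 1)
    return {
        # Chair numbers outside 1..baseline_n_chairs.
        "out_of_range": sorted(n for n in counts if n not in expected),
        # Chair numbers that occur more than once.
        "duplicate_chairs": sorted(n for n, c in counts.items() if c > 1),
        # Expected chair numbers that never occur.
        "missing_chairs": [n for n in expected if n not in counts],
    }


# Made-up chair numbers for a chamber that should have 5 chairs.
report = chair_sequence_report([1, 2, 2, 4, 1300], baseline_n_chairs=5)
```

In this sketch a number can be both out of range and duplicated; whether the real stats file counts such cases in both fields is not stated above.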

metafile_eval.json.txt

Edit: new eval file -- metafile_eval.json.txt

BobBorges commented 1 year ago

dates

Start and end dates are still an issue here. The chairs data came with "parliament_year" as a column, and like our other metadata, the actual start & end dates often need to be inferred. It would be good, I think, if we use this as an impetus to create a metadata file with something like parliament_year, start, end, which will be useful for the chairs and other parts of our work.
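A sketch of how such a lookup file could be used to fill in missing dates. The file layout follows the parliament_year, start, end idea above, but the dates below are placeholders and not the actual session dates; `load_parliament_years` and `fill_dates` are hypothetical names.

```python
import csv
import io

# Hypothetical contents of the proposed metadata file.
# The dates are placeholders, NOT real session dates.
PARLIAMENT_YEAR_CSV = """parliament_year,start,end
1920,1920-01-10,1920-06-20
1921,1921-01-11,1921-06-25
"""


def load_parliament_years(text):
    """Map parliament_year -> (start, end)."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["parliament_year"]: (row["start"], row["end"]) for row in reader}


def fill_dates(row, sessions):
    """Fill missing start/end from the lookup; keep values that already exist."""
    filled = dict(row)
    start, end = sessions.get(row["parliament_year"], ("", ""))
    filled["start"] = row.get("start") or start
    filled["end"] = row.get("end") or end
    return filled


sessions = load_parliament_years(PARLIAMENT_YEAR_CSV)
inferred = fill_dates({"parliament_year": "1920", "start": "", "end": ""}, sessions)
```

Dates that are already present on a row take precedence, so more precise per-person dates from other metadata would not be overwritten by the session-level defaults.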

For those names that we now have a corresponding wiki_id, I can try to pull start/end dates from the metadata if it is deemed necessary.

BobBorges commented 1 year ago

party coverage

There's some junk in the chairs data party columns. I could fairly easily remove/change it, which may also improve the matching, but I don't feel completely competent to decide what's complete junk and which junk might be recognizable as not-junk (e.g. in the case of an ocr error).

party_counts.json.txt

MansMeg commented 1 year ago

This looks great Bob!

Some quick comments. We should add a test that we have all chairs and no incorrect chairs in the chairs.csv. I saw a value of 1300ish as a chair number, which is obviously incorrect. So I suggest we remove such entries.

I think your idea of creating a parliament_year.csv metadata file makes a lot of sense. Should we do it in this PR or separately? I'm not sure if this data exists on wikidata?

Also, this is a lot of info. Maybe you could do a presentation about this on Friday?

MansMeg commented 1 year ago

My general thought is that we can try to see how far we can get with computational means. Then the vote protocol project can fill in the gaps when the general structure is set.

fredrik1984 commented 12 months ago

> I think your idea of creating a parliament_year.csv metadata file makes a lot of sense. Should we do it in this PR or separately? I'm not sure if this data exists on wikidata?

@MansMeg @BobBorges of course there is such a list!: https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_svenska_riksdagar

I also think it would be good to have a short chair status presentation on Friday when Jan Teorell is there as well. The Friday meeting will most likely revolve around chair and party.

MansMeg commented 12 months ago

Of course there is such a list. :)

BobBorges commented 12 months ago

I've got the list and am making a metadata file out of it. @MansMeg I didn't want to delete data, but indeed out-of-range chair nrs should not be allowed in the file.

ninpnin commented 12 months ago

Should this PR include the code used for data processing?

BobBorges commented 12 months ago

> Should this PR include the code used for data processing?

I'm happy to add whatever code if the rest of you want to have it. I didn't already put it in the PR b/c I guess once the metadata files are in the corpus, we probably won't reuse that code again.

ninpnin commented 12 months ago

> we probably won't reuse that code again

That sounds like famous last words..

But if Måns agrees, let's just have the PR without the code.

BobBorges commented 11 months ago

> famous last words

Code is not deleted, just not committed to the repo :D

MansMeg commented 11 months ago

Long term we don't want the code in this repo anyway (data is the state, not the code).

BobBorges commented 11 months ago

Today I sent the chairs project

a list of empty rows (chair nr/chamber/year, but no name) to fill in manually
a file showing chairs out of sequence or out of range to add/remove from data
and a list of junk values in their party column to indicate what should be done with these values

I can add/edit data that will be returned to me and try again to match unmatched names to wiki-ids.

With matched name-wiki_ids, we can triangulate wikidata MP metadata with their project data to both verify data and potentially fill in gaps in either project's data set.

BobBorges commented 11 months ago

Part of the discussion in old duplicate issues was about helping the chairs project with extracting data. It fell through the cracks a little, but I've been looking at it today. OCR made a mess of the tables in some of the documents, so there's no easy way to do it. I don't know what @JoeNoonan already tried, but there does seem to be some pattern to that mess. So depending on how long Lirre (someone please tag him) thinks manually generating 1800+ rows will take, maybe we could try mixing a bit of code and manual work to wrangle up the missing data faster.

BobBorges commented 11 months ago

I've now formatted the chairs.csv and chair_mp.csv files according to our ideas about metadata and added 6 unit tests about the metadata integrity -- one will still fail because there are 97 wiki IDs in two chairs at once. These obviously need to get sorted out somehow -- I think I will just add it to my list of things to do manually. I looked at a handful of them and the reasons may be:

wiki_id in ak and fk in the same year (the one I looked up in the bio book was only in one chamber -- is it possible to be in both at the same time?)
name -- wiki_id mismatch: in one case it was the same surname and first initial, matched to one ID, but actually different people
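For reference, a check for a wiki ID sitting in two chairs at once could look roughly like this. It is a sketch and not the unit test in the PR: it assumes chair_mp rows carry chair_id, wiki_id, and ISO-formatted start and end dates, and it treats date ranges as closed intervals, so a chair that ends on the day another begins is flagged. The name `overlapping_chairs` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations


def overlapping_chairs(rows):
    """Return wiki_ids that hold two different chairs with overlapping dates."""
    by_person = defaultdict(list)
    for r in rows:
        by_person[r["wiki_id"]].append(
            (date.fromisoformat(r["start"]), date.fromisoformat(r["end"]), r["chair_id"])
        )
    flagged = set()
    for wiki_id, spans in by_person.items():
        # Compare every pair of chair assignments for this person.
        for (s1, e1, c1), (s2, e2, c2) in combinations(spans, 2):
            if c1 != c2 and s1 <= e2 and s2 <= e1:
                flagged.add(wiki_id)
    return sorted(flagged)


# Made-up rows; IDs and dates are illustrative only.
sample_rows = [
    {"wiki_id": "Q1", "chair_id": "a", "start": "1920-01-10", "end": "1920-06-20"},
    {"wiki_id": "Q1", "chair_id": "b", "start": "1920-03-01", "end": "1920-06-20"},
    {"wiki_id": "Q2", "chair_id": "c", "start": "1920-01-10", "end": "1920-06-20"},
    {"wiki_id": "Q2", "chair_id": "d", "start": "1921-01-11", "end": "1921-06-25"},
]
conflicts = overlapping_chairs(sample_rows)
```

A check like this only reports that a conflict exists; deciding whether it comes from a same-year ak/fk entry or from a wrong name match still needs manual inspection.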

There's no unit test for coverage, because we aren't up to speed with that yet. And I could add some other triangulation tests, like we discussed elsewhere -- the person is associated with a chair in the same general time frame (year) as their mandate.
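The triangulation idea (a person's chair falls in the same general time frame as their mandate) could be expressed as a small predicate like the one below. This is only a sketch: the function name, the year-level granularity, and the one-year `slack` default are assumptions for illustration.

```python
def chair_within_mandate(parliament_year, mandate_start_year, mandate_end_year, slack=1):
    """True if the chair's parliament year falls inside the mandate period.

    `slack` widens the window by that many years on each side, since
    start and end dates are often inferred rather than exact.
    """
    return mandate_start_year - slack <= parliament_year <= mandate_end_year + slack


# Illustrative values only.
inside = chair_within_mandate(1920, 1918, 1924)
outside = chair_within_mandate(1930, 1918, 1924)
```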

@Lirre @JoeNoonan I still have the aggregated chair/chair_mp file if you need that.

MansMeg commented 11 months ago

If you have a lot to do, you can also just create a google sheet that @fredrik1984 or @salgo60 could work with re the 97. I have the impression that they both like this type of bug hunting. =)

BobBorges commented 10 months ago

I checked the names in the original chairs data against the wiki_ids Lirre filled in, and there are a dozen that I think need double checking.

As an example:

The top row is the wiki_id + name from Lirre's work, and the table immediately below lists all the name variants for that wiki_id in our metadata.

@SimonHallen

problems_LKv3.txt

salgo60 commented 10 months ago

I guess it would add value if all those spellings of a person's name could be added as aliases in WD.

In WD we have chairs, see below. I am not sure a chair in "Riksdagen" is worth tracking for WD...

seat 2 of the Swedish Academy Q96600293
- we then use position held P39 - SPARQL
seat 2 of the Académie française Q70495940 - SPARQL

These chairs probably have a different meaning than a seat in Riksdagen (the Swedish Parliament), but it's still a chair 😄

MansMeg commented 10 months ago

Well spotted! @SimonHallen, I think you need to double check the wiki_ids in @BobBorges's file.

SimonHallen commented 10 months ago

I will take a look at it!

BobBorges commented 8 months ago

@MansMeg I can't see where to mark that your requested changes have been implemented, but when tests pass here, this will be ready.

MansMeg commented 8 months ago

Great! As long as the tests pass I'm happy.

MansMeg commented 8 months ago

Just a question: I can't see exactly which files have been changed because there are so many of them. I would not expect changes in this many files. Why is this the case?

BobBorges commented 8 months ago

I needed the stuff I did in the SWERIK-ID branch, so I pulled from there.

MansMeg commented 8 months ago

Ah. Of course. Also this still fails?

BobBorges commented 8 months ago

sorry :|

welfare-state-analytics / riksdagen-corpus