welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

adding metadata related to chairs #347

Closed BobBorges closed 8 months ago

BobBorges commented 1 year ago

Here comes some metadata related to chairs that I've been working on. What's here:

files

summary stats

party coverage

Edit: updated numbers after bug fix.

BobBorges commented 1 year ago

In order to best decide where to start improving the chairs/party coverage, I'm attaching a summary stats json file where you can look at coverage of each year/chamber (delete .txt from the extension -- github doesn't allow json :| ).

metafile_eval.json.txt


Edit: new eval file -- metafile_eval.json.txt

BobBorges commented 1 year ago

dates

Start and end dates are still an issue here. The chairs data came with "parliament_year" as a column, and like our other metadata, the actual start & end dates often need to be inferred. It would be good, I think, if we used this as an impetus to create a metadata file with something like parliament_year, start, end, which would be useful for the chairs and other parts of our work.

For those names that we now have a corresponding wiki_id, I can try to pull start/end dates from the metadata if it is deemed necessary.
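A minimal sketch of how such a parliament_year file could be used to infer missing dates. The column names follow the suggestion above; the sample rows, the `chair_id` field, and the helper names are hypothetical, and the dates are placeholders rather than real session dates.

```python
import csv
import io
from datetime import date

# Hypothetical contents of a parliament_year metadata file.
# The dates are placeholders for illustration, not real session dates.
SAMPLE = """parliament_year,start,end
1990/91,1990-10-02,1991-06-15
1991/92,1991-10-01,1992-06-12
"""


def load_parliament_years(handle):
    """Map each parliament_year to a (start, end) pair of dates."""
    years = {}
    for row in csv.DictReader(handle):
        years[row["parliament_year"]] = (
            date.fromisoformat(row["start"]),
            date.fromisoformat(row["end"]),
        )
    return years


def fill_dates(record, years):
    """Return a copy of record with empty start/end filled from its parliament_year."""
    start, end = years[record["parliament_year"]]
    filled = dict(record)
    filled["start"] = record.get("start") or start.isoformat()
    filled["end"] = record.get("end") or end.isoformat()
    return filled


years = load_parliament_years(io.StringIO(SAMPLE))
row = fill_dates(
    {"chair_id": "c1", "parliament_year": "1990/91", "start": "", "end": ""},
    years,
)
print(row["start"], row["end"])
```

Dates already present in a record are kept, so the lookup only fills gaps and never overrides more precise metadata.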

BobBorges commented 1 year ago

party coverage

There's some junk in the chairs data party columns. I could fairly easily remove/change it, which may also improve the matching, but I don't feel completely competent to decide what's complete junk and which junk might be recognizable as not-junk (e.g. in the case of an ocr error).

party_counts.json.txt
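One possible way to separate likely OCR errors from plain junk is to count the values and fuzzy-match the unknown ones against a list of accepted party names. This is only a sketch: the `KNOWN` list, the cutoff of 0.8, and the function name are assumptions, and the real party vocabulary would have to come from the corpus metadata.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical list of accepted party names; the real list would come
# from the corpus metadata.
KNOWN = ["Socialdemokraterna", "Moderaterna", "Centerpartiet"]


def triage_party_values(values, known, cutoff=0.8):
    """Count party values and label each as known, maybe (near a known name), or junk."""
    counts = Counter(v.strip() for v in values)
    report = {}
    for value, n in counts.items():
        if value in known:
            report[value] = (n, "known", value)
            continue
        # A close match suggests an OCR error rather than pure junk.
        guess = get_close_matches(value, known, n=1, cutoff=cutoff)
        if guess:
            report[value] = (n, "maybe", guess[0])
        else:
            report[value] = (n, "junk", None)
    return report


values = ["Moderaterna", "Moderaterna ", "Socialdemokratema", "1300"]
report = triage_party_values(values, KNOWN)
for value, (n, status, suggestion) in report.items():
    print(value, n, status, suggestion)
```

Values labelled "maybe" would still need a human decision; the sketch only narrows down which entries are worth looking at.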

MansMeg commented 1 year ago

This looks great Bob!

Some quick comments. We should add a test that we have all chairs and no incorrect chairs in chairs.csv. I saw a value of 1300ish as a chair, which is obviously incorrect, so I suggest we remove such entries.
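A range check along these lines could look roughly like the sketch below. The `chair_nr` column name and the helper are hypothetical, and the limit of 349 is only the number of seats in the present-day Riksdag; historical chambers had other sizes, so a real test would need a limit per chamber and period.

```python
def out_of_range_chairs(rows, max_chair):
    """Return the rows whose chair number is not an integer in 1..max_chair."""
    bad = []
    for row in rows:
        try:
            nr = int(row["chair_nr"])
        except (KeyError, TypeError, ValueError):
            # Missing or non-numeric chair numbers are also reported.
            bad.append(row)
            continue
        if not 1 <= nr <= max_chair:
            bad.append(row)
    return bad


# Hypothetical rows; "chair_nr" is an assumed column name.
rows = [{"chair_nr": "12"}, {"chair_nr": "1300"}, {"chair_nr": "x7"}]
bad = out_of_range_chairs(rows, max_chair=349)
print([row["chair_nr"] for row in bad])
```

In a unit test the assertion would simply be that the returned list is empty for the committed chairs.csv.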

I think your idea of creating a parliament_year.csv metadata file makes a lot of sense. Should we do it in this PR or separately? I'm not sure if this data exists on Wikidata?

Also, this is a lot of info. Maybe you could do a presentation about this on Friday?

MansMeg commented 1 year ago

My general thought is that we can try to see how far we can come with computational means. Then the vote protocol project can fill in the gaps when the general structure is set.

fredrik1984 commented 12 months ago

I think your idea of creating a parliament_year.csv metadata file makes a lot of sense. Should we do it in this PR or separately? I'm not sure if this data exists on Wikidata?

@MansMeg @BobBorges of course there is such a list!: https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_svenska_riksdagar

I also think it would be good to have a short chair status presentation on Friday, when Jan Teorell is there as well. The Friday meeting will most likely center on chairs and party.

MansMeg commented 12 months ago

Of course there is such a list. :)

BobBorges commented 12 months ago

I've got the list and am making a metadata file out of it. @MansMeg I didn't want to delete data, but indeed out-of-range chair nrs should not be allowed in the file.

ninpnin commented 12 months ago

Should this PR include the code used for data processing?

BobBorges commented 12 months ago

Should this PR include the code used for data processing?

I'm happy to add whatever code if the rest of you want to have it. I didn't already put it in the PR b/c I guess once the metadata files are in the corpus, we probably won't reuse that code again.

ninpnin commented 12 months ago

we probably won't reuse that code again

That sounds like famous last words..

But if Måns agrees, let's just have the PR without the code.

BobBorges commented 11 months ago

famous last words

Code is not deleted, just not committed to the repo :D

MansMeg commented 11 months ago

Long term we don't want the code in this repo anyway (data is the state, not the code).

BobBorges commented 11 months ago

Today I sent the chairs project

I can add/edit data that will be returned to me and try again to match unmatched names to wiki-ids.

With matched name-wiki_ids, we can triangulate wikidata MP metadata with their project data to both verify data and potentially fill in gaps in either project's data set.

BobBorges commented 11 months ago

Part of the discussion in old duplicate issues was about helping the chairs project with extracting data. It fell through the cracks a little, but I've been looking at it today. OCR made a mess of the tables in some of the documents, so there's no easy way to do it - I don't know what @JoeNoonan already tried, but there does seem to be some pattern to that mess. So depending on how long Lirre (someone please tag him) thinks manually generating 1800+ rows will take, maybe we could try mixing a bit of code and manual work to wrangle up the missing data faster.

BobBorges commented 11 months ago

I've now formatted the chairs.csv and chair_mp.csv files according to our ideas about metadata and added 6 unit tests about the metadata integrity -- one will still fail because there are 97 wiki IDs in two chairs at once. These obviously need to get sorted out somehow -- I think I will just add it to my list of things to do manually. I looked at a handful of them and the reasons may be:

There's no unit test for coverage, because we aren't up to speed with that yet. And I could add some other triangulation tests, like we discussed elsewhere -- the person is associated with a chair in the same general time frame (year) as their mandate.
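A check for one person sitting in two chairs at once could be sketched as below. The column names (`wiki_id`, `chair_id`, `start`, `end`) and the sample rows are assumptions, not the actual layout of chair_mp.csv, and the intervals are treated as closed, so leaving one chair and taking another on the same day counts as an overlap.

```python
from collections import defaultdict
from datetime import date


def overlapping_chairs(rows):
    """Return (wiki_id, chair_a, chair_b) for different chairs held at overlapping times."""
    by_person = defaultdict(list)
    for row in rows:
        by_person[row["wiki_id"]].append(
            (
                date.fromisoformat(row["start"]),
                date.fromisoformat(row["end"]),
                row["chair_id"],
            )
        )
    conflicts = []
    for wiki_id, spans in by_person.items():
        spans.sort()  # order by start date
        for i, (start_a, end_a, chair_a) in enumerate(spans):
            for start_b, end_b, chair_b in spans[i + 1:]:
                if start_b > end_a:
                    break  # sorted by start, so no later span overlaps span a
                if chair_a != chair_b:
                    conflicts.append((wiki_id, chair_a, chair_b))
    return conflicts


# Hypothetical rows for illustration only.
rows = [
    {"wiki_id": "Q1", "chair_id": "a", "start": "1990-01-01", "end": "1990-12-31"},
    {"wiki_id": "Q1", "chair_id": "b", "start": "1990-06-01", "end": "1991-05-31"},
    {"wiki_id": "Q2", "chair_id": "a", "start": "1991-01-01", "end": "1991-12-31"},
    {"wiki_id": "Q1", "chair_id": "c", "start": "1992-01-01", "end": "1992-12-31"},
]
conflicts = overlapping_chairs(rows)
print(conflicts)
```

The same grouping by `wiki_id` could serve as a starting point for the triangulation idea, by comparing each chair interval with the person's mandate period instead of with their other chairs.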

@Lirre @JoeNoonan I still have the aggregated chair/chair_mp file if you need that.

MansMeg commented 11 months ago

If you have a lot to do, you can also just create a google sheet that @fredrik1984 or @salgo60 could work with re the 97. I have the impression that they both like this type of bug hunting. =)

BobBorges commented 10 months ago

I checked the names in the original chairs data against the wiki_id's Lirre filled in, and there are a dozen that I think need double checking.

As an example: image

The top row is the wiki_id + name from Lirre's work and the table immediately below are all the name variants for that wiki_id in our metadata.

@SimonHallen

problems_LKv3.txt

salgo60 commented 10 months ago

I guess it would add value if all those spellings of a person's name could be added as aliases in WD.

In WD we have chairs, see below. I am not sure a chair in "Riksdagen" is worth tracking for WD....

image

These chairs probably have a different meaning than a seat in the Riksdagen (the Swedish Parliament), but it's still a chair 😄

MansMeg commented 10 months ago

Well spotted! @SimonHallen , I think you need to double check the wikiids in @BobBorges file.

SimonHallen commented 10 months ago

I will take a look at it!

BobBorges commented 8 months ago

@MansMeg I can't see where to mark that your requested changes have been implemented, but when tests pass here, this will be ready.

MansMeg commented 8 months ago

Great! As long as the tests pass I'm happy.

MansMeg commented 8 months ago

Just a question: I can't see exactly which files have been changed because there are so many. I would not expect changes in this many files. Why is this the case?

BobBorges commented 8 months ago

I needed the stuff I did in the SWERIK-ID branch, so I pulled from there.

MansMeg commented 8 months ago

Ah. Of course. Also this still fails?

BobBorges commented 8 months ago

sorry :| image