welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

MP chairs metadata mismatch errors #450

Open BobBorges opened 5 months ago

BobBorges commented 5 months ago


We have been collaborating with an external research project that matches MPs to the chairs they were sitting in during Riksdag sessions -- very generally we want to merge their data set with the SWERIK data. Doing so gives us another way to quality control SWERIK data and gives them the possibility to access a more robust metadata set for each person.

The last round of attempted merging resulted in some 1200 rows that didn't cooperate -- after looking at some random rows, it usually means an error in the start date, end date, or chamber assigned to the particular person's mandate in either the chairs data set or the SWERIK data set.

the task

We need to check these cases to make sure that the matching errors do not originate in the SWERIK / wikidata side.

For each row in the attached file of chairs data, find the person (by swerik_id) in the SWERIK data / wikidata, observe the difference (why the row didn't match), and if the error is on the SWERIK/wikidata side, correct the error on wikidata.

Keep a note of changes you make on wikidata (we want to make sure these changes don't get reverted) and indicate rows where the error is in the chairs data (so we can tell the people in the other project).
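
A minimal sketch of the per-row comparison described above, in Python. The record type, its field names (`start`, `end`, `chamber`), the `i-example` identifier and the chamber code are all assumptions made for illustration; the actual columns in the chairs file and in the SWERIK metadata may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mandate:
    # Hypothetical fields; not the actual SWERIK or chairs schema.
    swerik_id: str
    start: date
    end: date
    chamber: str


def diff_mandate(chairs: Mandate, swerik: Mandate) -> list[str]:
    """Return the names of the fields on which the two records disagree."""
    if chairs.swerik_id != swerik.swerik_id:
        raise ValueError("records refer to different people")
    return [
        name
        for name in ("start", "end", "chamber")
        if getattr(chairs, name) != getattr(swerik, name)
    ]


# Made-up example rows: same person and chamber, dates a century apart.
chairs_row = Mandate("i-example", date(1963, 1, 1), date(1966, 12, 31), "fk")
swerik_row = Mandate("i-example", date(1863, 1, 1), date(1866, 12, 31), "fk")
print(diff_mandate(chairs_row, swerik_row))
```

Listing the disagreeing fields per row only shows where the two sources differ; deciding which side is wrong still needs a manual look.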


LaurineMir commented 5 months ago

I can do it!

salgo60 commented 5 months ago

I used OpenRefine to match the list with Wikidata and fetched the dates of birth and death, which can indicate problems.
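
As a rough illustration of how birth and death dates can flag suspicious rows, here is a small Python sketch. The function name and its parameters are hypothetical and not part of OpenRefine, Wikidata or the SWERIK tooling. It only checks whether a mandate period falls outside the person's known lifespan.

```python
from datetime import date
from typing import Optional


def mandate_outside_lifespan(
    start: date,
    end: date,
    born: Optional[date],
    died: Optional[date],
) -> bool:
    """Return True if the mandate starts before birth or ends after death.

    Unknown birth or death dates (None) are skipped rather than flagged.
    """
    if born is not None and start < born:
        return True
    if died is not None and end > died:
        return True
    return False
```

A flagged row does not say which data set is wrong. It only marks the row as worth checking by hand.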


issues seen with Riksdagens data

Maybe she had a chair because she was part of the Palme I Cabinet (Q10650456), which is not mentioned in Riksdagen's data.


Lesson learned: it is important that all projects use persistent identifiers and "same as" links. It feels like the same work is now being done again.

BobBorges commented 5 months ago

Thanks @salgo60. In this case the data we were trying to integrate came from a project that wasn't using wiki IDs or other identifiers at all, so we aren't doing anything twice. The csv file in the issue was my attempt to match the names in their data to ours ... we matched around 80% of the names in an automated way, and this file contains the people who we now think aren't matched correctly, for precisely the reason you point out. We expect that either (a) the attached list is due to errors in matching the names, (b) the data set we're trying to incorporate has OCR errors in the dates and 1963 is actually 1863, or (c) there are some other unknown issues going on. This is why @LaurineMir is looking at these cases carefully.
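
For case (b), a quick Python sketch of how such rows could be singled out: two dates that agree on month and day but differ by a whole number of centuries. The function name is hypothetical, and treating this pattern as an OCR error is an assumption that still needs a manual check.

```python
from datetime import date


def is_century_shift(a: date, b: date) -> bool:
    """Return True if a and b share month and day and differ by a multiple of 100 years."""
    return (
        a != b
        and (a.month, a.day) == (b.month, b.day)
        and abs(a.year - b.year) % 100 == 0
    )
```

With this, `is_century_shift(date(1963, 3, 1), date(1863, 3, 1))` is true, while dates one year apart or identical dates are not flagged.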

salgo60 commented 5 months ago

In this case the data we were trying to integrate came from a project that wasn't using wiki IDs or other identifiers at all, so we aren't doing anything twice.

I understand. The problem I am trying to point out is the cost of not using persistent identifiers from day 1, i.e. you should have started with persistent identifiers from day 1, with the other project also delivering 5-star data using your identifiers.

Tomorrow I will be at Sörmlands museum for a session called "Kulturarvsforum 2024 - Vad får jag berätta om dig? Om GDPR och immateriella rättigheter i praktiken" (roughly: "What may I tell about you? On GDPR and intellectual property rights in practice").

My feeling is: if people are not referred to by a persistent identifier, how should we know which person we are talking about? ;-)

My point is that museums trying to do GDPR work correctly also need to deliver an ecosystem, with tools like error reporting on GitHub....

BobBorges commented 5 months ago

@salgo60 -- I agree with you 100%. Unfortunately the reality in humanities and social science is that many people simply are not aware of these issues -- it's not usually part of education and not commonly implemented in practice. Data management is getting better in general, but there's still a long way to go.

salgo60 commented 5 months ago

OT: @BobBorges have you seen this? https://partyfacts.herokuapp.com

It was mentioned at the CLARIN Café on ParlaMint, 30 January 2024, 14:00 - 16:00 (YouTube).

14:00 - 14:05 Opening and CLARIN 1-0-1 (Francesca Frontini, Member of the CLARIN Board of Directors)
14:05 - 14:15 Introduction to ParlaMint (Maciej Ogrodniczuk and Petya Osenova)
14:15 - 14:25 ParlaMint 4.0 corpora (Tomaž Erjavec)
14:25 - 14:30 Adding metadata (Katja Meden and Jure Skubic)
14:30 - 14:35 -ed version (Nikola Ljubešić and Taja Kuzman)
14:35 - 14:40 Semantic tagging (Paul Rayson)
14:40 - 14:45 Impact story from a Computational Linguistics point of view (Bojan Evkoski)
14:45 - 14:50 Talking War: Keeping the Past Alive in the Parliaments of former Yugoslavia (Michal Mochtak)
14:50 - 15:00 The Catalan ParlaMint corpus (Nuria Bel)
15:00 - 15:10 The Hungarian ParlaMint corpus (Noémi Ligeti-Nagy)
15:10 - 15:20 The Austrian ParlaMint corpus (Tanja Wissik and Hannes Pirker)
15:20 - 16:00 Q&A

another comment

Colleagues from Finland have been working on a knowledge graph of their politicians.

Another comment

Parlamint: https://www.clarin.eu/parlamint

LREC-COLING 2024 Workshop: https://www.clarin.eu/ParlaCLARIN-IV

Stay up to date with the cafés: https://www.clarin.eu/content/clarin-cafe

salgo60 commented 5 months ago

It is interesting that Finnish Yle uses Wikidata to organize its content, which makes that data easier for researchers to use. In Sweden we should try to convince other organisations to step up and add structure; see skuggbacklog. I tried speaking with Swedish SVT in 2020, as they closed down SVT Öppet arkiv (property proposal Dec 2018), which had a property in Wikidata, P6817, that is now maybe useless. It looks like they have changed it lately: see the example Olof Palme and the list of items with SWERIK IDs and SVT Play, most of them not working....

More info: Yle Wikidata

salgo60 commented 4 months ago

@MansMeg CLARIN Café and https://partyfacts.herokuapp.com/