welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Remove strange wikidata punctuation on location specifiers #220

Closed MansMeg closed 10 months ago

MansMeg commented 1 year ago

See for example: Q5885293,"Kråkered," (now fixed)

We should not include location specifiers with punctuation if the ordinary name exists (like Kråkered in this case).

see: https://github.com/welfare-state-analytics/riksdagen-corpus/blob/main/corpus/metadata/location_specifier.csv
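A minimal sketch of that rule, assuming each row is a `(wiki_id, location)` pair as in the example above. The function name `drop_punctuated_duplicates` and the set of trailing characters are hypothetical and not part of the corpus scripts:

```python
# Hypothetical helper: drop a punctuated location specifier only when the
# same wiki_id already has the clean form of the name.
TRAILING = " ,;"  # assumption: characters treated as stray trailing punctuation


def drop_punctuated_duplicates(rows):
    """rows: list of (wiki_id, location) tuples; returns the filtered list."""
    existing = set(rows)
    kept = []
    for wiki_id, location in rows:
        cleaned = location.rstrip(TRAILING)
        # Skip the row only if it had trailing punctuation AND the clean
        # variant exists for the same wiki_id.
        if cleaned != location and (wiki_id, cleaned) in existing:
            continue
        kept.append((wiki_id, location))
    return kept


rows = [("Q5885293", "Kråkered"), ("Q5885293", "Kråkered,")]
print(drop_punctuated_duplicates(rows))
```

A punctuated row is kept when no clean variant exists, so nothing is lost silently.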

ninpnin commented 1 year ago

This is a problem upstream: https://www.wikidata.org/wiki/Q5885293

MansMeg commented 1 year ago

Yes. The question is how to solve this. I guess there are things we would like to remove from our corpus that people might want to keep in wikidata, so there will not be a perfect alignment with wikidata. Maybe add a CSV of entries we exclude, and apply it in the script that updates from wikidata? Or do you have another solution?
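Roughly what I have in mind, as a sketch only. The column names `wiki_id` and `location`, the function names, and the inline CSV text are all assumptions for illustration and are not taken from the actual updating script:

```python
import csv
import io


def load_exclusions(csv_text):
    """Parse an exclusion CSV (assumed columns: wiki_id, location) into a set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["wiki_id"], row["location"]) for row in reader}


def apply_exclusions(fetched_rows, exclusions):
    """Drop rows fetched from wikidata that appear in the exclusion set."""
    return [
        row
        for row in fetched_rows
        if (row["wiki_id"], row["location"]) not in exclusions
    ]


# Hypothetical exclusion file content; a real file would live in the repo.
EXCLUSIONS_CSV = 'wiki_id,location\nQ5885293,"Kråkered,"\n'

fetched = [
    {"wiki_id": "Q5885293", "location": "Kråkered"},
    {"wiki_id": "Q5885293", "location": "Kråkered,"},
]
filtered = apply_exclusions(fetched, load_exclusions(EXCLUSIONS_CSV))
print(filtered)
```

The exclusions would then stay on our side only, and wikidata itself would not need to change.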

ninpnin commented 1 year ago

I mean, those misspellings could just be fixed on wikidata?

EDIT: AFAIK those additional commas don't introduce any errors into our corpus

MansMeg commented 1 year ago

No, I know. My point is that sooner or later we might end up with differences, but maybe not in the next couple of months. In that case, fixing this in wikidata is probably easiest.

BobBorges commented 1 year ago

They need to be edited on wikidata:

MansMeg commented 1 year ago

Ping @salgo60. Is this something you could take a pass on?

salgo60 commented 1 year ago

@MansMeg what problem did you find with Q117288109?


MansMeg commented 1 year ago

I think that one is actually a problem with how we grab the data. Here we use an alias that is incorrect. @BobBorges, right?

salgo60 commented 1 year ago

All checked; not all changed, as I didn't see a problem...


Off topic: I mentioned your project today as a model for how other organizations should work with their metadata.

BobBorges commented 10 months ago

Should be fixed now. The cause was trailing commas (now removed on wikidata) and alias/i-ort values in the format "surname-iort, firstname" (also fixed on wikidata). If this comes up again, we could write a unit test.
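Such a test could look something like the sketch below. It assumes the location specifier is in the second CSV column and only checks for trailing `,`, `;` or `:`. The function names and the inline sample are hypothetical, and a real test would read `location_specifier.csv` from the repo instead:

```python
import csv
import io
import re

# Assumption: a specifier ending in one of these characters is suspicious.
TRAILING_PUNCT = re.compile(r"[,;:]\s*$")


def find_suspicious_specifiers(csv_text):
    """Return rows whose second column ends in stray punctuation."""
    reader = csv.reader(io.StringIO(csv_text))
    return [
        row for row in reader
        if len(row) >= 2 and TRAILING_PUNCT.search(row[1])
    ]


def test_no_trailing_punctuation():
    # Hypothetical sample standing in for the real metadata file.
    sample = "Q5885293,Kråkered\n"
    assert find_suspicious_specifiers(sample) == []


test_no_trailing_punctuation()
```

This only catches the trailing-comma case. The "surname-iort, firstname" alias format would need its own check.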