welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Problems with metadata updates from wikidata #345

Closed MansMeg closed 11 months ago

MansMeg commented 1 year ago

In the discussion of Pull Request #344 we identified four different problems

  1. Incorrect updates used in the mapping algorithm. The unit tests caught incorrect revisions that Wikidata users had made to iort values (correct iorts had been changed). However, the mapping algorithm had already been run on the incorrect data. Running some of the tests before the mapping algorithm might be a solution.
  2. New "duplicate" people are being added to Wikidata but are not merged, causing errors in the mapping algorithm. Wikidata users can add new people at any time, so when we do the metadata update there can be new duplicates (the same person with multiple Wikidata entries). A solution is to list these potential duplicates (new names the mapping algorithm will confuse), together with new names/persons the algorithm finds difficult, and check them quickly before we run the algorithm.

A potential solution is to structure the metadata updates more than we currently do to capture potential problems more efficiently.
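As a rough illustration of listing potential duplicates before the mapping algorithm runs, here is a minimal sketch. The column names `name` and `wiki_id`, the helper `find_potential_duplicates`, and the sample rows are all assumptions made for the example, not the project's actual schema or code.

```python
import csv
import io
from collections import defaultdict


def find_potential_duplicates(rows, name_field="name", id_field="wiki_id"):
    """Return {normalized name: sorted wiki_ids} for names tied to more than one wiki_id."""
    ids_by_name = defaultdict(set)
    for row in rows:
        # Normalize case and whitespace so trivial variants group together.
        key = " ".join(row[name_field].lower().split())
        ids_by_name[key].add(row[id_field])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Made-up sample standing in for one metadata CSV file (IDs are placeholders).
sample = io.StringIO(
    "wiki_id,name\n"
    "Q-demo-1,Anders Andersson\n"
    "Q-demo-2,Anders  andersson\n"
    "Q-demo-3,Karin Karlsson\n"
)
duplicates = find_potential_duplicates(csv.DictReader(sample))
print(duplicates)
```

A list like this would only be a starting point for a quick manual check, since two different people can legitimately share a name.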

BobBorges commented 1 year ago

Re 2, they're not new people, but new attributes or whatever that cause more rows to be created in the input/matching/*.csv files.
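One way to spot this kind of row growth might be to compare how many rows each wiki_id has before and after the requery. The sketch below assumes the wiki_id column has already been read from the old and new versions of a file; `rows_gained` and the sample IDs are hypothetical.

```python
from collections import Counter


def rows_gained(old_ids, new_ids):
    """Return {wiki_id: number of extra rows} for IDs with more rows after the update."""
    old_counts = Counter(old_ids)
    new_counts = Counter(new_ids)
    return {
        wiki_id: new_counts[wiki_id] - old_counts[wiki_id]
        for wiki_id in new_counts
        if new_counts[wiki_id] > old_counts[wiki_id]
    }


# Made-up wiki_id columns from the same file before and after a requery.
before = ["Q-demo-1", "Q-demo-2", "Q-demo-3"]
after = ["Q-demo-1", "Q-demo-2", "Q-demo-2", "Q-demo-3", "Q-demo-4"]
gained = rows_gained(before, after)
print(gained)
```

An ID that was already present but gained rows points to new attributes, while an ID that is entirely new points to a new entry.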

BobBorges commented 1 year ago

I will try to outline a procedure for updates from wikidata before the next time we do it, hopefully to avoid some of the trouble we ran into this time.

salgo60 commented 1 year ago

Let me know if WD has errors

Add sources when changing values

I saw some earlier edits done by the project in Wikidata without sources....

alias vs. Name

In WD we can just have sources on properties e.g. change Q5792849 1940037453 on the Name property


My personal opinion is that the alias field should be used when doing Named Entity Recognition and can contain "all kinds" of information, whereas the Name property P2561 is where we should have sources confirming the values.

BobBorges commented 1 year ago

In this particular case, it was two individuals in question. My previous edits (with source) were further edited. I put the changes back yesterday.

The edits in question have to do with apparent spelling variants of iort. I don't know myself what the correct variant is, my edits are in line with the spelling in the bio books. If there are sources for the other spelling, then I guess both variants should be on wikidata.

MansMeg commented 1 year ago

Yes. But the spelling in the biobooks should be the one that is used when the reference is the biobooks. Right, @salgo60 ?

BobBorges commented 1 year ago

That's what i was trying to say - my edits have bio book sources and spelling. If alt spellings will also be entered, they should get their own source.

salgo60 commented 1 year ago

> Yes. But the spelling in the biobooks should be the one that is used when the reference is the biobooks. Right, @salgo60 ?

Yes as mentioned before

Volume 1 page 436 - skånska p / centern


Volume 2 page 158 - centern


Volume 4 page 92 - skånska p / AK:s center


What would be interesting is if we could confirm what is stated in the books against where it is mentioned in your corpus, and get better understanding/quality by adding Property:P4584 "first appearance" based on your corpus

  1. [ ] that every unique combination of person and iort gets a unique persistent identifier in your corpus
  2. [ ] that in the Swedish corpus you also track in the TEI code when it is used, and include the persistent identifier in the TEI
  3. [ ] "we" in Wikidata can say that the iort name is the same as Riksdagen-corpus xxx
  4. [ ] we could start tracking when each unique iort is first and last used, based on your corpus
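To make the checklist concrete, here is one possible sketch of stable person/iort identifiers with first and last use. Deriving the identifier with `uuid.uuid5` is a technique chosen for the example, not something the project has decided on; the namespace URL, function names, and sample mentions are all hypothetical.

```python
import uuid

# Hypothetical namespace; a real project would choose and publish its own.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/person-iort")


def person_iort_id(wiki_id, iort):
    """Derive a stable identifier from a person ID and an iort string."""
    return str(uuid.uuid5(NAMESPACE, f"{wiki_id}|{iort.strip().lower()}"))


def first_and_last_use(mentions):
    """Map each identifier to the (earliest, latest) ISO date on which it was seen."""
    spans = {}
    for wiki_id, iort, date in mentions:
        key = person_iort_id(wiki_id, iort)
        first, last = spans.get(key, (date, date))
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        spans[key] = (min(first, date), max(last, date))
    return spans


# Made-up mentions: (person ID, iort, protocol date).
mentions = [
    ("Q-demo-1", "i Lund", "1912-03-01"),
    ("Q-demo-1", "i Lund", "1910-01-15"),
    ("Q-demo-1", "i Malmö", "1915-05-20"),
]
spans = first_and_last_use(mentions)
print(spans[person_iort_id("Q-demo-1", "i Lund")])
```

Note that with this approach two spelling variants of the same iort would get different identifiers, so variants would still need to be linked explicitly.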


I would also like to see in your data

  1. [ ] persistent unique identifiers for every source used, e.g. the books "Tvåkammar-riksdagen 1867–1970" should have a unique persistent identifier for every book

Examples when "Tvåkammar-riksdagen 1867–1970" is wrong

My suggestion: step up and use sources and persistent identifiers.

BobBorges commented 11 months ago

I like the idea of persistent identifiers. Until then, I think we can solve (close) this issue with a metadata update procedure.

  1. start a fresh branch off dev

  2. requery metadata: `scripts/wikidata_query.py` and `scripts/wikidata_process.py`

  3. run test.db.py locally

    • will find changed wiki_id
    • will find edits that conflict with our unit test files (someone edits / deletes iort from wikidata)

    ---> update wiki_ids in unit test files (I will write a script to do this efficiently)
    ---> address edits on wikidata

  4. repeat 2 and 3 until test.db.py passes

  5. redetect.py to remap speakers to intros in protocols

  6. run test.mp.py locally (other tests?)

    • ensure that everything looks like it works (no stray wiki IDs that aren't in metadata or whatever)
  7. save diff to (an untracked) file

    • this helped me, when I could search the whole diff to give good answers to those looking at the PR
  8. sample-git-dif on protocols

    • mk markdown
  9. git add ONLY the sampled protocols, commit / push, open PR, post the markdown

    ---> unit tests will still fail on remote: that is OK

  10. when sampled diffs look ok

    • add / commit / push the rest of the protocols

    • unit tests should pass on remote --> merge

The issues last time around would have been spotted and fixed very quickly if I were following this as a guide.
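For the wiki_id update mentioned under step 3, a script could look roughly like the sketch below. It assumes the unit test files are CSVs with a `wiki_id` column and that a mapping from old to new IDs is already known; `remap_wiki_ids`, the column names, and the sample data are hypothetical, and this is not the script referred to in the list.

```python
import csv
import io


def remap_wiki_ids(csv_text, id_map, id_field="wiki_id"):
    """Return csv_text with each old wiki_id in id_field replaced according to id_map."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # IDs that are not in the mapping are left unchanged.
        row[id_field] = id_map.get(row[id_field], row[id_field])
        writer.writerow(row)
    return out.getvalue()


# Made-up unit test file content and a made-up old -> new ID mapping.
original = "wiki_id,iort\nQ-demo-1,i Lund\nQ-demo-2,i Malmö\n"
updated = remap_wiki_ids(original, {"Q-demo-1": "Q-demo-9"})
print(updated)
```

Working on text in memory keeps the sketch easy to test; a real script would read and write the files and would probably report which IDs were changed.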

MansMeg commented 11 months ago

That sounds like a good solution. Maybe put this in the repo wiki for now?

salgo60 commented 11 months ago

FYI: We have a suspected duplicate in Wikidata. I have asked other people for a second opinion but have had no feedback yet.


I used Property:P460 "said to be the same as"


The sv:Wikipedia article is marked

BobBorges commented 11 months ago

> Maybe put this in the repo wiki for now?
