Problems with metadata updates from wikidata

MansMeg commented 1 year ago

In the discussion of Pull Request #344 we identified four different problems

1. Incorrect updates used in the mapping algorithm. Our unit tests caught incorrect revisions of iort made by Wikidata users (correct iorts had been changed). However, the mapping algorithm had already been run on the incorrect data. Running some of the tests before the mapping algorithm might be a solution.
2. New "duplicate" people are being added to Wikidata but are not merged, causing errors in the mapping algorithm. Wikidata users can continuously add new people. This means that when we do the metadata update there can be new duplicates (the same person as multiple Wikidata entries). A solution is to list these potential duplicates (new names the mapping algorithm will confuse) together with new names/persons the algorithm finds difficult, and check them quickly before we run the algorithm.

A potential solution is to structure the metadata updates more than we currently do to capture potential problems more efficiently.
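
A minimal sketch of the pre-mapping check suggested for the duplicate problem above, assuming the queried metadata can be read as rows with `wiki_id` and `name` columns. The column names, the function name and the sample rows are hypothetical and not taken from the repository. It lists names shared by more than one Wikidata entry so they can be reviewed before the mapping algorithm runs.

```python
import csv
import io
from collections import defaultdict


def potential_duplicates(rows, id_col="wiki_id", name_col="name"):
    """Return {normalised name: sorted wiki_ids} for names shared by several ids."""
    ids_by_name = defaultdict(set)
    for row in rows:
        # Normalise case and whitespace so trivial spelling variants collide.
        key = " ".join(row[name_col].lower().split())
        ids_by_name[key].add(row[id_col])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Made-up sample standing in for a freshly queried metadata file.
# The ids are placeholders, not real Wikidata items.
sample = io.StringIO(
    "wiki_id,name\n"
    "QAAA,Anna Andersson\n"
    "QBBB,Karl Berg\n"
    "QCCC,anna  andersson\n"
)
duplicates = potential_duplicates(csv.DictReader(sample))
print(duplicates)
```

Exact name matching is only a first filter; a real check would probably also compare birth years or use fuzzy matching, which is left out here.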

BobBorges commented 1 year ago

Re 2, they're not new people, but new attributes or whatever that cause more rows to be created in the input/matching/*.csv files.
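
One way to spot this could be to compare the number of rows per `wiki_id` before and after the requery. A sketch, assuming each row is a dict with a `wiki_id` key; the column names, function name and data are hypothetical.

```python
from collections import Counter


def ids_with_more_rows(old_rows, new_rows, id_col="wiki_id"):
    """Return {wiki_id: (old_count, new_count)} for known ids that gained rows."""
    old_counts = Counter(row[id_col] for row in old_rows)
    new_counts = Counter(row[id_col] for row in new_rows)
    return {
        wiki_id: (old_counts[wiki_id], count)
        for wiki_id, count in new_counts.items()
        # Only ids that already existed: these are not new people.
        if wiki_id in old_counts and count > old_counts[wiki_id]
    }


# Made-up rows with placeholder ids: QAAA gets a second row after the update.
old = [
    {"wiki_id": "QAAA", "iort": "Stockholm"},
    {"wiki_id": "QBBB", "iort": "Lund"},
]
new = old + [{"wiki_id": "QAAA", "iort": "Västerås"}]
grown = ids_with_more_rows(old, new)
print(grown)
```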

BobBorges commented 1 year ago

I will try to outline a procedure for updates from wikidata before the next time we do it, hopefully to avoid some of the trouble we ran into this time.

salgo60 commented 1 year ago

Let me know if WD has errors

Add sources when changing values

I saw some earlier edits done by the project in Wikidata without sources....

[ ] Let me know if we should have a short session where I show you how to add/copy the source to a statement
- Statements without a source are not preferred in Wikidata: the source should confirm what you find in WD, which makes Wikidata a little more trustworthy.

alias vs. Name

In WD we can just have sources on properties e.g. change Q5792849 1940037453 on the Name property

My personal opinion is that the alias field should be used when doing Named Entity Recognition and can contain all kinds of information, whereas the Name property P2561 should only have values that are confirmed by sources.

BobBorges commented 1 year ago

In this particular case, it was two individuals in question. My previous edits (with source) were further edited. I put the changes back yesterday.

The edits in question have to do with apparent spelling variants of iort. I don't know myself what the correct variant is, my edits are in line with the spelling in the bio books. If there are sources for the other spelling, then I guess both variants should be on wikidata.

MansMeg commented 1 year ago

Yes. But the spelling in the biobooks should be the one that is used when the reference is the biobooks. Right, @salgo60 ?

BobBorges commented 1 year ago

That's what I was trying to say: my edits have bio book sources and spelling. If alternative spellings are also entered, they should get their own source.

salgo60 commented 1 year ago

> Yes. But the spelling in the biobooks should be the one that is used when the reference is the biobooks. Right, @salgo60 ?

Yes as mentioned before

The book "Tvåkammar-riksdagen 1867–1970" quite often has several articles about the same person (more than 150 persons, see a small check).
- I guess the articles were published at different times, so the books themselves may have different "i riksdagen kallad" for the same person. Today I only reference a person once even if they appear in several articles, which I feel is wrong; see the example below.
  - We also have the book "Enkammarriksdagen 1971-1993/94" as a source. I don't know if anyone has checked whether the two books state the same facts. I guess not.
  - Riksdagen has a field "iort" that apparently can only store one value, and I feel the data is normally of bad quality, see #141.
- Example of a person with several articles: Q5795740 "Hederstierna i Stockholm senare Västerås" in books 1:436 | 2:158 | 4:92. Are they identical? In the example below we can see that they don't list the same parties, so we should add all books and see what facts they confirm. Here is a list of > 150 persons with several articles.
  - From the example below we can see that the naming of parties is not consistent.
    - 1:436 uses "skånska p" / "centern"
      - my try is
        
        skånska p = Q10671173
        
        centern = Q10411412 not same as Q10444846
    - 2:158 uses just "centern", not "skånska p"
    - 4:92 uses "skånska p" and "AK:s center"
    - It's very important that we get ONE list of the parties we think have existed, with unique persistent identifiers, and of which name strings refer to the same party.

Volume 1 page 436 - skånska p / centern

Volume 2 page 158 - centern

Volume 4 page 92 - skånska p / AK:s center
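
A sketch of what "one list of parties" could look like in code: every name string used in an article is resolved against a single lookup table, and strings without an identifier are reported. The two item ids are the tentative mappings from the comment above ("my try"), not verified ones; the function and variable names are hypothetical.

```python
# Party name strings per article (volume, page) for Q5795740, as listed above.
ARTICLE_PARTIES = {
    ("1", "436"): ["skånska p", "centern"],
    ("2", "158"): ["centern"],
    ("4", "92"): ["skånska p", "AK:s center"],
}

# Name string -> Wikidata item. Tentative mappings only; "AK:s center"
# has no identifier in the discussion, so it is deliberately left out.
PARTY_IDS = {
    "skånska p": "Q10671173",
    "centern": "Q10411412",
}


def resolve_parties(article_parties, party_ids):
    """Map name strings to ids per article and collect strings without an id."""
    resolved = {}
    unmapped = set()
    for article, names in article_parties.items():
        resolved[article] = [party_ids.get(name) for name in names]
        unmapped.update(name for name in names if name not in party_ids)
    return resolved, sorted(unmapped)


resolved, unmapped = resolve_parties(ARTICLE_PARTIES, PARTY_IDS)
print(resolved)
print(unmapped)
```

Strings that end up in `unmapped` are the ones that still need a decision on which party they refer to.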

There are also other sources for "i riksdagen kallad", such as the book "Enkammarriksdagen 1971-1993/94".
- Riksdagen's open data has a field that can only handle one value, see #141. I guess the quality is bad.

It would be interesting if we could check what the books state against where it is mentioned in your corpus, and get a better understanding/quality by adding Property:P4584 "first appearance" based on your corpus.

[ ] that every unique combination of person and iort gets a unique persistent identifier in your corpus
[ ] that in the Swedish corpus you also track in the TEI code when it is used, and have the persistent identifier in the TEI
[ ] "we" in Wikidata can say that the iort name is the same as Riksdagen-corpus xxx
[ ] we could start tracking when each unique iort is first and last used, based on your corpus
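
A minimal sketch of the first and last items in this list, assuming mentions can be extracted from the corpus as (person id, name as used in the Riksdag, date) tuples. The identifier scheme (a UUIDv5 under an example.org namespace), the function names and the sample data are all hypothetical; the thread does not specify how identifiers would be minted.

```python
import uuid
from datetime import date

# Hypothetical namespace; a real project would pick its own stable URL.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/riksdagen-corpus/iort")


def iort_pid(wiki_id, iort):
    """Derive a deterministic identifier for one person/iort combination."""
    return str(uuid.uuid5(NAMESPACE, f"{wiki_id}|{iort}"))


def first_last_use(mentions):
    """Return {pid: (first_date, last_date)} for (wiki_id, iort, date) mentions."""
    spans = {}
    for wiki_id, iort, day in mentions:
        pid = iort_pid(wiki_id, iort)
        first, last = spans.get(pid, (day, day))
        spans[pid] = (min(first, day), max(last, day))
    return spans


# Made-up mentions with a placeholder person id and invented dates.
mentions = [
    ("QAAA", "Berg i Stockholm", date(1901, 5, 2)),
    ("QAAA", "Berg i Stockholm", date(1900, 1, 15)),
    ("QAAA", "Berg i Västerås", date(1902, 3, 1)),
]
spans = first_last_use(mentions)
for pid, (first, last) in spans.items():
    print(pid, first, last)
```

Because the identifier is derived from the person id and the name string, rerunning the script yields the same identifiers, which is the property a persistent identifier needs.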

Sources

I would also like to see in your data

[ ] persistent unique identifiers for every source used, e.g. every book of "Tvåkammar-riksdagen 1867–1970" should have a unique persistent identifier

Examples when "Tvåkammar-riksdagen 1867–1970" is wrong

Today we had a discussion about Bertha Wellin Q4895524, and it looks like both "Tvåkammar-riksdagen 1867–1970" and SKBL are wrong, see Diskussion:Bertha_Wellin#WD-mallen
SPARQL deprecated statements - swedish - english
- see same problem with Riksarkivet SBL #35
  - SPARQL Wikidata

My suggestion: step up and use sources and persistent identifiers

see DIGG project diggsweden/persistent-identifiers-investigation
[ ] create persistent identifiers for all source articles you will reference, i.e. if a person has 2 articles, they should have 2 persistent unique identifiers
- SPA has scanned most of the articles, but when the person had no picture in the books I feel they didn't scan it.
[ ] every unique combination "i riksdagen kallad" and person should have a persistent identifier
[ ] all "parties" (including wrong parties mentioned in the books that you don't agree with) should have unique persistent identifiers
- use semantics to say same as, followed by P156, said to be the same as P460, merged into P7888
[ ] in the corpus, the TEI file should for every "i riksdagen kallad" have a "same as" pointing to the persistent identifier you have for that "i riksdagen kallad"
[ ] in Wikidata we should say "same as" your persistent identifiers for every "i riksdagen kallad"
[ ] preferred would be if you had landing pages that we could link
- [ ] every person
- [ ] every "i riksdagen kallad"
  - [ ] easy to see when it was used for the first and the last time in the corpus
  - [ ] also easy access to the corpus --> we could link it from sv:Wikipedia
[ ] that we have a 30-minute walkthrough of how Wikidata and its gadgets work, and how we use sources to say "this source confirms" or "in the source we read xxx" and translate that to e.g. death reason yyy, see death reasons
[ ] Wikidata related activities
- [ ] If you start using PIDs, we create a request for a Wikidata property for the welfare-state-analytics project
  - [ ] once you have PIDs created, we connect them to Wikidata
    - [ ] Wikidata Swedish PM <-> welfare-state-analytics
    - [ ] Wikidata "I Riksdagen kallad" <-> welfare-state-analytics
    - [ ] Wikidata "described by source" <-> welfare-state-analytics trusted sources you use, see #324
    - [ ] Wikidata "party" <-> welfare-state-analytics
    - [ ] Wikidata "position" e.g. Member of the First Chamber Q33071890 / Member of the Second Chamber Q81531912 / Member of the Riksdag of the Estates Q82697153 / member of the Swedish Riksdag Q10655178 <-> welfare-state-analytics
    - [ ] Wikidata "position minister" e.g. Minister of Trade Q10686108 list <-> welfare-state-analytics
      - wd:Q687075 Prime Minister of Sweden
      - wd:Q920108 Minister for Defence
      - wd:Q1749063 Minister for Foreign Affairs of Sweden
      - wd:Q3315958 Minister for Justice
      - wd:Q3612254 Minister for Energy
      - wd:Q4189744 Deputy Prime Minister of Sweden
      - wd:Q4806239 Minister for Nordic Cooperation
      - wd:Q6865819 Minister for EU Affairs
      - wd:Q6865835 Sweden's Minister for Finance
      - wd:Q6865858 Minister for Gender Equality Affairs
      - wd:Q6865890 Minister for Integration
      - wd:Q10430169 Deputy Minister of Employment
      - wd:Q10430171 Deputy Minister of Finance
      - wd:Q10430174 Deputy Minister of Justice
      - wd:Q10497411 Minister for Resource Management
      - wd:Q10541572 Prime Minister of Justice
      - wd:Q10547443 Minister of Municipalities
      - wd:Q10650489 Swedish Government Offices
      - wd:Q10670469 Minister of Primary Education
      - wd:Q10686032 Minister of Employment
      - wd:Q10686038 Minister for International Development Cooperation
      - wd:Q10686041 Minister for Housing
      - wd:Q10686046 Minister of Civil Affairs
      - wd:Q10686108 Minister of Trade
      - wd:Q10686169 Minister for Infrastructure
      - wd:Q10686171 Minister for Domestic Affairs
      - wd:Q10686194 Minister for Rural Affairs
      - wd:Q10686216 Minister for the Environment
      - wd:Q10686220 Minister for Enterprise
      - wd:Q10686247 Minister for Social Affairs
      - wd:Q10686263 Minister for Education and Science
      - wd:Q17103474 Minister for Financial Markets
      - wd:Q18176984 Minister for Culture of Sweden
      - wd:Q18183298 Minister for Migration
      - wd:Q18242897 Minister of Immigration
      - wd:Q18246311 Minister of Democracy
      - wd:Q18246318 Minister for Social Security
      - wd:Q18589764 Minister of Youth Affairs
      - wd:Q18589796 Minister for Consumer Affairs
      - wd:Q19975875 Minister for Information Technology
      - wd:Q19977291 Minister of Taxation
      - wd:Q26659741 Prime Minister for Foreign Affairs
      - wd:Q39074196 Minister for the Climate (Sweden)
      - wd:Q39074295 Minister for Government Coordination
      - wd:Q59392942 Minister of Communications
      - wd:Q83933521 Minister for Terrestrial Defence
      - wd:Q83933561 Minister for Naval Defence
      - wd:Q84566470 Minister of Ecclesiastics
      - wd:Q87748605 Minister of Sports
      - wd:Q93262421 Minister of Economic Affairs
      - wd:Q93270583 Minister of Budget Affairs
      - wd:Q95187493 Minister of Foreign Trade
      - wd:Q95187834 Minister of Health
      - wd:Q95983946 Minister for Higher Education and Science
      - wd:Q105359146 Vice Prime Minister of Sweden
      - wd:Q110820075 Minister of Agriculture
      - wd:Q114734692 Minister of Civil Defence
      - wd:Q114736354 Minister for the Elderly
      - wd:Q114736377 Q114736377
      - wd:Q118352089 Statsministerns statssekreterare
      - wd:Q120737220 Ministry for Rural Affairs
      - list all positions that people in WD in the First and Second Chamber have had: SPARQL - swedish
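
The linked query is not reproduced in this thread. As a rough sketch of what such a query could look like, the snippet below builds a SPARQL string for the Wikidata Query Service from the two chamber items named above, using P39 ("position held"). The function name and the exact query shape are assumptions and not the query referred to above; the snippet only builds the string and does not send it anywhere.

```python
# Chamber membership items as given in the list above.
CHAMBER_POSITIONS = {
    "Q33071890": "Member of the First Chamber",
    "Q81531912": "Member of the Second Chamber",
}


def positions_query(chamber_qids, languages="sv,en"):
    """Build a query listing every position held by members of the given chambers."""
    values = " ".join(f"wd:{qid}" for qid in chamber_qids)
    return f"""
SELECT DISTINCT ?position ?positionLabel WHERE {{
  VALUES ?chamber {{ {values} }}
  ?person p:P39/ps:P39 ?chamber .
  ?person p:P39/ps:P39 ?position .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{languages}". }}
}}
""".strip()


query = positions_query(CHAMBER_POSITIONS)
print(query)
```

The `p:P39/ps:P39` path is used instead of `wdt:P39` so that statements of every rank are included, which matters when some values have been deprecated.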

BobBorges commented 11 months ago

I like the idea of persistent identifiers. Until then, I think we can solve (close) this issue with a metadata update procedure.

start a fresh branch off dev
requery metadata: scripts/wikidata_query.py and scripts/wikidata_process.py
run test.db.py locally
- will find changed wiki_id ---> update wiki_ids in unit test files (I will write a script to do this efficiently)
- will find edits that conflict with our unit test files (someone edits / deletes iort from Wikidata) ---> address edits on Wikidata
repeat the requery and test.db.py steps until test.db.py passes
redetect.py to remap speakers to intros in protocols
run test.mp.py locally (other tests?)
- ensure that everything looks like it works (no stray wiki IDs that aren't in metadata or whatever)
save diff to (an untracked) file
- this helped me, when I could search the whole diff to give good answers to those looking at the PR
sample-git-dif on protocols
- mk markdown
git add ONLY the sampled protocols, commit / push, open PR, post markdown
- unit tests will still fail on remote: this is ok
when sampled diffs look ok
- add / commit / push rest of protocols
- unit tests should pass on remote --> merge
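
The first half of the procedure above (requery, test, fix, repeat) could be sketched roughly as follows. Both helpers are hypothetical: `remap_wiki_ids` stands in for the script mentioned for updating wiki_ids in the unit test files, and `update_until_green` takes the requery and test steps as callables, so the sketch makes no assumption about how the real scripts are invoked.

```python
def remap_wiki_ids(rows, id_map, id_col="wiki_id"):
    """Return copies of rows with changed wiki_ids replaced according to id_map."""
    return [{**row, id_col: id_map.get(row[id_col], row[id_col])} for row in rows]


def update_until_green(requery, run_db_tests, max_rounds=5):
    """Requery metadata and rerun the database tests until they pass.

    Returns the number of rounds needed and raises if max_rounds is exhausted.
    """
    for round_number in range(1, max_rounds + 1):
        requery()
        if run_db_tests():
            return round_number
        # In the real procedure, unit test files and Wikidata edits
        # would be fixed here before the next round.
    raise RuntimeError(f"database tests still failing after {max_rounds} rounds")


# Stubbed demonstration: the made-up test results fail twice, then pass.
calls = []
outcomes = iter([False, False, True])
rounds = update_until_green(
    requery=lambda: calls.append("requery"),
    run_db_tests=lambda: next(outcomes),
)
print(rounds)

# Placeholder ids: QOLD was changed to QNEW on Wikidata, QKEEP is untouched.
rows = [{"wiki_id": "QOLD", "iort": "Lund"}, {"wiki_id": "QKEEP", "iort": "Umeå"}]
remapped = remap_wiki_ids(rows, {"QOLD": "QNEW"})
print(remapped)
```

In practice the two callables would wrap the query scripts and test.db.py named in the procedure; the cap on rounds is only there so a persistent conflict on Wikidata stops the loop instead of running forever.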