MansMeg commented 1 year ago

The Wikidata people need to refer to our corpus (from version 1.0) as a reference for their data. Hence we should make our IDs persistent.

This would include:
- creating UUIDs for all CSV files
- creating a CSV file that maps Wikidata IDs to person IDs
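A minimal sketch of what such a mapping file could look like, assuming random (version 4) UUIDs and the hypothetical column names `wiki_id` and `person_id`; the QIDs in the usage example are placeholders, not real entries. Already assigned IDs are passed back in so that they are never regenerated.

```python
import csv
import io
import uuid
from typing import Dict, Iterable, Optional, TextIO


def build_mapping(wiki_ids: Iterable[str],
                  existing: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Assign a UUID to every Wikidata ID that does not have one yet."""
    mapping = dict(existing or {})
    for qid in wiki_ids:
        if qid not in mapping:
            # Only new entries get a fresh UUID; old ones stay untouched.
            mapping[qid] = str(uuid.uuid4())
    return mapping


def write_mapping(mapping: Dict[str, str], handle: TextIO) -> None:
    """Write the mapping as CSV with a header row (column names are assumptions)."""
    writer = csv.writer(handle)
    writer.writerow(["wiki_id", "person_id"])
    for qid in sorted(mapping):
        writer.writerow([qid, mapping[qid]])


if __name__ == "__main__":
    first = build_mapping(["Q00000001", "Q00000002"])  # placeholder QIDs
    second = build_mapping(["Q00000001", "Q00000003"], existing=first)
    buffer = io.StringIO()
    write_mapping(second, buffer)
    print(buffer.getvalue())
```

The point of the `existing` argument is that rerunning the script extends the file instead of reshuffling it.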

ninpnin commented 1 year ago

I suggest we use firstname_lastname_yyyymmdd (birthdate). It is static given that the primary name of the person and the birthdate don't change, and for the most part they shouldn't. I have also checked that there are no conflicts. On the other hand, only using birthyear leads to a handful of conflicting IDs.

If the birthday isn't available, we would use firstname_lastname_yyyymmXX or firstname_lastname_yyyyXXXX.
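The scheme above can be sketched roughly as follows. The function names, the lowercasing, and the rule that a day without a month is masked are assumptions made for illustration; the conflict check simply reports IDs that occur more than once.

```python
from collections import Counter
from typing import Iterable, List, Optional


def make_person_id(first: str, last: str, year: int,
                   month: Optional[int] = None,
                   day: Optional[int] = None) -> str:
    """Build firstname_lastname_yyyymmdd, masking unknown date parts with X."""
    mm = f"{month:02d}" if month is not None else "XX"
    # Assumption: a day is only meaningful when the month is known too.
    dd = f"{day:02d}" if (month is not None and day is not None) else "XX"
    # Assumption: names are lowercased; no other normalisation is done.
    return f"{first}_{last}".lower() + f"_{year:04d}{mm}{dd}"


def find_conflicts(ids: Iterable[str]) -> List[str]:
    """Return every ID that occurs more than once, sorted."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical person, used only to show the three precision levels.
    print(make_person_id("Anna", "Berg", 1901, 5, 7))
    print(make_person_id("Anna", "Berg", 1901, 5))
    print(make_person_id("Anna", "Berg", 1901))
```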

MansMeg commented 1 year ago

People change names, so this might be confusing long term. Maybe just use a UUID? That we know will be persistent.

salgo60 commented 1 year ago

I would say you should have IDs for everything (parties, MPs, departments, electoral districts, subjects, ...) and do as Wikidata does: just an ID with no meaning (the Q comes from the name of Denny's wife, Qamarniso, Q61768970).

The Swedish Riksdag has a solution where just the last part of the URL is the GUID; the rest is a slug that only makes the URL more "user-friendly", or more complex 💣
The ulf-kristersson part is just there to make it human-readable
URL Slug Best Practices

redirect

Another lesson learned is to support redirects ---> When, e.g., Riksdagen makes a mistake (see #88) and adds two IDs for the same person (and never fixes it 😢), it is easy to end up with "2 people" as well --> they should be merged on your side, and IF the end user still has the "old ID", they should find the merged target --> owl:sameAs
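A minimal sketch of the redirect idea, assuming merges are recorded as a simple old-ID to new-ID table (the table and the function name are hypothetical). A lookup follows the chain until it reaches an ID that has not been merged away, and refuses to loop forever if the table is corrupt.

```python
from typing import Dict


def resolve(pid: str, redirects: Dict[str, str]) -> str:
    """Follow merge redirects until a current (non-redirected) ID is reached."""
    seen = {pid}
    while pid in redirects:
        pid = redirects[pid]
        if pid in seen:
            # A cycle means the redirect table itself is broken.
            raise ValueError(f"redirect cycle at {pid}")
        seen.add(pid)
    return pid


if __name__ == "__main__":
    # Hypothetical merges: "old-a" was merged into "old-b", then into "current".
    table = {"old-a": "old-b", "old-b": "current"}
    print(resolve("old-a", table))
```

A user holding an old ID then still lands on the merged target, which is the behaviour asked for above.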

MansMeg commented 1 year ago

That sounds like a good idea. Best of both worlds. =)

BobBorges commented 1 year ago

Why are the wiki_ids not persistent? It seems like the least expensive solution (for us, since we used the QIDs in protocol documents) would be to convince wikidata to make the QIDs persistent.

MansMeg commented 1 year ago

@salgo60 knows this better than me. But I think the core problem is that anyone can create a new person (hence a new ID). This can then be merged. So it is a ”flaw” of the Wikidata structure.

In addition, Wikidata would like us to have a persistent ID that they could reference, i.e. our corpus will (after 1.0) be a reference for the quality control of Wikidata.

I hope this explains why.

salgo60 commented 1 year ago

@MansMeg @ninpnin maybe it's time to start the process of getting persistent unique Welfare State Analytics IDs #269

See how Nobelprize.org redesigned its data with an API, after which @miroli proposed a Wikidata property, P8024 --> we can now access the WD object using the Nobelprize unique ID...

Design task T248939
Example 2022 Nobelprize winners
- Svante Pääbo API 1011 --> Wikidata = https://hub.toolforge.org/P8024:1011?site=wd
- 1012 --> Wikipedia en P8024:1012 - wikidata https://hub.toolforge.org/P8024:1012?site=wd
- 1013 --> Wikipedia en P8024:1013 - wikidata https://hub.toolforge.org/P8024:1013?site=wd
- 1014 --> Wikipedia en P8024:1014 - wikidata https://hub.toolforge.org/P8024:1014?site=wd
- ....
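The lookup URLs in the list above all follow one pattern, which can be sketched as a small helper. The function name is made up, and only the `site=wd` value shown in the examples is assumed to exist.

```python
from urllib.parse import quote


def hub_url(prop: str, external_id: str, site: str = "wd") -> str:
    """Build a hub.toolforge.org lookup URL of the form used in the examples."""
    # The external ID is percent-encoded so unusual characters cannot break the URL.
    return f"https://hub.toolforge.org/{prop}:{quote(external_id, safe='')}?site={site}"


if __name__ == "__main__":
    for laureate_id in ("1011", "1012", "1013", "1014"):
        print(hub_url("P8024", laureate_id))
```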

salgo60 commented 1 year ago

> @salgo60 knows this better than me. But I think the core problem is that anyone can create a new person (hence a new ID). This can then be merged. So it is a ”flaw” of the Wikidata structure.
>
> In addition, Wikidata would like us to have a persistent ID that they could reference, i.e. our corpus will (after 1.0) be a reference for the quality control of Wikidata.
>
> I hope this explains why.

I would say that Wikidata is not designed to be the source, and it's better, as I describe above, that you have a unique persistent ID: the update frequency in WD is crazy, and it's an open system with its strengths and weaknesses... Also, supporting > 200 languages makes this equation nearly impossible, and we merge a lot; see the real-time stream.

The design, as I understand it, is not about the truth but about what other sources claim --> Wikidata can also store contradicting facts...

A good article about how Wikidata was born was written for Wikidata's 10-year celebration in 2022. Wikidata will also be presented at The Web Conf; see tweet/video trailer for the talk.

possibility to have more facts with contradicting values
rank the preferred one
1. see how we can track facts from Riksarkivet SBL #33 and how we also track the reason why we don't trust what Riksarkivet SBL presents, like "contemporary constraint issue Q74557669" / "not confirmed by birth records Q111149276"

BobBorges commented 1 year ago

@MansMeg @ninpnin @fredrik1984 @liamtabib

We discussed persistent IDs this morning. There's already an open issue, so I didn't want to start a new one. Regardless of the format we use for the IDs, it seems like we need to obtain/create a property item on Wikidata, something like SWERIK_MP_ID. According to this, such a property needs to be proposed and discussed "for some time" before it can be approved. @salgo60, do we know if it's already been proposed and/or how long "some time" is? Maybe we should decide on the property name and propose it ASAP if it hasn't been done already.

There has been discussion about whether to use name/birth date or a UUID. I see the sense in using a UUID, but also the sense in having a deterministic ID, so I suggest that we create a UUID deterministically using the primary name/surname and birth date as a seed (we can use pyriksdagen.utils.get_formatted_uuid as a starting point). Best of both worlds?
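One way to get a deterministic UUID from a seed is a name-based (version 5) UUID from Python's standard library. This is only a sketch of the idea and not necessarily what pyriksdagen.utils.get_formatted_uuid does; the namespace URL, the seed layout and the function name are assumptions.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
PERSON_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/person")


def person_uuid(first: str, last: str, birthdate: str) -> uuid.UUID:
    """Derive a version 5 UUID from primary name, surname and birth date."""
    # Assumption: fields are joined with "|" so that field boundaries
    # cannot blur together when names contain spaces.
    seed = f"{first}|{last}|{birthdate}"
    return uuid.uuid5(PERSON_NAMESPACE, seed)


if __name__ == "__main__":
    # Hypothetical person; the same input always yields the same UUID.
    print(person_uuid("Anna", "Berg", "1901-05-07"))
```

Determinism only helps at the moment of assignment. If a name or birth date is later corrected, recomputing would give a different value, so the assigned UUID would have to be stored rather than regenerated.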

What do you all say?

liamtabib commented 1 year ago

Good idea!

MansMeg commented 1 year ago

That works for me. The only important thing is that the IDs are persistent, i.e. we need to commit to the IDs, and they will never change after they are assigned to an individual. How we create them is less important, as long as they are UUIDs.

I think the discussions on Wikidata will be less of a problem if we set up a persistent ID, since these IDs will probably be the only persistent IDs for MPs going far back in time.

salgo60 commented 1 year ago

WD needs a formatter string and some examples.

See what a proposal looks like; this is one I created at 11:39, 21 September 2016:
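A formatter string on Wikidata is a URL template in which `$1` stands for the identifier value. The sketch below shows the substitution with a made-up template; the example.org URL and the function name are placeholders, not a proposed address.

```python
def expand_formatter(formatter: str, value: str) -> str:
    """Replace the $1 placeholder in a formatter URL with an identifier value."""
    if "$1" not in formatter:
        raise ValueError("formatter must contain the $1 placeholder")
    return formatter.replace("$1", value)


if __name__ == "__main__":
    # Hypothetical landing-page template and the example UUID from this thread.
    template = "https://example.org/person/$1"
    print(expand_formatter(template, "6a28a4b0-8f46-4134-a88e-2645b704c9fc"))
```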

https://www.wikidata.org/wiki/Wikidata:Property_proposal/SBL

Anyone can create a proposal and everyone can comment and vote on it.... my experience is that it takes some weeks to get it approved...

I am out kayaking this week and can help you when I am back, but it is not rocket science, so give it a try...

One thought I had is whether we could use a Libris URI or the one Riksdagen has, depending on where you will store your data

Landing pages

Would be nice if you had landing pages --> we could link you from Swedish Wikipedia

objects like

Swedish PM
parties
electoral districts
...

It's easy to extract text and pictures from Swedish Wikipedia; see the examples I did for people building an app about Swedish cemeteries

OT there is a WD conference

It would be interesting if you shared your experience, as researchers, of working with Wikidata (see tweet): what is missing and what could be better...

UPDATE: Wikidata Modelling Days 2023: it looks like the researcher Daniel Mietchen is taking part; he is also involved in designing Scholia, see video

fredrik1984 commented 1 year ago

237

BobBorges commented 1 year ago

I'll draft a text for the Motivation part of the wikidata proposal in the next couple of days and post it here for commentary before submitting it. I think there's one unsettled issue, though. There's some consensus on using a UUID solution, but do we want to add some kind of human readable segment so it's clear that these are our UUIDs? E.g.: "SWERIK-6a28a4b0-8f46-4134-a88e-2645b704c9fc" or similar? @salgo60 @ljo any thoughts or best-practices around this?

salgo60 commented 1 year ago

1) Unique is the key, and having a human-readable string may add value or just complexity 😃

Riksdagen has the last part after _ as the key, and the text before it is just a slug; see #issuecomment-1502953996

Extra bonuses that can be added once approved:
a) a regular expression Property:P1793 --> we can easily catch wrong edits

For SWERIK-6a28a4b0-8f46-4134-a88e-2645b704c9fc, ChatGPT suggests:

^SWERIK-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$

b) URL match pattern Property:P8966: we have tools using the URL to understand what Wikidata property it relates to, e.g. ^http?:\/\/(?:www.)?fossilworks.org\/cgi-bin\/bridge.pl\?a=taxonInfo&taxon_no=(([1-9]\d{0,5})) relates to Property:P842
c) stability of property value Property:P2668
d) formatter URI for RDF resource Property:P1921
e) property constraints: Wikidata has the possibility to add rules such as unique (see Help:Property_constraints_portal)
f) will this PID also support lexemes? Wikidata has > 41000 Swedish lexemes, see example riksdagen
g) owned by Property:P127
h) issue tracker URL Property:P1401
i) user manual URL Property:P2078
j) always nice to understand how it's used, see used by Property:P1535. I hope those PIDs will be used by Riksarkivet, Riksarkivet SBL, RAÄ, LIBRIS, Europeana, Riksdagens open data.....
k) API endpoint URL Property:P6269
l) SPARQL endpoint Property:P5305 .....
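The regular expression suggested under a) can be tried out directly. This sketch uses the pattern exactly as quoted in the thread; the function name is made up, and `fullmatch` is used so that a trailing newline does not slip through.

```python
import re

# Pattern as quoted above: "SWERIK-" prefix followed by a canonical UUID.
SWERIK_ID = re.compile(
    r"^SWERIK-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def is_valid_swerik_id(candidate: str) -> bool:
    """Return True when the whole string matches the proposed format."""
    return SWERIK_ID.fullmatch(candidate) is not None


if __name__ == "__main__":
    print(is_valid_swerik_id("SWERIK-6a28a4b0-8f46-4134-a88e-2645b704c9fc"))
    print(is_valid_swerik_id("6a28a4b0-8f46-4134-a88e-2645b704c9fc"))
```

If the prefix is dropped in favour of a bare UUID, only the leading `SWERIK-` part of the pattern would need to go.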

salgo60 commented 1 year ago

It would be cool if we could publish your push/release tests as linked data. We have the Software_quality_assurance property (Property:P2992), which could maybe be used for adding all the tests you do --> we then create a Q number for each test, e.g. a check in Wikidata that Swedish MPs do not
- have position held "member of the First Chamber" and "member of the Second Chamber" at the same time
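A check like the one described above boils down to an interval-overlap test. The sketch below assumes each membership is a (start, end) pair of dates with an exclusive end date; the data layout and the function names are assumptions, not how the corpus or Wikidata stores it.

```python
from datetime import date
from typing import List, Tuple

Period = Tuple[date, date]  # (start, end); the end date is treated as exclusive


def overlaps(a: Period, b: Period) -> bool:
    """Two periods overlap when each one starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def chamber_conflicts(first: List[Period],
                      second: List[Period]) -> List[Tuple[Period, Period]]:
    """Return every pair of First/Second Chamber periods that overlap."""
    return [(f, s) for f in first for s in second if overlaps(f, s)]


if __name__ == "__main__":
    # Hypothetical memberships for one person.
    first_chamber = [(date(1920, 1, 1), date(1925, 1, 1))]
    second_chamber = [(date(1924, 6, 1), date(1930, 1, 1))]
    print(chamber_conflicts(first_chamber, second_chamber))
```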
OT: Wikidata has started to release Wikifunctions (video), and on 2023-10-25 "Running on WebAssembly" was released

salgo60 commented 1 year ago

Good document about persistent identifiers and see also my "The Magnus list" created 2021 "One way to design a system to be a good external identifier in Wikidata" this list was mentioned by David Shorthouse at 27:50 in the Stanford video - slides "Keepin 'N Sync... with wikidata ... and ORCID...and GBIF"

A Persistent Identifier (PID) policy for the European Open Science Cloud (EOSC)

A good design pattern: use tombstone pages

see DIGG discussion "Inaktiva PID / avpublicerat material" ("Inactive PIDs / unpublished material")
see above PID document

How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

doi.org/10.1101/117812

salgo60 commented 1 year ago

I have also tried to get Riksarkivet to support archived documents and PIDs --> status: work in progress 😢😭. Maybe your project can explain that PID support in archives is very important for researchers.

"#17 Riksarkivet: Hantera Persistenta Identifierare i arkiverat material via API:er vad är status?" ("Riksarkivet: handling persistent identifiers in archived material via APIs, what is the status?")

Today, I perceive that there is no one else on the line when it comes to discussing persistent identifiers and how they should be supported in archives. DIGG's project does not seem to firmly decide that the National Archives and the Royal Library (KB) should handle this.

tweet

salgo60 commented 1 year ago

List with links to the WD proposal discussions for the latest created properties

salgo60 commented 1 year ago

> but do we want to add some kind of human readable segment so it's clear that these are our UUIDs

@BobBorges doi.org/10.1101/117812 states in Lesson 3. Opt for simple, durable web resolution

> Trailing characters after the local ID are discouraged as they unnecessarily increase the variability with which the identifier is represented and also complicate straightforward appending of the local ID

MansMeg commented 1 year ago

I think going with a pure UUID is probably the simplest. I don't see the value of adding SWERIK as a slug. Ideally the PID will live longer (with the corpus) than the SWERIK project name.

salgo60 commented 1 year ago

@MansMeg Isn't SWERIK used for every PID? That, I feel, is not a problem; it may make it easier to understand the context of the PID ... The problem I see is that when doing as Riksdagen does, you get problems not knowing whether you have found the same PID...

I hope we in Sweden will move in the direction of creating our own resolving service, something like a Swedish DOI, maybe SWEDOI

Observer pattern

I think loosely coupled systems should implement the observer pattern so that you can more easily show citation graphs; see my suggestion to the DIGG people, "Best practice needed for understanding who is referencing my PID" and "#17 Vem använder en identifierare" ("Who uses an identifier")

MansMeg commented 1 year ago

I see that point. But I doubt the SWERIK name will live long enough. Whatever slug we use, we will have this or similar problems. Just going with a UUID is probably the easiest minimal viable option and would have the least long-term risk, I think.

BobBorges commented 1 year ago

There's some motivation for a persistent SWERIK person ID here: https://docs.google.com/document/d/10_SEVNI7dF46hhnucTps242ntSr1nm_R3EHC7_9Mkjk/edit?usp=sharing

Modeled on @salgo60's example in scope/length/level of detail. Feel free to add any commentary directly to that google document.

MansMeg commented 1 year ago

This is excellent @BobBorges !

I will read and comment. I think this is an issue we can discuss now, and then have a discussion with the TAB next Friday as a last pair of eyes before we go forward and implement.

salgo60 commented 1 year ago

I think one good motivation is that with your own persistent identifier you can VERY easily start using SKOS and explain a difference from Wikidata, Riksdagens öppna data, Riksarkivet SBL, the book "Tvåkammar Riksdagen".....

the party we call xxx is a broader term than WD yyy - skos:broader

Wikidata merges a lot - maybe too much....

example merges of Swedish PMs the last 1000 days

salgo60 commented 1 year ago

> There's some motivation for a persistent SWERIK person ID here:

@BobBorges The best motivation, I feel, is FAIRDATA F1: as you produce research data, it should be FAIRDATA.

> Principle F1 is arguably the most important because it will be hard to achieve other aspects of FAIR without globally unique and persistent identifiers

one Wikidata anti-pattern

One anti-pattern I see in Wikidata is that "every" source is expected to confirm the birth date of Selma Lagerlöf (Q44519#P569): right now it has 23 references

The Wikidata model lacks a trust dimension. I asked Denny, the WD designer, for his point of view and wrote a blog post about it: "WikidataCon 2019: We need a better model communicating quality/relevance of sources in Wikidata / Provenance"

salgo60 commented 1 year ago

I did a small test using PROV with ChatGPT, which also shows how good the change tracking in SPA (Svensk Porträttarkiv) is when you use the API; link: 139#issuecomment-1806804671

tweet Pelle Snickars to get him into the loop

BobBorges commented 1 year ago

https://www.wikidata.org/wiki/Wikidata:Property_proposal/Person#SWERIK_Person_ID

salgo60 commented 1 year ago

If you have a wiki account, don't hesitate to support it. Syntax:

{{s}} - ~~~~
@miroli @monirbounadi

https://www.wikidata.org/wiki/Wikidata:Property_proposal/SWERIK_Person_ID

salgo60 commented 1 year ago

@BobBorges I heard comments on your statement:

> Wikidata IDs, however, are dynamic, and with each update, a handful of errors occur due to mismatched IDs in the dynamic database and static quality control files

As I have said several times before: should I show you WD? What can happen is that two IDs are merged…

A merge creates a redirect from the old ID to the new one… and, if we speak semantics, SKOS exactMatch

the problem with Wikidata is that most people are not domain experts and as it’s an open system we also get anonymous edits and vandalism….

BobBorges commented 1 year ago

I understand the reason for changes -- our issue is that part of our work involves static files, e.g. manually curated, theoretically correct data with sources, that we want to check against info extracted with new queries to wikidata.

Do I need to do something more with this, or is your edit enough?
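The kind of check described in this comment, static curated data compared against a fresh Wikidata query, could tolerate merges if a redirect table is applied first. A rough sketch under that assumption; the dictionaries, the placeholder QIDs and the function name are hypothetical and do not reflect the project's actual quality control code.

```python
from typing import Dict, List, Tuple


def check_against_fresh(curated: Dict[str, str],
                        fresh: Dict[str, str],
                        redirects: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare curated values with freshly queried ones, following merges.

    Returns (missing, mismatched) lists of curated QIDs.
    """
    missing, mismatched = [], []
    for qid, expected in curated.items():
        # A merged QID is looked up under the ID it now redirects to.
        current = redirects.get(qid, qid)
        if current not in fresh:
            missing.append(qid)
        elif fresh[current] != expected:
            mismatched.append(qid)
    return missing, mismatched


if __name__ == "__main__":
    curated = {"Q_OLD": "1901-05-07", "Q_B": "1910-01-01", "Q_C": "1920-02-02"}
    fresh = {"Q_NEW": "1901-05-07", "Q_B": "1911-01-01"}
    print(check_against_fresh(curated, fresh, {"Q_OLD": "Q_NEW"}))
```

With a persistent ID of our own as the key, the redirect step would move to the mapping file instead of being scattered over every check.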

salgo60 commented 1 year ago

@BobBorges wait and see; we now have enough people, I guess, to get this approved… The next step is to get the attention of a wiki admin, which could take one minute or several weeks 😢

salgo60 commented 11 months ago

FYI: I added P12192 to Template:Sweden_properties / diff and Template:Politician_properties / diff