welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Make metadata IDs persistent #269

Closed by MansMeg 10 months ago

MansMeg commented 1 year ago

The Wikidata people need to be able to cite our corpus (from version 1.0) as a reference for their data. Hence we should make our IDs persistent.

ninpnin commented 1 year ago

I suggest we use firstname_lastname_yyyymmdd (birthdate). It is static given that the primary name of the person and the birthdate don't change, and for the most part they shouldn't. I have also checked that there are no conflicts. On the other hand, only using birthyear leads to a handful of conflicting IDs.

If the birthday isn't available, we would use firstname_lastname_yyyymmXX or firstname_lastname_yyyyXXXX.
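
A minimal sketch of the proposed format, assuming names are lowercased and internal whitespace is hyphenated (the thread does not specify any normalisation, and `name_birthdate_id` is a hypothetical helper, not part of pyriksdagen):

```python
import re
from typing import Optional


def name_birthdate_id(first: str, last: str, year: int,
                      month: Optional[int] = None,
                      day: Optional[int] = None) -> str:
    """Build firstname_lastname_yyyymmdd, using X for unknown date parts."""
    def slug(name: str) -> str:
        # Assumed normalisation: lowercase, hyphenate internal whitespace
        return re.sub(r"\s+", "-", name.strip().lower())

    mm = f"{month:02d}" if month is not None else "XX"
    # A day without a known month is treated as unknown as well
    dd = f"{day:02d}" if (month is not None and day is not None) else "XX"
    return f"{slug(first)}_{slug(last)}_{year:04d}{mm}{dd}"


# Hypothetical person, for illustration only
print(name_birthdate_id("Anna", "Andersson", 1901, 3, 7))
print(name_birthdate_id("Anna", "Andersson", 1901, 3))
print(name_birthdate_id("Anna", "Andersson", 1901))
```

The ID is only as stable as its inputs: a corrected birth date or a changed primary name would produce a different string.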

MansMeg commented 1 year ago

People change names, so this might be confusing long term. Maybe just use a UUID? That we know will be persistent.

salgo60 commented 1 year ago

I would say you should have IDs for everything (parties, members of parliament, departments, electoral districts, subjects, ...) and do as Wikidata does: just an ID with no meaning (the Q comes from the name of Denny's wife, Qamarniso, Q61768970)

redirect

Another lesson learned is to support redirects. When, e.g., Riksdagen makes a mistake (#88) and adds 2 IDs for the same person (and never fixes it 😱), it's easy for you to end up with "2 people" as well. They should be merged on your side, and if the end user still has the "old ID" they should find the merged target --> owl:sameAs
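
A minimal sketch of the redirect idea, assuming a simple in-memory mapping (`IdRegistry` and its methods are hypothetical names, not an existing API): a merged ID is never deleted, it just points at the surviving ID.

```python
from typing import Dict


class IdRegistry:
    """Toy registry: merged IDs redirect to the surviving ID."""

    def __init__(self) -> None:
        self._redirects: Dict[str, str] = {}

    def merge(self, old_id: str, target_id: str) -> None:
        """Record that old_id was a duplicate of target_id."""
        if old_id in self._redirects:
            raise ValueError(f"{old_id} has already been merged")
        if self.resolve(target_id) == old_id:
            raise ValueError("merge would create a redirect cycle")
        self._redirects[old_id] = target_id

    def resolve(self, some_id: str) -> str:
        """Follow redirects until reaching an ID that was never merged away."""
        while some_id in self._redirects:
            some_id = self._redirects[some_id]
        return some_id


registry = IdRegistry()
registry.merge("id-a", "id-b")   # id-a turned out to be a duplicate of id-b
registry.merge("id-b", "id-c")   # later, id-b was merged into id-c
print(registry.resolve("id-a"))  # a user holding the old ID still finds the target
```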

MansMeg commented 1 year ago

That sounds like a good idea. Best of both worlds. =)

BobBorges commented 1 year ago

Why are the wiki_ids not persistent? It seems like the least expensive solution (for us, since we used the QIDs in protocol documents) would be to convince wikidata to make the QIDs persistent.

MansMeg commented 1 year ago

@salgo60 knows this better than me. But I think the core problem is that anyone can create a new person (hence a new ID). This can then be merged. So it is a ”flaw” of the Wikidata structure.

In addition, Wikidata would like us to have persistent IDs that they could reference, i.e. our corpus will (after 1.0) be a reference for the quality control of Wikidata.

I hope this explains why.

salgo60 commented 1 year ago

@MansMeg @ninpnin maybe it's time to start the process of getting persistent, unique Welfare State Analytics IDs #269

See how Nobelprize.org redesigned its data with an API, after which @miroli proposed the Wikidata property P8024 --> we can now access the WD object using the Nobelprize unique ID...

salgo60 commented 1 year ago

> @salgo60 knows this better than me. But I think the core problem is that anyone can create a new person (hence a new ID). This can then be merged. So it is a ”flaw” of the Wikidata structure.
>
> In addition, Wikidata would like us to have persistent IDs that they could reference, i.e. our corpus will (after 1.0) be a reference for the quality control of Wikidata.
>
> I hope this explains why.

I would say that Wikidata is not designed to be the source, and it's better, as I describe above, that you have a unique persistent ID: the update frequency in WD is crazy, and it's an open system with its strengths and weaknesses. Supporting > 200 languages also makes this equation nearly impossible, and we merge a lot (see the real-time stream).

The design, as I understand it, is not about the truth but about what other sources claim --> Wikidata can also store contradicting facts...

  1. possibility to have more facts with contradicting values
  2. rank the preferred one
    1. see how we can track facts from Riksarkivet SBL #33 and how we also track the reason why we don't trust what Riksarkivet SBL presents, like "contemporary constraint issue Q74557669" / "not confirmed by birth records Q111149276"
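
The ranking idea in the list above can be sketched roughly as follows. Wikidata statements carry one of three ranks (preferred, normal, deprecated); the `Claim` class, the sources and the dates below are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    value: str
    rank: str                    # "preferred", "normal" or "deprecated"
    source: str
    reason_deprecated: str = ""


def best_claims(claims: List[Claim]) -> List[Claim]:
    """Return preferred claims if any exist, otherwise the normal ones."""
    preferred = [c for c in claims if c.rank == "preferred"]
    if preferred:
        return preferred
    # Deprecated claims stay stored but are never returned as "best"
    return [c for c in claims if c.rank == "normal"]


# Hypothetical, contradicting birth dates for one person
birth_dates = [
    Claim("1850-02-01", "preferred", "birth records"),
    Claim("1850-02-11", "deprecated", "Riksarkivet SBL",
          reason_deprecated="not confirmed by birth records (Q111149276)"),
    Claim("1850-02-01", "normal", "another source"),
]
print([c.source for c in best_claims(birth_dates)])
```

Both contradicting values stay in the data; the rank only decides which one a consumer sees by default.
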
BobBorges commented 1 year ago

@MansMeg @ninpnin @fredrik1984 @liamtabib

We discussed persistent IDs this morning. There's already an open issue, so I didn't want to start a new one. Regardless of the format we use for the IDs, it seems like we need to obtain/create a property item on Wikidata, something like SWERIK_MP_ID. According to this, such a property needs to be proposed and discussed "for some time" before it can be approved --- @salgo60, do we know if it's already been proposed and/or how long "some time" is? Maybe we should decide on the property name and propose it ASAP if it hasn't been done already.

There has been discussion about whether to use name/birth date or a UUID. I see the sense in using a UUID, but also the sense in having a deterministic ID -- I suggest that we create a UUID deterministically using the primary name/surname and birth date as a seed (we can use pyriksdagen.utils.get_formatted_uuid as a starting point) -- best of both worlds?
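
One way to sketch a seeded, deterministic UUID uses name-based UUIDs (version 5) from Python's standard library. This does not use or describe `pyriksdagen.utils.get_formatted_uuid`; the namespace string, the seed layout and `person_uuid` are assumptions made for illustration.

```python
import uuid

# Hypothetical project namespace. It has to be fixed once and never changed,
# because every derived ID depends on it.
SWERIK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "swerik.example")


def person_uuid(first_name: str, surname: str, birth_date: str) -> uuid.UUID:
    """Derive a version-5 UUID from primary name and birth date."""
    # Assumed seed layout: normalised name parts and an ISO date, "|"-separated
    seed = f"{first_name.strip().lower()}|{surname.strip().lower()}|{birth_date}"
    return uuid.uuid5(SWERIK_NAMESPACE, seed)


# Hypothetical person: the same seed always yields the same UUID
print(person_uuid("Anna", "Andersson", "1901-03-07"))
```

The seed would only matter at assignment time. Once an ID has been published it would be stored and kept, even if the name or birth date is later corrected.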

What do you all say?

liamtabib commented 1 year ago

Good idea!

MansMeg commented 1 year ago

That works for me. The only important thing is that the IDs are persistent, i.e. we need to commit to the IDs, and they will never change after they are assigned to an individual. How we create them is less important, as long as they are UUIDs.

I think the discussions on Wikidata will be less of a problem if we set up a persistent ID, since these IDs will probably be the only persistent IDs for MPs going far back in time.

salgo60 commented 1 year ago

WD needs a formatter string and some examples
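
A small sketch of what this could look like, assuming "formatter string" refers to Wikidata's formatter URL convention, in which `$1` stands for the identifier. The URL below is a placeholder, not a real landing page, and `format_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Placeholder formatter URL; "$1" is replaced by the identifier value
FORMATTER_URL = "https://example.org/person/$1"


def format_url(formatter: str, identifier: str) -> str:
    """Expand a formatter URL by substituting the identifier for $1."""
    return formatter.replace("$1", quote(identifier, safe=""))


print(format_url(FORMATTER_URL, "6a28a4b0-8f46-4134-a88e-2645b704c9fc"))
```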

See what a proposal looks like; I created this one at 11:39, 21 September 2016:

https://www.wikidata.org/wiki/Wikidata:Property_proposal/SBL

Anyone can create a proposal and everyone can comment and vote on it.... my experience is that it takes some weeks to get it approved...

I am out kayaking this week and can help you when I am back but it is no rocket science so give it a try...

One thought I had is whether we could use a Libris URI or the one Riksdagen has, depending on where you will store your data

Landing pages

Would be nice if you had landing pages --> we could link you from Swedish Wikipedia

objects like

It's easy to extract text and pictures from Swedish Wikipedia; see the examples I did for people building an app about Swedish cemeteries

OT there is a WD conference

It would be interesting if you, as researchers, shared your experience of working with Wikidata (see tweet): what is missing and what could be better...

UPDATE: At Wikidata modelling days 2023, it looks like a researcher, Daniel Mietchen, is taking part; he is also involved in designing Scholia (see video)

fredrik1984 commented 1 year ago

237

BobBorges commented 1 year ago

I'll draft a text for the Motivation part of the wikidata proposal in the next couple of days and post it here for commentary before submitting it. I think there's one unsettled issue, though. There's some consensus on using a UUID solution, but do we want to add some kind of human readable segment so it's clear that these are our UUIDs? E.g.: "SWERIK-6a28a4b0-8f46-4134-a88e-2645b704c9fc" or similar? @salgo60 @ljo any thoughts or best-practices around this?

salgo60 commented 1 year ago

1) Unique is the key, and having a human-readable string may add value or just complexity 😃

Extra bonuses can be added once it is approved: a) a regular expression, Property:P1793 --> we can easily catch wrong edits

^SWERIK-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$

b) URL match pattern Property:P8966: we have tools using the URL to understand what Wikidata property it relates to, e.g. ^http?:\/\/(?:www.)?fossilworks.org\/cgi-bin\/bridge.pl\?a=taxonInfo&taxon_no=(([1-9]\d{0,5})) relates to Property:P842
c) stability of property value Property:P2668
d) formatter URI for RDF resource Property:P1921
e) property constraints: Wikidata has the possibility to add rules such as unique (see Help:Property_constraints_portal)
f) will this PID also support lexemes? Wikidata has > 41000 Swedish lexemes, see example riksdagen
g) owned by Property:P127
h) issue tracker URL Property:P1401
i) user manual URL Property:P2078
j) always nice to understand how it's used, see used by Property:P1535. I hope those PIDs will be used by Riksarkivet, Riksarkivet SBL, RAÄ, LIBRIS, Europeana, Riksdagens open data.....
k) API endpoint URL Property:P6269
l) SPARQL endpoint Property:P5305 .....
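
The regular expression from item a) can be tried locally. This is a small sketch using Python's `re` module with the pattern exactly as given above; `is_valid_swerik_id` is a hypothetical helper name.

```python
import re

# Pattern copied from item a): a "SWERIK-" prefix followed by a UUID
SWERIK_ID = re.compile(
    r"^SWERIK-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def is_valid_swerik_id(candidate: str) -> bool:
    """True if the whole string matches the proposed ID format."""
    return SWERIK_ID.fullmatch(candidate) is not None


print(is_valid_swerik_id("SWERIK-6a28a4b0-8f46-4134-a88e-2645b704c9fc"))
print(is_valid_swerik_id("6a28a4b0-8f46-4134-a88e-2645b704c9fc"))
```

If the prefix were dropped in favour of a pure UUID, only the leading `SWERIK-` in the pattern would need to go.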

salgo60 commented 1 year ago

Would be cool if we could do linked data of your push release tests; we have the Software_quality_assurance property, Property:P2992

salgo60 commented 1 year ago

Here is a good document about persistent identifiers. See also my "The Magnus list" from 2021, "One way to design a system to be a good external identifier in Wikidata". This list was mentioned by David Shorthouse at 27:50 in the Stanford video (slides: "Keepin 'N Sync... with wikidata ... and ORCID...and GBIF")

A Persistent Identifier (PID) policy for the European Open Science Cloud (EOSC)

Good design pattern: use tombstone pages

How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

salgo60 commented 1 year ago

I have also tried to get Riksarkivet to support archived documents and PIDs --> status: work in progress :sad::sob: Maybe your project can explain that PID support in archives is very important for researchers

Today, I perceive that there is no one else on the line when it comes to discussing persistent identifiers and how they should be supported in archives. DIGG's project does not seem to firmly decide that the National Archives and the Royal Library (KB) should handle this.

salgo60 commented 1 year ago
image
salgo60 commented 1 year ago

> but do we want to add some kind of human readable segment so it's clear that these are our UUIDs

@BobBorges doi.org/10.1101/117812 states in Lesson 3, "Opt for simple, durable web resolution":

> Trailing characters after the local ID are discouraged as they unnecessarily increase the variability with which the identifier is represented and also complicate straightforward appending of the local ID
MansMeg commented 1 year ago

I think going with a pure UUID is probably the simplest. I don't see the value of adding SWERIK as a slug. Ideally the PID will live longer (with the corpus) than the SWERIK project name.

salgo60 commented 1 year ago

@MansMeg Isn't SWERIK used for every PID? That, I feel, is not a problem; it may make it easier to understand the context of the PID ... The problem I see is when doing as Riksdagen does: then you get problems not knowing if you have found the same PID...

I hope we in Sweden will move in the direction of creating our own resolving service, something like a Swedish DOI, maybe SWEDOI


Maybe related: I read this paper, "Introducing Innovative Indicators to Track Sweden's Open Research Data Objective: How to Measure Progress? Defining Indicators to Track Open Research Data Across Swedish Universities"

Observer pattern

I think loosely coupled systems should implement the observer pattern so that you can more easily show citation graphs. See my suggestion to the DIGG people, "Best practice needed for understanding who is referencing my PID", and "#17 Vem använder en identifierare" ("Who uses an identifier")

MansMeg commented 1 year ago

I see that point. But I doubt the SWERIK name will live long enough. Whatever slug we use, we will have this or similar problems. Just going with a UUID is probably the easiest minimal viable ID and would have the least long-term risk, I think.

BobBorges commented 1 year ago

There's some motivation for a persistent SWERIK person ID here: https://docs.google.com/document/d/10_SEVNI7dF46hhnucTps242ntSr1nm_R3EHC7_9Mkjk/edit?usp=sharing

Modeled on @salgo60's example in scope/length/level of detail. Feel free to add any commentary directly to that google document.

MansMeg commented 1 year ago

This is excellent @BobBorges !

I will read and comment. I think this is an issue we can discuss now, and then have a discussion with the TAB next Friday as a last pair of eyes before we go forward and implement.

salgo60 commented 1 year ago

I think one good motivation is that with your own persistent identifier you can VERY easily start using SKOS and explain a difference with Wikidata, Riksdagens öppna data, Riksarkivet SBL, the book "Tvåkammarriksdagen".....

Wikidata merges a lot - maybe too much....

salgo60 commented 1 year ago

> There's some motivation for a persistent SWERIK person ID here:

@BobBorges The best motivation, I feel, is FAIRDATA F1: as you produce research data, it should be FAIRDATA.

> Principle F1 is arguably the most important because it will be hard to achieve other aspects of FAIR without globally unique and persistent identifiers

see also DOI 10.1101/117812

BobBorges commented 1 year ago

Thanks @salgo60! FAIR is a good thing to mention in the motivation. As someone with a research background, the R in FAIR seems the most problematic in our case now without persistent IDs -- How can we reuse and verify research findings when the primary keys of our database change regularly?

salgo60 commented 1 year ago

@BobBorges as a Wikidata addict I would also like to see the provenance (PROV) of every single data point, i.e. something like a more advanced version history combined with the role of who made the change.... I.e. what trust does the agent have, and what data is that change based on... I feel we see that problem with "party" vilde #139 and chatGPT using PROV

one Wikidata anti-pattern

One antipattern I see in Wikidata is that "every" source should confirm the birth of Selma Lagerlöf, Q44519#P569: right now it has 23 references

The Wikidata model lacks a trust dimension. I asked Denny, the WD designer, for his point of view and wrote a blog post about it: "WikidataCon 2019: We need a better model communicating quality/relevance of sources in Wikidata / Provenance"

salgo60 commented 1 year ago

I did a small test using PROV with chatGPT, and I also show how good the change tracking in SPA (Svenskt Porträttarkiv) is when you use the API: link 139#issuecomment-1806804671

BobBorges commented 1 year ago

https://www.wikidata.org/wiki/Wikidata:Property_proposal/Person#SWERIK_Person_ID

salgo60 commented 1 year ago

If you have a wiki account, don't hesitate to support it (see syntax)

https://www.wikidata.org/wiki/Wikidata:Property_proposal/SWERIK_Person_ID

salgo60 commented 1 year ago

@BobBorges I heard comments on your statement

> Wikidata IDs, however, are dynamic, and with each update, a handful of errors occur due to mismatched IDs in the dynamic database and static quality control files

As I have said several times before: should I show you WD? What can happen is that 2 IDs are merged.

A merge will leave a redirect from the old to the new, and if we speak semantics, that is a SKOS exactMatch.

The problem with Wikidata is that most people are not domain experts, and as it's an open system we also get anonymous edits and vandalism.

BobBorges commented 1 year ago

I understand the reason for changes -- our issue is that part of our work involves static files, e.g. manually curated, theoretically correct data with sources, that we want to check against info extracted with new queries to wikidata.

Do I need to do something more with this, or is your edit enough?

salgo60 commented 1 year ago

@BobBorges wait and see. We now have enough people, I guess, to get this approved. The next step is to get the attention of a wiki admin, which could take 1 minute or several weeks :sad:

salgo60 commented 11 months ago

FYI: I added P12192 to Template:Sweden_properties / diff and Template:Politician_properties / diff

Feels like it's set up wrong. I guess you will have persistent identifiers for everything, not just people as P31 indicates

ninpnin commented 10 months ago

@BobBorges can we close this?