welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Setup person identifier for persons speaking in the parliament #103

Closed MansMeg closed 2 years ago

MansMeg commented 2 years ago

After 0.3 there has been discussion about finding a way to store and handle members of parliament, ministers, etc. in a structure that is easy to handle but also flexible, and that also lets us get data such as party affiliation for ministers.

Below is how Riksdagens Öppna Data suggests storing persons. To me, this looks more like a way to store the parliamentarians for a certain mandate period, and it is not necessarily sufficient for us. It is, however, very similar to how we store this data now, and I think these tables are a good starting point for us.

Some problems with the current format (and with the Riksdagen open data) are:

In the long term, we would like to handle this with separate tables (such as mop_names.csv, mop_party.csv, etc.). But for now we just want to be able to connect the different files we have and know that they refer to the same persons. We also want to connect this to the Wikidata identifiers.
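As a minimal sketch of the separate-tables idea, the snippet below joins two small in-memory CSVs on a shared person identifier. The file names mop_names.csv and mop_party.csv come from the discussion; the column names (person_id, wikidata_id, name, party, from_year, to_year) and all sample rows are assumptions made only for illustration.

```python
import csv
import io

# Hypothetical contents of mop_names.csv and mop_party.csv (illustrative only).
MOP_NAMES = """person_id,wikidata_id,name
p1,Q_EXAMPLE_1,Example Person One
p2,Q_EXAMPLE_2,Example Person Two
"""
MOP_PARTY = """person_id,party,from_year,to_year
p1,Party A,1920,1924
p1,Party B,1925,1928
p2,Party A,1921,1932
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_person_id(names, parties):
    """Attach name and Wikidata identifier to every party-membership row."""
    by_id = {row["person_id"]: row for row in names}
    joined = []
    for membership in parties:
        person = by_id.get(membership["person_id"])
        if person is None:
            continue  # membership row without a known person; skipped here
        joined.append({**person, **membership})
    return joined


rows = join_on_person_id(read_rows(MOP_NAMES), read_rows(MOP_PARTY))
for row in rows:
    print(row["wikidata_id"], row["name"], row["party"], row["from_year"], row["to_year"])
```

Keeping one row per membership period means a person who changes party simply gets more rows, while the person identifier and the Wikidata identifier stay stable.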

Suggestion

The Riksdagen Open Data suggested format:

CREATE TABLE person (
intressent_id varchar(20),
född_år smallint,
kön varchar(6),
efternamn nvarchar(50),
tilltalsnamn nvarchar(50),
sorteringsnamn varchar(80),
iort varchar(40),
parti varchar(40),
valkrets varchar(50),
status varchar(100)
);

CREATE TABLE personuppdrag (
organ_kod varchar(20),
roll_kod varchar(40),
ordningsnummer int,
status varchar(20),
typ varchar(20),
[from] datetime,
tom datetime,
uppgift varchar(500),
intressent_id varchar(50)
);

CREATE TABLE personuppgift (
uppgift_kod varchar(50),
uppgift ntext,
uppgift_typ varchar(50),
intressent_id varchar(50)
);

MansMeg commented 2 years ago

I would be interested in opinions on this from @TomasSkotare , @ninpnin , @rbbby and maybe also from @ljo and @salgo60 , who have also been part of this general discussion.

salgo60 commented 2 years ago

I met the people at the Swedish Parliament in 2019, and they have some technical debt.

In Wikidata we use ShEx to define schemas (see the video about designing the one for Swedish Parliament members, EntitySchema:E134).

One good starting point is to look at how the Wikidata project WikiProject_British_Politicians has defined MPs. The person doing most of the work is Andrew Gray (Twitter: @generalising).

5 star data "same as"

salgo60 commented 2 years ago

Alias names: it depends on how you will use the data, but many members of the Swedish Parliament had "other names" to disambiguate common names; see SPA sj9PGLAlnmUAAAAAABgfbg.

In Wikidata we have one name and many aliases for every object --> Q5553830, fi, en, de, json

ninpnin commented 2 years ago

@salgo60 How comprehensive is this coverage? We rely on those extra identifiers pretty heavily in the process, as the introductions are often just "Herr X i Y:".

salgo60 commented 2 years ago

@ninpnin My best guess is that we have it in Swedish Wikipedia articles but less often in Wikidata. See the video on how you could extract them using Python, Java, JavaScript... (all of Wikidata is CC0).

My plan is that we should have everyone in Wikidata.

@ninpnin If you could give me a list of candidates, I could update Wikidata.

In my video I mentioned Bionomia and how it works: it uses Wikidata to match specimen data with the biologist who found the species. They have the same challenge of finding which person is behind a signature, and they use Wikidata aliases; see below.

How others work with Wikidata: a good presentation on how Bionomia syncs its data with Wikidata, ORCID (living persons) and GBIF (biodiversity data). They use Wikidata as a good resource, but at arm's length.

dpriskorn commented 2 years ago

I agree with Magnus that the aliases are a powerful tool to capture one thing being mentioned in multiple ways. The beautiful thing with Wikidata is that it is open for anyone to improve. So you can enter aliases there as you find them. If you are unsure about who they mean in the source you can compile a list and we can investigate together. We could even create an unknown person named "X from Y" in Wikidata and merge later with any of the known MPs once we find out who it is.

I suggest also using https://www.wikidata.org/wiki/Property:P2561 to enter the name (apart from adding it as an alias) and adding a reference to exactly where it shows up. That makes it easier for everyone to investigate at any time.

I try to think of names as a human friendly identifier. Unfortunately they are pretty bad for machines and big societies where collisions can easily occur when multiple people have exactly the same name.

MansMeg commented 2 years ago

This sounds promising. I tried to go through some of your links @salgo60 , although it was a little too much information for me to digest and make actionable.

Most computational researchers are familiar with CSV files, JSON and (some) XML. Hence I'm leaning toward storing persons in a CSV as presented above. What do you think about that? To me it seems that would be easy to sync with Wikidata. Or do you have another suggestion? JSON?

Also, regarding aliases, I agree it makes sense to store these as well. Again, for us it makes sense to store them as a CSV with a structure something like this (columns): person_id; alias_name; from_date; to_date
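To make the proposed alias table concrete, here is a minimal sketch that reads such a semicolon-separated file and looks up who carried a given alias on a given date. Only the four column names come from the suggestion above; the ISO date format, the empty to_date meaning "still valid", and the sample rows are assumptions.

```python
import csv
import io
from datetime import date

# Hypothetical alias rows in the proposed column layout (values are made up).
ALIAS_CSV = """person_id;alias_name;from_date;to_date
p1;Andersson i Exempelby;1921-01-01;1924-12-31
p2;Andersson i Exempelby;1925-01-01;
"""


def parse_date(text):
    """Return a date for an ISO string, or None for an empty (open-ended) field."""
    return date.fromisoformat(text) if text else None


def load_aliases(text):
    """Read the semicolon-separated alias table into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [
        {
            "person_id": row["person_id"],
            "alias_name": row["alias_name"],
            "from_date": parse_date(row["from_date"]),
            "to_date": parse_date(row["to_date"]),
        }
        for row in reader
    ]


def persons_for_alias(aliases, alias_name, on_date):
    """Return the person_ids that carried alias_name on the given date."""
    matches = []
    for row in aliases:
        if row["alias_name"] != alias_name:
            continue
        starts_ok = row["from_date"] is None or row["from_date"] <= on_date
        ends_ok = row["to_date"] is None or on_date <= row["to_date"]
        if starts_ok and ends_ok:
            matches.append(row["person_id"])
    return matches


aliases = load_aliases(ALIAS_CSV)
print(persons_for_alias(aliases, "Andersson i Exempelby", date(1923, 5, 1)))
print(persons_for_alias(aliases, "Andersson i Exempelby", date(1930, 5, 1)))
```

The date columns are what let the same name form point to different persons in different periods, which a plain person_id/alias_name pair could not express.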

Any thoughts on this suggestion?

salgo60 commented 2 years ago

OT: this is what Swedish Datastory does with Wikidata data on Swedish Parliament members; see "The Longest Serving MP in Sweden" (tweet).

They have done a lot of work updating Wikidata for people and documents from Riksdagen after 1971.

Disambiguating names of Swedish Parliament members: as long as Wikidata is not perfect and does not have all "special names" as aliases, we can try the following:

I checked the number of pictures of Swedish Parliament members in portrattarkiv.se (see Notebook): more than 5000 pictures, so that can also help when searching.

salgo60 commented 2 years ago

@ninpnin just so we are on the same page

In this picture it says "Magnuson i Sandviken".

Question for ninpnin: is this what this person was called in the Swedish Parliament, and something we should add as a Swedish "also known as" (shown as "Tunnetaan myös nimellä" in Finnish) on his Wikidata item Q5971868?

PS. A new video was published with nearly the same use case: Wikidata and OCCRP (WikidataCon 2021 recording).

Another tool for finding Swedish Parliament members is the wiki template Mall:Ledamöter_av_Sveriges_riksdag.

salgo60 commented 2 years ago

@MansMeg I looked into the spec "TEI Schema for Corpora of Parliamentary Proceedings", and I guess a good start is to "standardize" the objects they mention: describe them as linked data with a persistent identifier and a landing page, and try to standardize this for digital humanities projects that work with parliamentary proceedings across Europe.

I spent some time last week making a nice table of the new ministers in the German government (the Scholz cabinet) in sv:Wikipedia, "Regeringen Scholz", and I can see that it could be a challenge to model objects over time and across Europe. But by doing this we could start comparing countries in a much better way, so I guess it's the way forward. That means you digital humanists need to start modelling things together, as we do in Wikidata to make the different language versions of Wikipedia scale better.

Low-hanging fruit, I guess, are people in ministerial posts like:

image

ninpnin commented 2 years ago

@salgo60 Yes, I think, adding 'Magnusson i Sandviken' as his alias is appropriate, in the field you suggest.

I didn't manage to find him in our data as we only have data from 1920 at the moment, but here's an example.

image

People are introduced with that exact alias. In the transcribed speeches, too, you will see people referred to as 'Magnusson i Sandviken' or 'Anderson i Rasjön'. So the "also known as" (tunnetaan myös nimellä) field is 100% appropriate for this type of metadata.

MansMeg commented 2 years ago

Thanks @salgo60 . I interpret your response as saying that this type of CSV file, with the different objects coming from the Parla-Clarin format, is a good one that can easily be combined with the Wikidata structures.

What does @ninpnin think about having aliases as a CSV/tabular file? I know we have discussed this before:

person_id; alias_name; from_date; to_date

salgo60 commented 2 years ago

Don't hesitate to call me at 073-5152802. This is very complex, but I also think it is game-changing. I am also on Telegram as salgo60.

@ninpnin wrote: I didn't manage to find him in our data as we only have data from 1920 at the moment, but here's an example.

Thanks. I plan to call Lotta Åberg Brorsson at the Swedish Riksdag's library when they open (video with her from 2018). I guess she is more knowledgeable about those names. I also asked a person, FBQ, on sv:Wikipedia (link), and we found the scanned books "PortraitCatalog:Tvåkammar-riksdagen 1867-" with 4500 people in the second chamber, where everyone has a "special name". FBQ and I thought it was a little bit odd, but as you said, adding them to Wikidata as aliases will help when doing NER on names.

Suggested work process: as @dpriskorn suggested, one approach could be:

  1. Have a list of people in the Swedish Parliament and the "aliases" you find.
  2. Match them, if possible, to a Wikidata Q-number.
  3. If no match is found, we can create a Wikidata stub, i.e. just an object that holds what we know, such as when the person was active in the Swedish Parliament.
    1. Mark the object so we can easily find it and curate it later, and also mark it in your list as a WD stub with Q-number xxx.
    2. When we find out who it is, we merge it with the better object.
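The first three steps of the process above can be sketched as a simple split of the found name forms into "already matched" and "needs a stub". This is only an illustration: the lookup table, the function name match_or_stub and the placeholder Q-numbers are hypothetical, and creating or merging the actual Wikidata items is left as manual work.

```python
# Hypothetical lookup from a known name form to a Wikidata Q-number.
KNOWN_ALIASES = {"Example i Exempelby": "Q_EXAMPLE_1"}


def match_or_stub(found_aliases, known_aliases):
    """Split found aliases into matched ones and ones that need a Wikidata stub."""
    matched = {}
    needs_stub = []
    for alias in found_aliases:
        qid = known_aliases.get(alias)
        if qid is not None:
            matched[alias] = qid  # step 2: matched to an existing Q-number
        else:
            needs_stub.append(alias)  # step 3: stub candidate, to curate later
    return matched, needs_stub


matched, needs_stub = match_or_stub(
    ["Example i Exempelby", "Unknown i Exempelstad"], KNOWN_ALIASES
)
print(matched)
print(needs_stub)
```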

As mentioned, I found e.g. "Lista över ledamöter av Sveriges riksdags andra kammare 1914", which lists who was active in the second chamber in 1914, and as you can see they have the same "problem".

See the loooong video I did about this ;-) In the video I play with the browser plug-in Wikidata:Entity_Explosion.

salgo60 commented 2 years ago

@MansMeg wrote: person_id; alias_name; from_date; to_date

It depends on what scope you have. My understanding from the Bionomia developer (listen at 48:28, when I asked him about his experience with matching) is that doing NER on notes about scientific findings is normally a bit of a "High Chaparral" (a Wild West).

PS. I also asked on the sv:Wikipedia discussion page about Riksdagen's name forms: "Bybrunnen#I_Riksdagen_kallad_Pettersson_i_Bjälbo,_Petersson_i_Röstånga".

salgo60 commented 2 years ago

The answer to my question was that in the old days it was very common for the "talman" (speaker) in the Swedish Riksdag to use the name form above. It can still be used today, but mostly humorously.

A present-day example: SD has two people with the "same" name, Jonas Andersson.

salgo60 commented 2 years ago

FYI: I created a Wikidata request for a new property to store the name used in the Swedish Parliament (link).

This process can take some weeks...

If you support this, please create a Wikipedia account (sv) (fi) and add a positive vote.

The syntax for a positive vote is:

{{S}} - ~~~~ 

Update: it looks like we can support it in another way, as a user suggested.

salgo60 commented 2 years ago

We created a sv:Wikipedia article about this name form in the Swedish Parliament; see I_riksdagen_kallad. (Update: a person has now deleted the table, so you have it on this page.)

The last part is dynamic and generated from Wikidata, i.e. it shows the status of what we have done so far.

(Update: I added a reference to this project but it was deleted ;-) See the early version.)

ninpnin commented 2 years ago

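Here are some Python wrappers we might want to use for querying Wikidata:

* QWikidata https://github.com/kensho-technologies/qwikidata

* Wikirepo https://github.com/andrewtavis/wikirepo

* Wikidata https://github.com/dahlia/wikidata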

salgo60 commented 2 years ago

@ninpnin Don't hesitate to call me if you have questions: +46-735152802, or better, screen sharing, or Telegram (salgo60).

==> code

# pip install sparqlwrapper
# https://rdflib.github.io/sparqlwrapper/

import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

query = """#title:  Ledamöter med "samma I Riksdagen kallad"
SELECT DISTINCT ?nameUsedinSwedishPM1 
(SAMPLE(?svWikipedia1) AS ?svWikipedia1) (SAMPLE(?svWikipedia2) AS ?svWikipedia2) 
(SAMPLE(?person1) AS ?person1) 
(SAMPLE(?person2) AS ?person2)
WHERE {
  ?person1 p:P2561 ?nameSwedishPMp1.
  ?person2 p:P2561 ?nameSwedishPMp2.
  {
    ?nameSwedishPMp1 ps:P2561 ?nameUsedinSwedishPM1;
      pq:P3831 wd:Q110382440.
  }
  {
    ?nameSwedishPMp2 ps:P2561 ?nameUsedinSwedishPM2;
      pq:P3831 wd:Q110382440.
  }
  FILTER((?nameUsedinSwedishPM1 = ?nameUsedinSwedishPM2) && (?person1 != ?person2) 
        && (str(?person1) > str(?person2))
        )
  SERVICE wikibase:label { bd:serviceParam wikibase:language "sv,en". }
  OPTIONAL {
    ?svWikipedia1 schema:about ?person1;
      schema:inLanguage "sv";
      schema:isPartOf <https://sv.wikipedia.org/>.
  }
  OPTIONAL {
    ?svWikipedia2 schema:about ?person2;
      schema:inLanguage "sv";
      schema:isPartOf <https://sv.wikipedia.org/>.
  }
}
GROUP BY ?nameUsedinSwedishPM1 ?person1 ?person1Label
ORDER BY (?nameUsedinSwedishPM1)"""

def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    print(result)

I use pandas a lot and have changed the code so that the data is returned as a pandas DataFrame; see the Notebook example function:


import json

import pandas as pd

# sys, SPARQLWrapper and JSON are imported in the example above


def get_sparql_dataframe(endpoint_url, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    user_agent = "salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])

    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()

    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)

salgo60 commented 2 years ago

@ninpnin Let me know if I should prioritize which people I curate in Wikidata with "I Riksdagen kallad"; right now I just pick them at random. Maybe it makes more sense for you if we take people active in a specific year?

In sv:Wikipedia we have some lists (quality unknown, as always with Wikipedia).

dpriskorn commented 2 years ago

@ninpnin wrote: Here are some Python wrappers we might want to use for querying Wikidata

* QWikidata https://github.com/kensho-technologies/qwikidata

* Wikirepo https://github.com/andrewtavis/wikirepo

* Wikidata https://github.com/dahlia/wikidata

I have not tested these three, but the best library I have found so far (one that is professionally maintained and covers all of Wikibase) is https://github.com/LeMyst/WikibaseIntegrator. It is very powerful, and v0.12 has very nice APIs IMO. There are notebooks that showcase how to use it.

(I have contributed code and code review to the project)

rbbby commented 2 years ago

I am currently working on individual-level data for this issue. For some people we have multiple birth and death dates, which I thought would be of interest. I looked through some references, and often it is these that have conflicting information. This is generally not a problem, as researchers will at most use this information at the year level, in which case there are only 2 conflicts. The problematic ones are the birth date for Q18202339 (differs by 11 years) and the death date for Q5718571 (1 of 2 dates has its reference missing). The complete list of conflicting information is given below in case it is of further interest:

Multiple birth dates: Wikidata: ['Q18202339', 'Q4947860', 'Q5613770', 'Q5630560', 'Q5782765', 'Q5784568', 'Q5820037', 'Q5943976', 'Q5968645', 'Q6078640', 'Q6228020']

Multiple death dates: Wikidata: ['Q5556026', 'Q5563972', 'Q5718571', 'Q5799761', 'Q5937709', 'Q6022275', 'Q6042602', 'Q6228020', 'Q728197']
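The difference between "any conflict" and "conflict at the year level" can be illustrated with a small sketch. The input pairs below are made up (placeholder Q-numbers and dates), and the code assumes ISO-formatted dates with a four-digit year; it is not the script used to produce the lists above.

```python
from collections import defaultdict

# Hypothetical (qid, birth date) pairs; real values would come from Wikidata.
BIRTH_DATES = [
    ("Q_EXAMPLE_1", "1850-03-01"),
    ("Q_EXAMPLE_1", "1850-03-11"),  # differs by day only
    ("Q_EXAMPLE_2", "1861-07-04"),
    ("Q_EXAMPLE_2", "1872-07-04"),  # differs by year
    ("Q_EXAMPLE_3", "1880-01-01"),
]


def conflicts(pairs, year_only=False):
    """Return {qid: sorted distinct values} for persons with more than one value."""
    values = defaultdict(set)
    for qid, iso_date in pairs:
        # Assumes ISO dates, so the first four characters are the year.
        values[qid].add(iso_date[:4] if year_only else iso_date)
    return {qid: sorted(v) for qid, v in values.items() if len(v) > 1}


print(conflicts(BIRTH_DATES))
print(conflicts(BIRTH_DATES, year_only=True))
```

Comparing only the year drops conflicts that differ by day or month, which matches the observation that most of the listed conflicts disappear at the year level.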

rbbby commented 2 years ago

I also found these Wikidata objects with missing start and end dates, that is, start and end dates on the property position held (P39) when it takes any of the values Q10655178, Q33071890, Q81531912 (member of enkammarriksdagen, första kammaren and andra kammaren). Currently active members of parliament (who do not have end dates yet) are not included in these lists.

Missing start: ['Q5819783', 'Q4983135', 'Q98271639', 'Q4976825', 'Q6210385', 'Q4934552', 'Q19976148', 'Q4957371', 'Q5950466', 'Q110279970', 'Q5547315', 'Q5553916', 'Q4963592', 'Q4970175', 'Q98538839', 'Q5599215']

Missing end: ['Q98556536', 'Q98539283', 'Q98937482', 'Q98937434', 'Q98317372', 'Q97971262', 'Q97971276', 'Q98668554', 'Q6196285', 'Q16084072', 'Q5938531', 'Q5577470', 'Q98556565', 'Q98668809', 'Q5547542', 'Q5621600']
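A check like this could be expressed as a SPARQL query against the Wikidata Query Service. The sketch below only builds the query string; it assumes that start and end dates are stored as the qualifiers P580 (start time) and P582 (end time), and it does not exclude currently active members, so its results would differ from the lists above. The function name is hypothetical.

```python
# Position items named in the comment above.
POSITIONS = ["Q10655178", "Q33071890", "Q81531912"]

# Assumption: start and end dates are the qualifiers P580 and P582.
QUALIFIERS = {"start": "P580", "end": "P582"}


def missing_qualifier_query(positions, which):
    """Build a SPARQL query for P39 statements lacking a start or end qualifier."""
    qualifier = QUALIFIERS[which]
    values = " ".join(f"wd:{qid}" for qid in positions)
    return f"""SELECT DISTINCT ?person ?position WHERE {{
  VALUES ?position {{ {values} }}
  ?person p:P39 ?statement .
  ?statement ps:P39 ?position .
  FILTER NOT EXISTS {{ ?statement pq:{qualifier} ?time . }}
}}"""


query_missing_start = missing_qualifier_query(POSITIONS, "start")
print(query_missing_start)
```

The resulting string can be passed to a helper such as the get_results function shown earlier in this thread.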

salgo60 commented 2 years ago

@rbbby

Thanks. I have started a slow cleanup of Wikidata, adding "Tvåkammar-riksdagen 1867-1970" as a source and adding Iriksdagenkallad; see the list.

Question 1: which years are most important for you? I guess 1867–2021.

FYI: there is also a discussion about how to redesign sv:Wikipedia; see WD-mall_riksdagsledamot.

Question 2: can the result of this project, "riksdagen-corpus", be used to link from sv:Wikipedia? Is it described? It would be nice to have "landing pages" per

Data issues in Wikidata that were reported: I am trying to work through the list; see Feedback rbby.

/Magnus +46-735152802

OT: a good article about the CIA World Factbook and its quality, by Tony Bowden, who tries to update Wikidata: "The CIA lost track of who runs the UK, so I picked up the slack". It is about his efforts to build an open source dataset of world leaders inside Wikidata. His GitHub is @tmtmtmtm.

rbbby commented 2 years ago

@salgo60 great stuff thanks! The list will be of great use.

To answer some of your questions:

Btw, do you know how to query for the list below? Speakers of Riksdagen have, for example, position held Q1850749, but similar positions are missing for the vice speakers. https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_vice_talm%C3%A4n_i_Sveriges_riksdag

salgo60 commented 2 years ago

Don't hesitate to call me at 0735152802 and we can share screens and talk about what you want to do.

@rbbby wrote: But similar positions are missing for the vice speakers.

Then I suggest we create one. Do we have good sources for who held those positions?

rbbby commented 2 years ago

Will do, and sounds good! Wikipedia seems to have references to sources for many of the individuals. For example: https://sok.riksarkivet.se/Sbl/Presentation.aspx?id=7790 https://portrattarkiv.se/details/sj9PGLAlnmUAAAAAABfNvw

I found sources in statskalendern after a quick look too, but it seems they are missing for some early years. We have an OCR'd version of the relevant pages here: https://github.com/welfare-state-analytics/riksdagen-ocr/tree/main/statscalender

salgo60 commented 2 years ago

Sounds interesting. I need to learn more about what you have.

I saw that the file tatorter.csv has tätortskod, so it is easy to do the same as Wikidata.

In Wikidata that is Property:P775. Abbekås --> tätortskod T3300 --> haswbstatement:P775=T3300 --> Wikidata Q2199524 --> sv:Wikipedia Abbekås

or use the hub tool --> /P775:T3300?lang=sv

Property:P625 is the Wikidata property for coordinates -->

/P775:T3300?property=P625 --> redirects to OpenStreetMap
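As a rough sketch of the haswbstatement lookup mentioned above, the helper below builds a search URL for the standard MediaWiki search API on wikidata.org. It only constructs the URL and makes no network request; the function name is hypothetical, and using the `list=search` API for this lookup is an assumption about how one might automate the step, not something done in this thread.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def tatort_search_url(tatortskod):
    """Build a Wikidata search URL for items whose P775 equals the given code."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:P775={tatortskod}",
        "format": "json",
    }
    # urlencode percent-encodes the ":" and "=" inside the search string.
    return f"{WIKIDATA_API}?{urlencode(params)}"


url = tatort_search_url("T3300")
print(url)
```

Running this over every tätortskod in tatorter.csv would give one lookup URL per locality, which could then be resolved to Q-numbers.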

salgo60 commented 2 years ago

@rbbby wrote: Wikipedia seems to have references to sources for many of the individuals. For example: https://sok.riksarkivet.se/Sbl/Presentation.aspx?id=7790 https://portrattarkiv.se/details/sj9PGLAlnmUAAAAAABfNvw

image

rbbby commented 2 years ago

We essentially OCR'd the pages of statskalendern where information about Riksdagen was present. It's about 10 pages per year. Searchable PDFs are available from other sources but can be a bit difficult to work with programmatically.

Very cool with tätorter! Not sure if we use the file for anything atm but such connections will likely be very interesting for some researchers in the future.

salgo60 commented 2 years ago

@rbbby wrote: Speakers of riksdagen has for example position held Q1850749. But similar positions are missing for the vice speakers. https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_vice_talm%C3%A4n_i_Sveriges_riksdag

I created this page, Speaker of Swedish PM, but as I said, call me at +46-735152802 so we understand what you want to do. I leave Stockholm on Sunday and will be away with a worse internet connection...

salgo60 commented 2 years ago

@rbbby wrote: Speakers of riksdagen has for example position held Q1850749. But similar positions are missing for the vice speakers.

@rbbby I used OpenRefine, did some reconciliation, and uploaded the vice speakers 1867–1920 to Wikidata; see the video (needs some QA and sources).

Todo

image

salgo60 commented 2 years ago

:rocket: :rocket: Impressive work!!! Let us know how we can help you. @Ainali and some other people have done a lot of work in Wikidata related to Swedish Parliament members/documents, but this is a new, very interesting level!

Wikidata:Lexicographical data

In Wikidata we also have a project for lexicographical data. We store a lexeme like foliehatt = Lexeme:L54865 with usage examples: who uses the word "foliehatt", and which party thinks other parties have "foliehattar" ;-) It would be very interesting if we could easily reference a word usage in your corpus. See e.g. Lexeme:L54865#P5831, where I referenced data.riksdagen.se/dokument/H80939; it would be much more interesting to use your corpus with its unique identifiers and point to a specific location in the corpus. Gathering when foliehatt was first used in the Swedish Parliament would also be interesting. @dpriskorn has written a tool, dpriskorn/LexUtils, to easily find usage examples; maybe that tool could use your corpus?

Wikipedia advice

As Wikidata sometimes has more information related to the Swedish Parliament than sv:Wikipedia, it can be good to activate the gadget "Lägg till Faktamall biografi WD i biografier", which adds a template with Wikidata info; see the video.

Before: image

After: image

Wikipedia advice 2

Add the following line to your common.js:

mw.loader.load("//www.wikidata.org/w/index.php?title=User:Yair rand/WikidataInfo.js&action=raw&ctype=text/javascript");

This displays the Wikidata Q-number at the top of the Wikipedia article on sv:Wikipedia; see how I did it at sv.wikipedia.org/wiki/Användare:Salgo60/common.js.

Wikidata also lets you add new tools; see my WD common.js and more tools.

Wikidata Status

A weekly report with new properties, development status, etc. is published.

Data Reuse Days announced March 14-24

In the last Wikidata Status a new event was announced. Data Reuse Days will take place on March 14-24, highlighting applications and tools using Wikidata's data. You can already propose a session.

Tool for tracking mismatches Wikidata:Mismatch Finder

FYI: a tool is being developed for handling mismatches between Wikidata and external sources. This tool will be open and can also be used by other communities.

dpriskorn commented 2 years ago

@salgo60 wrote (Wikidata:Lexicographical data):

In Wikidata we also have a project for lexicographical data. We store a lexeme like foliehatt = Lexeme:L54865 with usage examples. It would be very interesting if we could easily reference a word usage in your corpus. See e.g. Lexeme:L54865#P5831, where I referenced data.riksdagen.se/dokument/H80939; it would be much more interesting to use your corpus with its unique identifiers and point to a specific location in the corpus. Gathering when foliehatt was first used in the Swedish Parliament would also be interesting. @dpriskorn has written a tool, dpriskorn/LexUtils, to easily find usage examples; maybe that tool could use your corpus?

Thanks for reminding me about this. I agree, this would be a unique and interesting source of examples. I opened up a new issue to track that idea in LexUtils.

salgo60 commented 2 years ago

@rbbby wrote: Btw do you know how to query for the list below? Speakers of riksdagen has for example position held Q1850749. But similar positions are missing for the vice speakers. https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_vice_talm%C3%A4n_i_Sveriges_riksdag

@rbbby Vice speakers should now also be in Wikidata.

vice speakers

all speakers

Quality is uncertain. In a perfect world Wikidata would have authorities against which we could check our data quality and report diffs as errors; see Wikidata:Mismatch_Finder. We have tried to start involving Riksarkivet SBL. Today we report mismatches through a form, but they lack an API and version management; see the status overview Source:SBL. I feel they lack the IT knowledge to build better solutions. Compare SKBL, which has an API and structured data.

My understanding is that in the Riksdagstrycket we just have "talman", and you need to find out who the (vice) speaker is. Let me know if you find more odd things. Also, if we could get your "list of name forms" with Wikidata Q-numbers, we could add them to Wikidata as aliases.