Closed MansMeg closed 2 years ago
I would be interested in opinions on this from @TomasSkotare, @ninpnin, @rbbby, and maybe also from @ljo and @salgo60, who have also been part of this general discussion.
I met the Swedish Parliament people in 2019, and they have some technical debt.
In Wikidata we use ShEx for defining the schemas used (see the video on designing the one for the Swedish PM, EntitySchema:E134).
One good starting point is to look at how the Wikidata project WikiProject_British_Politicians has defined PMs. The person doing most of the work is Andrew Gray (Twitter: @generalising).
They have a sample query page where you can see what queries they can ask.
An example in Wikidata of a "politisk vilde" (an independent member without party affiliation) is Amineh Kakabaveh --> Wikidata Q3675519; she then gets parliamentary group "no value".
5 star data "same as"
Alias name: It depends on how you will use the data, but many people in the Swedish PM had "other names" to disambiguate common names; see SPA sj9PGLAlnmUAAAAAABgfbg
In Wikidata we have one name and many aliases for every object --> Q5553830, fi, en, de, json
@salgo60 How comprehensive is this coverage? We rely on those extra identifiers pretty heavily in the process, as the introductions are often just "Herr X i Y:" in a lot of cases.
@ninpnin my best guess is that we have it in Swedish Wikipedia articles but less often in Wikidata. See the video on how you could extract them using Python, Java, JavaScript... (all of Wikidata is CC-0)
My plan is that we should have everyone in Wikidata
@ninpnin if you could give me a list of candidates I could update Wikidata
In my video I mentioned Bionomia and how it works: it uses Wikidata to match specimen data with the biologist who found the species, i.e. they have the same challenge of finding which person is behind a signature, and the developer uses Wikidata aliases; see below.
How others work with Wikidata: a good presentation on how Bionomia syncs its data with Wikidata, ORCID (living persons) and GBIF (biodiversity data), using Wikidata as a good resource but at arm's length.
I agree with Magnus that the aliases are a powerful tool to capture one thing being mentioned in multiple ways. The beautiful thing with Wikidata is that it is open for anyone to improve. So you can enter aliases there as you find them. If you are unsure about who they mean in the source you can compile a list and we can investigate together. We could even create an unknown person named "X from Y" in Wikidata and merge later with any of the known MPs once we find out who it is.
I suggest using https://www.wikidata.org/wiki/Property:P2561 to enter the name as well (apart from adding it as an alias) and adding a reference to exactly where it shows up. That makes it easier for everyone to investigate at any time.
I try to think of names as a human friendly identifier. Unfortunately they are pretty bad for machines and big societies where collisions can easily occur when multiple people have exactly the same name.
This sounds promising. I tried to go through some of your links @salgo60 , although it was a little too much information for me to digest and make actionable.
So most computational researchers are familiar with csv-files, JSON and (some) XML. Hence I’m leaning toward storing persons in a csv as presented above. What do you think about that? To me that would be easy to sync with wikidata? Or do you have another suggestion? JSON?
Also, regarding aliases, I agree it makes sense to store these as well. Again, for us it makes sense to store them as a csv with a structure something like this (columns): person_id; alias_name; from_date; to_date
Any thoughts on this suggestion?
OT: this is what Swedish Datastory does with Wikidata PM data; see "The Longest Serving MP in Sweden" - tweet
They have done a lot of work updating Wikidata for people, documents from Riksdagen after 1971
Disambiguate names of Swedish PM people: As long as Wikidata is not perfect and does not have all "special names" as aliases, we can try the following
I checked the number of pictures of Swedish PMs in portrattarkiv.se (see Notebook): > 5000 pictures, so that can also help when searching in...
@ninpnin just so we are on the same page
in this picture it says "Magnuson i Sandviken"
Question for ninpnin: is this something this person was called in the Swedish PM, and should we add it as an "also known as" ("Tunnetaan myös nimellä" in the Finnish interface) in Swedish to his Wikidata item Q5971868?
Ps. a new video was published with nearly the same use case: Wikidata and OCCRP (WikidataCon 2021 recording)
Another tool to find Swedish PMs is Wiki template Mall:Ledamöter_av_Sveriges_riksdag
@MansMeg I looked into the spec "TEI Schema for Corpora of Parliamentary Proceedings", and I guess a good start is to "standardize" the objects they mention --> describe them as linked data with a persistent identifier and a landing page, and try to standardize this for the Digital Humanities working with parliamentary proceedings across the whole of Europe.
I spent some time last week making a nice table of the new ministers in the German Scholz cabinet in sv:Wikipedia ("Regeringen Scholz"), and I can see that it could be a challenge to model objects over time and across the whole of Europe. But by doing this we could start comparing countries in a much better way, so I guess it's the way forward... --> that means that you Digital Humanists need to start modelling things together, as we do in Wikidata to make the different language versions of Wikipedia scale better....
....
Low hanging fruits I guess are people in ministers like
@salgo60 Yes, I think, adding 'Magnusson i Sandviken' as his alias is appropriate, in the field you suggest.
I didn't manage to find him in our data as we only have data from 1920 at the moment, but here's an example.
People are introduced with that exact alias. In the transcribed speeches, too, you will see people referred as 'Magnusson i Sandviken' or 'Anderson i Rasjön'. So the also known as/tunnetaan myös nimellä field is 100% appropriate for this type of metadata.
Thanks @salgo60 . I interpret your response as this type of CSV file with the different objects coming from the Parla-Clarin format is a good one that easily can be combined with the wikidata structures.
What does @ninpnin think about having aliases as a csv/tabular file? I know we have discussed this before:
person_id; alias_name; from_date; to_date
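To make the proposed alias table concrete, here is a minimal sketch of how such a file could be read and used to resolve a name like "Andersson i Rasjön" on a given date. The sample rows, the person IDs, the ISO date format and the helper names (`load_aliases`, `resolve_alias`) are all assumptions for illustration, not the project's actual data or code.

```python
import csv
import io
from datetime import date
from typing import Optional

# Hypothetical sample rows in the proposed layout; IDs and dates are made up.
SAMPLE_ALIASES = """person_id;alias_name;from_date;to_date
person_a;Andersson i Rasjön;1921-01-10;1932-12-31
person_b;Andersson i Rasjön;1933-01-10;
person_c;Magnusson i Sandviken;1912-01-15;1920-12-31
"""


def parse_date(text: str) -> Optional[date]:
    """Return a date for an ISO string, or None for an empty cell."""
    return date.fromisoformat(text) if text else None


def load_aliases(handle):
    """Read the semicolon-separated alias table into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter=";"):
        row["from_date"] = parse_date(row["from_date"])
        row["to_date"] = parse_date(row["to_date"])
        rows.append(row)
    return rows


def resolve_alias(aliases, alias_name, on_date):
    """Return the person_ids that used alias_name on the given date."""
    matches = []
    for row in aliases:
        if row["alias_name"] != alias_name:
            continue
        # An empty from_date or to_date is treated as an open-ended interval.
        starts_ok = row["from_date"] is None or row["from_date"] <= on_date
        ends_ok = row["to_date"] is None or on_date <= row["to_date"]
        if starts_ok and ends_ok:
            matches.append(row["person_id"])
    return matches


aliases = load_aliases(io.StringIO(SAMPLE_ALIASES))
print(resolve_alias(aliases, "Andersson i Rasjön", date(1925, 3, 1)))
```

Keeping from_date and to_date on the alias row is what lets the same name form point to different people in different years.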
Don't hesitate to call me at 073-5152802; this is very complex, but I also think it is game changing.... I am also on Telegram as salgo60
I didn't manage to find him in our data as we only have data from 1920 at the moment, but here's an example.
Thanks. I plan to call Lotta Åberg Brorsson at the Swedish Riksdag library when they open (video with her from 2018). I guess she is more skilled on those names.... I also asked a person, FBQ, on sv:Wikipedia (link), and we found the scanned books "PortraitCatalog:Tvåkammar-riksdagen 1867-" with 4500 people in the second chamber, where everyone has a "special name". FBQ and I thought it was a little bit odd.... but as you said, add them to Wikidata as aliases --> that will help when doing NER on names
Suggested work process as @dpriskorn suggested maybe one approach can be
As mentioned, I found e.g. Lista över ledamöter av Sveriges riksdags andra kammare 1914, listing who was active in the second chamber in 1914, and as you can see they have the same "problem"
See loooong video I did about this ;-) in the video I play with browser plug-in Wikidata:Entity_Explosion
person_id; alias_name; from_date; to_date
It depends on what scope you have. My understanding from the Bionomia developer (listen at 48:28, when I asked him about his experience with matching) --> it is normally a "High Chaparral" (a bit of a wild west) doing NER on notes about scientific findings....
ps. I also asked on the sv:Wikipedia discussion page about Riksdagens name forms "Bybrunnen#I_Riksdagen_kallad_Pettersson_i_Bjälbo,_Petersson_i_Röstånga"
The answer to my question was that it looks like it was very common in the old days in the Swedish Government for the "talman" to use the name form above, and it can still be used today, but mostly in a humorous way
Today's example: SD has 2 people with the "same" name, Jonas Andersson
FYI I created a Wikidata request for a new property to store the name in the Swedish PM link
This process can take some weeks...
If you support this please create a Wikipedia account (sv) (fi) and add a positive vote
syntax for positive vote is
{{S}} - ~~~~
Update: it looks like we can support it in another way, as a user suggested
We created a sv:Wikipedia article about this name form in the Swedish PM, see I_riksdagen_kallad (update: a person has now deleted the table, so you have it on this page)
the last part is dynamic and generated from Wikidata i.e. it is the status what we have done so far....
(update: I added a reference to this project but it was deleted ;-) see early version)
Here are some Python wrappers we might want to use for querying Wikidata
@ninpnin don't hesitate to call me if you have questions, +46-735152802, or better, screen sharing or Telegram (salgo60)
==> code
# pip install sparqlwrapper
# https://rdflib.github.io/sparqlwrapper/
import sys
from SPARQLWrapper import SPARQLWrapper, JSON
endpoint_url = "https://query.wikidata.org/sparql"
query = """#title: Ledamöter med "samma I Riksdagen kallad"
SELECT DISTINCT ?nameUsedinSwedishPM1
(SAMPLE(?svWikipedia1) AS ?svWikipedia1) (SAMPLE(?svWikipedia2) AS ?svWikipedia2)
(SAMPLE(?person1) AS ?person1)
(SAMPLE(?person2) AS ?person2)
WHERE {
?person1 p:P2561 ?nameSwedishPMp1.
?person2 p:P2561 ?nameSwedishPMp2.
{
?nameSwedishPMp1 ps:P2561 ?nameUsedinSwedishPM1;
pq:P3831 wd:Q110382440.
}
{
?nameSwedishPMp2 ps:P2561 ?nameUsedinSwedishPM2;
pq:P3831 wd:Q110382440.
}
FILTER((?nameUsedinSwedishPM1 = ?nameUsedinSwedishPM2) && (?person1 != ?person2)
&& (str(?person1) > str(?person2))
)
SERVICE wikibase:label { bd:serviceParam wikibase:language "sv,en". }
OPTIONAL {
?svWikipedia1 schema:about ?person1;
schema:inLanguage "sv";
schema:isPartOf <https://sv.wikipedia.org/>.
}
OPTIONAL {
?svWikipedia2 schema:about ?person2;
schema:inLanguage "sv";
schema:isPartOf <https://sv.wikipedia.org/>.
}
}
GROUP BY ?nameUsedinSwedishPM1 ?person1 ?person1Label
ORDER BY (?nameUsedinSwedishPM1)"""


def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    # convert() parses the JSON response into a Python dict
    return sparql.query().convert()


results = get_results(endpoint_url, query)
for result in results["results"]["bindings"]:
    print(result)
I use pandas a lot and have changed the code to return the data as a pandas data frame; see the Notebook example function
import json
import sys

import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON


def get_sparql_dataframe(endpoint_url, query):
    """
    Helper function to convert SPARQL results into a Pandas data frame.
    """
    user_agent = "salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query()
    # Parse the raw HTTP response body as JSON
    processed_results = json.load(result.response)
    cols = processed_results['head']['vars']
    out = []
    for row in processed_results['results']['bindings']:
        item = []
        for c in cols:
            # Variables left unbound (e.g. by OPTIONAL) become None
            item.append(row.get(c, {}).get('value'))
        out.append(item)
    return pd.DataFrame(out, columns=cols)
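To see what the flattening step does without calling the endpoint, here is a small offline sketch of the same loop applied to a hand-written response in the SPARQL JSON results layout. The sample response and the helper name `bindings_to_rows` are assumptions for illustration; the URI is a placeholder, not a real Wikidata entity.

```python
# Hypothetical response in the SPARQL JSON results layout (head.vars + results.bindings).
SAMPLE_RESPONSE = {
    "head": {"vars": ["nameUsedinSwedishPM1", "person1", "svWikipedia1"]},
    "results": {
        "bindings": [
            {
                "nameUsedinSwedishPM1": {"type": "literal", "value": "Andersson i Rasjön"},
                "person1": {"type": "uri", "value": "http://example.org/person_a"},
                # svWikipedia1 is missing here, as happens when an OPTIONAL does not match
            }
        ]
    },
}


def bindings_to_rows(response):
    """Return (columns, rows); unbound variables are filled with None."""
    cols = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        rows.append([binding.get(c, {}).get("value") for c in cols])
    return cols, rows


cols, rows = bindings_to_rows(SAMPLE_RESPONSE)
print(cols)
print(rows)
```

The `.get(c, {})` fallback is what keeps rows aligned with the column list when a person has no sv:Wikipedia article.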
@ninpnin let me know if I should prioritize which people I curate in Wikidata with "I Riksdagen kallad"; right now I just take them randomly. Maybe it makes more sense for you that we take people active in a specific year?
In sv:Wikipedia we have some lists (quality unknown as always with Wikipedia)
Here are some Python wrappers we might want to use for querying Wikidata
* QWikidata https://github.com/kensho-technologies/qwikidata
* Wikirepo https://github.com/andrewtavis/wikirepo
* Wikidata https://github.com/dahlia/wikidata
I have not tested these 3 but the best library I found until now (that is professionally maintained and covers all of Wikibase) is https://github.com/LeMyst/WikibaseIntegrator. It is very powerful and v0.12 has very nice API:s IMO. There are notebooks that showcase how to use it.
(I have contributed code and code review to the project)
I am currently working on individual level data for this issue. For some people we have multiple birth and death dates, which I thought would be of interest. I looked through some references, and often it is these that have conflicting information. This is generally not a problem, as researchers will at most use this information on a year level, in which case there are only 2 conflicts. The problematic ones are the birth date for Q18202339 (differs by 11 years) and the death date for Q5718571 (1 of 2 dates has its reference missing). The complete list of conflicting information is given below in case it is of further interest:
Multiple birth dates: Wikidata: ['Q18202339', 'Q4947860', 'Q5613770', 'Q5630560', 'Q5782765', 'Q5784568', 'Q5820037', 'Q5943976', 'Q5968645', 'Q6078640', 'Q6228020']
Multiple death dates: Wikidata: ['Q5556026', 'Q5563972', 'Q5718571', 'Q5799761', 'Q5937709', 'Q6022275', 'Q6042602', 'Q6228020', 'Q728197']
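The check described above (several distinct dates per person, and whether they still disagree at year level) can be sketched roughly as follows. The input pairs, the person IDs and the function name `find_conflicts` are hypothetical; real data would come from the Wikidata export.

```python
from collections import defaultdict

# Hypothetical (person_id, birth_date) pairs; values are made up for illustration.
SAMPLE_BIRTHS = [
    ("person_a", "1850-03-01"),
    ("person_a", "1850-03-11"),  # same year, different day
    ("person_b", "1861-07-04"),
    ("person_b", "1872-07-04"),  # differs by 11 years
    ("person_c", "1880-01-01"),
]


def find_conflicts(pairs):
    """Return (day_level, year_level) lists of persons with conflicting dates."""
    dates_by_person = defaultdict(set)
    for person_id, iso_date in pairs:
        dates_by_person[person_id].add(iso_date)

    day_level, year_level = [], []
    for person_id, dates in sorted(dates_by_person.items()):
        if len(dates) < 2:
            continue  # a single date cannot conflict
        day_level.append(person_id)
        # Compare only the year part (first four characters of the ISO date)
        if len({d[:4] for d in dates}) > 1:
            year_level.append(person_id)
    return day_level, year_level


day_level, year_level = find_conflicts(SAMPLE_BIRTHS)
print(day_level)
print(year_level)
```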
Also found these Wikidata objects with missing start and end dates. That is, dates for starting and ending the property position held (P39), taking any of the values: Q10655178, Q33071890, Q81531912 (member of enkammarriksdagen, första kammaren and andra kammaren). Currently active members of parliament (which do not have end dates yet) are not included in these lists.
Missing start: ['Q5819783', 'Q4983135', 'Q98271639', 'Q4976825', 'Q6210385', 'Q4934552', 'Q19976148', 'Q4957371', 'Q5950466', 'Q110279970', 'Q5547315', 'Q5553916', 'Q4963592', 'Q4970175', 'Q98538839', 'Q5599215']
Missing end: ['Q98556536', 'Q98539283', 'Q98937482', 'Q98937434', 'Q98317372', 'Q97971262', 'Q97971276', 'Q98668554', 'Q6196285', 'Q16084072', 'Q5938531', 'Q5577470', 'Q98556565', 'Q98668809', 'Q5547542', 'Q5621600']
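One way such lists could be produced is a query for position held (P39) statements that lack a start time (P580) or end time (P582) qualifier. The sketch below only builds the query string; the function name is hypothetical, it relies on the prefixes the Wikidata Query Service predefines (wd:, p:, ps:, pq:), and it does not filter out currently active members, so it is not necessarily how the lists above were made.

```python
def build_missing_qualifier_query(position_qids, qualifier_pid):
    """Build a SPARQL query for P39 statements lacking the given qualifier.

    qualifier_pid is e.g. "P580" (start time) or "P582" (end time).
    """
    values = " ".join(f"wd:{qid}" for qid in position_qids)
    return f"""SELECT DISTINCT ?person WHERE {{
  VALUES ?position {{ {values} }}
  ?person p:P39 ?statement .
  ?statement ps:P39 ?position .
  FILTER NOT EXISTS {{ ?statement pq:{qualifier_pid} ?value . }}
}}"""


# The three position items mentioned in the comment above
POSITIONS = ["Q10655178", "Q33071890", "Q81531912"]
missing_start_query = build_missing_qualifier_query(POSITIONS, "P580")
print(missing_start_query)
```

The resulting string could be passed to a helper such as get_sparql_dataframe from earlier in the thread.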
@rbbby
Thanks I have started a slow cleaning of Wikidata and adding "Tvåkammar-riksdagen 1867-1970" as a source plus adding Iriksdagenkallad see list
Question 1 what years are most important for you? I guess 1867–2021
FYI: there is also a discussion on how to redesign sv:Wikipedia, see WD-mall_riksdagsledamot
Question 2: can the result from this project, "riksdagen-corpus", be used to link from sv:Wikipedia? Is it described? It would be nice to have "landing pages" per
Data issues Wikidata reported by I try to walk through the list see Feedback rbby
/Magnus +46-735152802
OT: Good article about CIA World Factbook and the quality by a person Tony Bowden who tries to update Wikidata "The CIA lost track of who runs the UK, so I picked up the slack" - Tony Bowden about his efforts to build an open source dataset of world leaders inside of Wikidata - his Github @tmtmtmtm
@salgo60 great stuff thanks! The list will be of great use.
To answer some of your questions:
At the moment we are working with data from 1920, but are planning to extend to 1867. So the most important years are in that priority order.
Identifiers and links to Wikidata are being worked on. We have tested it on a few years (1920, 50, 70), and more work will be done on it in the coming days, link: https://github.com/welfare-state-analytics/riksdagen-corpus/tree/wikidata/corpus
Btw do you know how to query for the list below? Speakers of riksdagen has for example position held Q1850749. But similar positions are missing for the vice speakers. https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_vice_talm%C3%A4n_i_Sveriges_riksdag
Don't hesitate to call me, and we can share screens and talk about what you want to do: 0735152802
But similar positions are missing for the vice speakers.
Then I suggest we create one... do we have good sources of who had those positions?
Will do and sounds good! Wikipedia seems to have references to sources for many of the individuals For example: https://sok.riksarkivet.se/Sbl/Presentation.aspx?id=7790 https://portrattarkiv.se/details/sj9PGLAlnmUAAAAAABfNvw
I found sources in statskalendern after a quick look too, but seems that its missing for some early years. We have an OCR:d version of the relevant pages here: https://github.com/welfare-state-analytics/riksdagen-ocr/tree/main/statscalender
Sounds interesting I need to learn more about what you have
I saw that the file tatorter.csv has tätortskod --> easy to match with the same code in Wikidata
In Wikidata that is Property:P775. Abbekås --> tätortskod T3300 --> haswbstatement:P775=T3300 --> Wikidata Q2199524 --> sv:Wikipedia Abbekås
or use the hub tool --> /P775:T3300?lang=sv
Property:P625 is Wikidata property for coordinate -->
/P775:T3300?property=P625 --> redirect Open Street Map
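The lookup strings above follow a simple pattern, so they could be generated for every row in tatorter.csv. This is a small sketch with hypothetical helper names; it only produces the search keyword and the relative hub path exactly as written above, and the host the path is appended to is left out since it is not given here.

```python
from urllib.parse import urlencode


def search_keyword(prop, value):
    """Build the search keyword for items having the statement prop=value."""
    return f"haswbstatement:{prop}={value}"


def hub_path(prop, value, **params):
    """Build the relative hub path, with optional query parameters."""
    path = f"/{prop}:{value}"
    return f"{path}?{urlencode(params)}" if params else path


print(search_keyword("P775", "T3300"))
print(hub_path("P775", "T3300", lang="sv"))
print(hub_path("P775", "T3300", property="P625"))
```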
Wikipedia seems to have references to sources for many of the individuals For example: https://sok.riksarkivet.se/Sbl/Presentation.aspx?id=7790 https://portrattarkiv.se/details/sj9PGLAlnmUAAAAAABfNvw
We essentially OCR:d the pages of statskalendern where information about riksdagen was present. It's about 10 pages each year. Searchable pdfs are available from other sources but can be a bit difficult to work with programmatically.
Very cool with tätorter! Not sure if we use the file for anything atm but such connections will likely be very interesting for some researchers in the future.
Speakers of riksdagen has for example position held Q1850749. But similar positions are missing for the vice speakers. https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_vice_talm%C3%A4n_i_Sveriges_riksdag
I created this page Speaker of Swedish PM but as I said call me +46-735152802 so we understand what you want to do. I leave Stockholm on sunday and will be away and have less good internet...
Speakers of riksdagen has for example position held Q1850749. But similar positions are missing for the vice speakers.
@rbbby I used OpenRefine, did some reconciliation and uploaded vice speakers 1867–1920 to Wikidata, see video (needs some QA and sources)
Todo
- Identifiers and links to Wikidata are being worked on. We have tested it on a few years (1920, 50, 70), and more work will be done on it in the coming days, link: https://github.com/welfare-state-analytics/riksdagen-corpus/tree/wikidata/corpus
:rocket: :rocket: impressive work!!! let us know how we can help you.... @Ainali and some other people have done a lot of work in Wikidata related to the Swedish PM members/documents but this is a new very interesting level!!!!
In Wikidata we also have a project for lexicographical data --> we store a lexeme like foliehatt = Lexeme:L54865 and have usage examples, who uses the word "foliehatt" and which party thinks other parties have "foliehattar" ;-) It would be very interesting if we could easily reference a word usage in your corpus; see e.g. Lexeme:L54865#P5831, where I referenced data.riksdagen.se/dokument/H80939, but it would be much more interesting to use your corpus and its unique identifiers and point to a specific location in the corpus.... Starting to gather when foliehatt was first used in the Swedish Parliament would also be interesting.... @dpriskorn has written a tool, dpriskorn/LexUtils, to easily find usage examples; maybe that tool could use your corpus?
As Wikidata sometimes has more Swedish PM related information than sv:Wikipedia it can be good to activate a gadget "Lägg till Faktamall biografi WD i biografier" that adds a Template with Wikidata info see video
Before:
After:
Add the following line to your common.js
mw.loader.load("//www.wikidata.org/w/index.php?title=User:Yair rand/WikidataInfo.js&action=raw&ctype=text/javascript");
will display the Wikidata Qnumber at the top of the Wikipedia article on sv:Wikipedia see how I did it sv.wikipedia.org/wiki/Användare:Salgo60/common.js
Wikidata also has this possibility to add new tools; see my WD common.js and more tools
A weekly report with new properties, status of development etc. is reported
In the last Wikidata Status a new event was announced. Data Reuse Days will take place on March 14-24, highlighting applications and tools using Wikidata's data. You can already propose a session.
FYI a tool is developed for handling mismatches between Wikidata and external sources. This tool will be open and can also be used by other communities
Wikidata:Lexicographical data
In Wikidata we also have a project for lexicographical data --> we store a lexeme like foliehatt = Lexeme:L54865 and have usage examples. It would be very interesting if we could easily reference a word usage in your corpus; see e.g. Lexeme:L54865#P5831, where I referenced data.riksdagen.se/dokument/H80939, but it would be much more interesting to use your corpus and its unique identifiers and point to a specific location in the corpus.... Starting to gather when foliehatt was first used in the Swedish Parliament would also be interesting.... @dpriskorn has written a tool, dpriskorn/LexUtils, to easily find usage examples; maybe that tool could use your corpus?
Thanks for reminding me about this. I agree, this would be a unique and interesting source of examples. I opened up a new issue to track that idea in LexUtils.
Btw do you know how to query for the list below? Speakers of riksdagen has for example position held Q1850749. But similar positions are missing for the vice speakers. https://sv.wikipedia.org/wiki/Lista_%C3%B6ver_vice_talm%C3%A4n_i_Sveriges_riksdag
@rbbby Now also vice speakers should be in Wikidata
quality unsure - in a perfect world Wikidata had authorities we could check our data quality with and error report diffs see Wikidata:Mismatch_Finder. We have tried to start involve Riksarkivet SBL. Today we error report mismatches in a form but they lack API and version management see status overview Source:SBL I feel they lack IT knowledge to build better solutions?!?!? compare SKBL with API and structured data...
My understanding is that in the Riksdagstrycket we have "talman" and you need to find who the (vice) speaker is... let me know if you find more odd things. Also, if we could get your "list of name forms" with Wikidata Qnumbers --> we could add them to Wikidata as aliases....
After 0.3 there has been discussion on finding a way to store and handle the member of parliament, ministers etc in a structure that is easy to handle, but also flexible. Also so we can get the data on parties etc for ministers etc.
Below is how the Riksdagens Öppna data suggests that the persons are stored. To me, though, this looks more like a way to store the parliamentarians for a certain mandate period, and is not necessarily sufficient for us. I think it is very similar to how we now store this data, and I also think these tables are a good starting point for us.
Some problems with the current format (and the Riksdagen open data) are:
In the long term, we would like to handle this with separate tables (such as mop_names.csv, mop_party.csv etc). But for now we just want to be able to connect between the different files we now have and know that this is the same persons. We also want to connect this to the wiki data identifiers.
Suggestion
persons_parliament.csv
with one row per individual person who has ever spoken in the parliament. The only things I can think of in the metadata for persons that cannot change over time are: unique identifiers, place of birth, birth date, place of death, and date of death. Hence I think the csv-file should have the following columns: person_id
in the three files we now have. Then we could, through these IDs, add metadata on party and gender in the other files for now (between the files), and then later on we could start to formalize the storage of the other variables.
The Riksdagen Open Data suggested format: