Formalize the API for the corpus

MansMeg commented 2 years ago

We need to set up a more long-term API for the corpus. By a data API, I mean the structure in which the data is stored and that we can build upon in research. The focus of this API is to clarify how the data is structured and to simplify use of the corpus.

Currently, I see the "API" as:

Annual protocol files in the corpus/ folder
List of MPs corpus/members_of_parliament.csv
List of ministers corpus/ministers.csv
List of speakers of the house corpus/talman.csv

Current problems with the API:

Many files in the corpus folder are, I think, not files the end user is interested in using. We should move this type of "helper" data out of the API so there is no confusion.
We now have one folder per year and the MOP file in the same folder. This is confusing, especially since there are so many folders. We also currently refer to the MOP file as the corpus. Sinikallio et al. (2021) make a clearer separation, which I think is good.
It is hard to find documentation right next to the data the end user will use (what does this file contain?).

Here are some suggestions on how to improve the API and its documentation (short-term):

[x] We decide on an "API-folder" that is the folder end users will go into (now the corpus folder). Long-term, I think this folder will be the one we use as the repository from version 1.0. Code, helper data, etc. are things users don't want to download when they just want the corpus. Even though they are a little old-fashioned, I think XML and CSV are probably the two formats we would like to use for files in this API.
[x] In the "API-folder" we have subfolders for the different objects that will be part of the corpus: protocols, persons, districts, motions, propositions, etc. (but for now we settle for persons and protocols). See Sinikallio et al. (2021).
[x] We go through the current data and discuss which parts are also useful for end users (such as districts and maybe name in parliament) and which parts should be included in the API.
[x] We put a README in each folder that points to the github wiki for documentation (or just put the documentation in the README)
[x] We separate "helper data" from the formal "API-folder" into another folder.

@ninpnin and @rbbby : Any thoughts on this?

rbbby commented 2 years ago

Saving notes from the meeting for discussion on Monday. A number of small files are listed with their columns; these are then joined into an observation-level file.

[x] individual.csv wiki_id, born, dead, gender
[x] party.csv wiki_id, party, start, end
[x] name.csv wiki_id, name
[x] twitter.csv wiki_id, handle
[x] government.csv wiki_id, government, start, end
[x] minister.csv wiki_id, role, start, end
[x] talman.csv wiki_id, role, start, end
[x] member.csv wiki_id, role, start, end, district, party
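
As a rough illustration of how the small files above could be joined into an observation-level file, here is a minimal sketch using only the Python standard library. The sample rows, the helper names `read_rows` and `join_on_wiki_id`, and the placeholder values are all hypothetical; only the column names come from the list above.

```python
import csv
import io

# Hypothetical sample rows; only the column names follow the list above.
INDIVIDUAL_CSV = """wiki_id,born,dead,gender
Q1,1901-02-03,1980-05-06,woman
Q2,1920-07-08,,man
"""

MEMBER_CSV = """wiki_id,role,start,end,district,party
Q1,member,1950-01-01,1953-12-31,Stockholm,A
Q1,member,1954-01-01,1957-12-31,Stockholm,A
Q2,member,1960-01-01,1963-12-31,Uppsala,B
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_wiki_id(observations, individuals):
    """Attach person-level columns to each observation-level row."""
    by_id = {row["wiki_id"]: row for row in individuals}
    joined = []
    for obs in observations:
        person = by_id.get(obs["wiki_id"], {})
        # Observation columns win if a column name appears in both files.
        joined.append({**person, **obs})
    return joined


rows = join_on_wiki_id(read_rows(MEMBER_CSV), read_rows(INDIVIDUAL_CSV))
```

Each mandate period stays one row, and the person-level columns are repeated only in the joined output, not in the stored files.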

MansMeg commented 2 years ago

I think the original issue already contains a lot of my thoughts and comments. Some additional comments/suggestions:

[x] The folder "corpus" is the API
[x] The folder names should describe the content of the folder (i.e. "wikidata" -> "metadata", all years with protocols -> "protocols")
[x] We should only have CSV-files (in a good normal form, 5?) and TEI XML files for textual data.
[x] From 0.4 onward we should write out some general design decisions in the corpus README.md (or the wiki), such as: only CSV and XML, no redundant data in the corpus folder (API), we use our own keys in the corpus, how we do semantic versioning, etc.
[x] I think scaffolding data and files should be moved to a folder called something like "temporary_data". This data is important and might later be moved into the corpus folder, but that should be done in an orderly fashion, with an issue for adding it and maybe a PR. It should be possible to use the data without using temporary data.
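
The "only CSV and XML" rule could be checked mechanically. The sketch below is one possible way to do that, not an existing script in the repository: the function name `disallowed_files`, the allowance for `README.md`, and the example paths are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumed rule set: CSV and XML data files, plus a README in each folder.
ALLOWED_SUFFIXES = {".csv", ".xml"}
ALLOWED_NAMES = {"README.md"}


def disallowed_files(paths):
    """Return the paths that break the 'only CSV and XML' rule."""
    bad = []
    for p in paths:
        path = PurePosixPath(p)
        if path.name in ALLOWED_NAMES:
            continue
        if path.suffix.lower() not in ALLOWED_SUFFIXES:
            bad.append(p)
    return bad


# Hypothetical file listing for the API folder.
example_paths = [
    "corpus/metadata/party_affiliations.csv",
    "corpus/protocols/README.md",
    "corpus/protocols/example.xml",
    "corpus/metadata/notes.json",
]
violations = disallowed_files(example_paths)
```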

MansMeg commented 2 years ago

Re. @rbbby's comment above, I have the following comments:

[x] We should use our own ID, so person_id instead of wiki_id. I also think it is good that the column names are informative.
[x] I think some covariates do not change over time, such as born, dead, gender (if we use wikidata gender), so these can be part of a persons.csv or equivalent.

Otherwise, I think this solves many parts of the API.
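
To make the two points above concrete, here is a minimal sketch of replacing wiki_id with a corpus-owned person_id and collecting the time-invariant covariates in one persons table. The ID format `person_000001`, the function name `assign_person_ids`, and the sample rows are hypothetical; the thread does not specify how the IDs should be generated.

```python
def assign_person_ids(wiki_ids):
    """Map each wiki_id to a corpus-owned person_id (hypothetical format)."""
    unique = sorted(set(wiki_ids))
    return {wid: f"person_{i:06d}" for i, wid in enumerate(unique, start=1)}


# Hypothetical rows with the time-invariant covariates named above.
individuals = [
    {"wiki_id": "Q2", "born": "1920-07-08", "dead": "", "gender": "man"},
    {"wiki_id": "Q1", "born": "1901-02-03", "dead": "1980-05-06", "gender": "woman"},
]

id_map = assign_person_ids(row["wiki_id"] for row in individuals)

# persons table: one row per person, keyed by person_id instead of wiki_id.
persons = [
    {
        "person_id": id_map[row["wiki_id"]],
        "born": row["born"],
        "dead": row["dead"],
        "gender": row["gender"],
    }
    for row in individuals
]
```

Numbering by sorted position is only for illustration: the numbers shift when a new person sorts before existing ones, so a real scheme would need IDs that stay stable as people are added.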

MansMeg commented 2 years ago

[x] Rename the csv files to minimize mental overhead (party.csv -> party_affiliations.csv)
[x] We should separate different "variables" into separate tables (speaker_data.csv -> speakers.csv + twitter_handles.csv etc). Minimize sparsity. If something leads to sparsity it should be its own table.
[x] These principles should be summarized in a README
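
The "separate tables to minimize sparsity" principle can be sketched as follows. The function `split_sparse_column`, the column names `twitter` and `handle`, and the sample rows are hypothetical; the actual contents of speaker_data.csv are not shown in this thread.

```python
def split_sparse_column(rows, key, column, new_name):
    """Split one sparse column out of a table.

    Returns (dense, extra): `dense` is the table without `column`,
    `extra` holds one row per non-empty value, keyed by `key`.
    """
    dense = [{k: v for k, v in row.items() if k != column} for row in rows]
    extra = [{key: row[key], new_name: row[column]} for row in rows if row[column]]
    return dense, extra


# Hypothetical wide table where most rows have no Twitter handle.
speaker_data = [
    {"person_id": "p1", "name": "Anna Andersson", "twitter": "anna_a"},
    {"person_id": "p2", "name": "Bo Berg", "twitter": ""},
    {"person_id": "p3", "name": "Cia Carlsson", "twitter": ""},
]

speakers, twitter_handles = split_sparse_column(
    speaker_data, key="person_id", column="twitter", new_name="handle"
)
```

After the split, speakers has no empty cells and twitter_handles contains only the people who actually have a handle.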

ninpnin commented 2 years ago

[x] Include data source ('wikidata'/'statskalender'/'personregister')

rbbby commented 2 years ago

Here are the new queries and CSV files generated from wikidata. No cleaning is done at this stage other than renaming/dropping columns and formatting dates (applying assumptions and joining files is next in the pipeline). Please comment if you find something we should change with regard to the API discussion above.

One note: the time data in member.csv, minister.csv, speaker.csv and prime-minister.csv are each produced by their own query and stored in their own file. This is because member.csv and minister.csv have the variables party and government respectively, and each variable is exclusive to its dataset. The four files are therefore kept separate to avoid sparsity, and speaker.csv is not joined with prime-minister.csv, as it would not be intuitive to join just those two and could cause "mental overhead".

https://github.com/welfare-state-analytics/riksdagen-corpus/tree/api/input/wikidata

rbbby commented 2 years ago

[x] Make order in files deterministic (in scripts/wikidata_query)
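
One simple way to get a deterministic file order is to sort the rows on all columns before writing. The sketch below assumes that approach; it is not the code in scripts/wikidata_query, and the function name `write_sorted_csv` and the sample rows are hypothetical.

```python
import csv
import io


def write_sorted_csv(rows, fieldnames):
    """Serialize rows to CSV text, sorted on all columns in header order."""
    ordered = sorted(rows, key=lambda row: tuple(row[name] for name in fieldnames))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()


# The same hypothetical rows, as two queries might return them.
first_run = [
    {"wiki_id": "Q2", "handle": "b"},
    {"wiki_id": "Q1", "handle": "a"},
]
second_run = list(reversed(first_run))

output_a = write_sorted_csv(first_run, ["wiki_id", "handle"])
output_b = write_sorted_csv(second_run, ["wiki_id", "handle"])
```

Sorting on every column, not only the ID, keeps the order fixed even when several rows share the same wiki_id, so re-running the query produces no spurious diffs.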

welfare-state-analytics / riksdagen-corpus

Formalize the API for the corpus #99