welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Formalize the API for the corpus #99

Closed by MansMeg 2 years ago

MansMeg commented 2 years ago

We need to set up a more long-term API for the corpus. By a data API, I mean the structure in which the data is stored and which we can build upon in research. The focus of this API is to clarify how the data is structured and to simplify use of the corpus.

Currently, I see the "API" as:

Current problems with the API:

Here are some suggestions on how to improve the API and its documentation (short-term):

@ninpnin and @rbbby: Any thoughts on this?

rbbby commented 2 years ago

Saving notes from the meeting for discussion on Monday. A number of small files are listed with their contents; these are then joined into an observation-level file.

MansMeg commented 2 years ago

I think the original issue already contains a lot of my thoughts and comments. Some additional comments/suggestions:

MansMeg commented 2 years ago

Re. @rbbby's comment above, I have the following comments:

Otherwise, I think this solves many parts of the API.

MansMeg commented 2 years ago
ninpnin commented 2 years ago
rbbby commented 2 years ago

Here are the new queries and CSV files generated from Wikidata. No cleaning is done at this stage other than renaming/dropping columns and formatting dates (applying assumptions and joining files is next in the pipeline). Please comment if you find something we should change with regard to the API discussion above.
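To make the preprocessing step above concrete, here is a minimal sketch of renaming/dropping columns and formatting dates, using only the Python standard library. The raw column names (`personLabel`, `personDescription`, `partyLabel`), the target names and the sample rows are all hypothetical; the actual queries and scripts in the repository may look quite different.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw query output; real column names and values may differ.
RAW = """person,personLabel,personDescription,start,end,partyLabel
Q0001,Example Person,placeholder description,1950-01-01T00:00:00Z,1958-12-31T00:00:00Z,Example Party
Q0002,Another Person,placeholder description,1961-01-10T00:00:00Z,,Other Party
"""

# Hypothetical mapping from raw column names to cleaned names.
RENAME = {"person": "wiki_id", "personLabel": "name", "partyLabel": "party"}
DROP = {"personDescription"}       # columns removed entirely
DATE_COLUMNS = {"start", "end"}    # columns reformatted to YYYY-MM-DD


def format_date(value):
    """Turn a timestamp such as 1950-01-01T00:00:00Z into 1950-01-01."""
    if not value:
        return ""  # missing dates stay empty; no assumptions applied here
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").date().isoformat()


def clean_rows(raw_text):
    """Rename and drop columns and format dates; nothing else is changed."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        cleaned = {}
        for key, value in row.items():
            if key in DROP:
                continue
            new_key = RENAME.get(key, key)
            if new_key in DATE_COLUMNS:
                value = format_date(value)
            cleaned[new_key] = value
        rows.append(cleaned)
    return rows


rows = clean_rows(RAW)
```

Missing end dates are deliberately left empty here, since filling them in would count as "applying assumptions", which is described as a later pipeline step.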

One note is that the time data in member.csv, minister.csv, speaker.csv and prime-minister.csv are each split into their own query and file. This is because member.csv and minister.csv contain the variables party and government respectively, and each variable is exclusive to its own dataset. The four files are therefore kept separate to keep them sparse, and speaker.csv is not joined with prime-minister.csv, as it would not be intuitive to join just those two and it could cause "mental overhead".
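As a rough illustration of the trade-off described above, the sketch below stacks four tiny stand-in files into one long table and counts the cells that end up empty because `party` and `government` each exist in only one file. The column layout (`wiki_id`, `start`, `end`), the `role` column and the sample rows are assumptions made for the example, and stacking is just one possible way of joining, not necessarily the one used in the pipeline.

```python
import csv
import io

# Hypothetical, heavily simplified stand-ins for the four files.
FILES = {
    "member": "wiki_id,start,end,party\n"
              "Q0001,1950-01-01,1958-12-31,Example Party\n",
    "minister": "wiki_id,start,end,government\n"
                "Q0001,1955-01-01,1957-12-31,Example Government\n",
    "speaker": "wiki_id,start,end\n"
               "Q0002,1961-01-10,1964-12-31\n",
    "prime-minister": "wiki_id,start,end\n"
                      "Q0003,1970-01-01,1975-12-31\n",
}


def stack(files):
    """Stack per-role files into one long table with a `role` column."""
    tables = {
        role: list(csv.DictReader(io.StringIO(text)))
        for role, text in files.items()
    }
    # Collect the union of all column names, in first-seen order.
    columns = ["role"]
    for rows in tables.values():
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
    # Columns that a file does not have are filled with empty strings.
    stacked = []
    for role, rows in tables.items():
        for row in rows:
            record = {column: "" for column in columns}
            record.update(row)
            record["role"] = role
            stacked.append(record)
    return columns, stacked


columns, stacked = stack(FILES)
empty_cells = sum(1 for record in stacked for column in columns if record[column] == "")
```

Every row in the stacked table carries at least one empty cell, whereas none of the separate files has any, which is one way to read the argument for keeping the files apart until the join step.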

https://github.com/welfare-state-analytics/riksdagen-corpus/tree/api/input/wikidata

rbbby commented 2 years ago