welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Improve the corpus API #329

Closed MansMeg closed 6 months ago

MansMeg commented 1 year ago

Before releasing version 1.0, we need to polish the corpus API to make it more intuitive and easier for users to work with the data. Currently, the repo has a lot of legacy content. Below is the structure we arrived at after discussions.

First, we split up the corpus into separate components. The components will be the document types:

and more internal (but still public)

Private repos:

In the data repositories, the folder structure will be the same:
/data/... -> the data
/test/... -> data integrity tests of the specific data
/test/data/... -> data used by data integrity tests
/quality_estimation/... -> scripts used for quality estimation
/quality_estimation/data/... -> data used for quality estimation
README.md
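Since every data repository is meant to share the folder structure listed above, a small check could report which expected entries are missing from a checkout. This is only a sketch: the helper name `missing_layout_entries` and the constants are hypothetical and not part of any existing library in this project. The paths are taken directly from the list above.

```python
from pathlib import Path

# Expected layout, copied from the folder structure proposed in this issue.
REQUIRED_DIRS = (
    "data",
    "test",
    "test/data",
    "quality_estimation",
    "quality_estimation/data",
)
REQUIRED_FILES = ("README.md",)


def missing_layout_entries(repo_root):
    """Return the expected paths that are absent under repo_root.

    Hypothetical helper: directories are reported with a trailing slash,
    files without one, in the order they are listed above.
    """
    root = Path(repo_root)
    missing = [d + "/" for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_layout_entries(".")` from the root of a data repository returns an empty list when the layout matches, which would make it easy to run as one of the data integrity tests.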

Reasoning: the R package and the Python library are separate repositories, and users of the corpus can be pointed to these repos for in-depth details on how to use the data. E.g. R uses vignettes, Python uses examples.

Many users might be interested in just one or two of the repos. At the same time the repos might be very large. So we should simplify specialized use.

Supplementary material should only be used temporarily and should not be part of the API. We should use different repos for different types of training data rather than putting everything in one repo.

Some additional thoughts:

salgo60 commented 1 year ago

Please let me know if we can test it and give feedback

salgo60 commented 1 year ago

Feedback from the Wikidata Telegram group about CC-0 licenses (Egon):

yes, DrugBank does this: ID mappings are CCZero, rest has different license: https://go.drugbank.com/data_packages and CC0 part here: https://go.drugbank.com/releases/latest#open-data

MansMeg commented 8 months ago

Also see #384

BobBorges commented 7 months ago

Should mpdb rather be politicians-db or similar due to the inclusion of ministers?

MansMeg commented 7 months ago

That's a good point. I'm not sure about the exact names. "Politicians" sounds too generic to me, since it's only members of parliament, guests speaking in parliament and ministers. Maybe we could check with political scientists?

BobBorges commented 7 months ago

too generic to me, since it's only members of parliament, guests speaking in parliament and ministers.

riksdag_politicians

MansMeg commented 7 months ago

Still kind of generic to me? Maybe send a quick email to Jan, Josefina and Cecilia to ask if they have a good suggestion?

ninpnin commented 7 months ago

Being more specific will make it less clear for the end users. Even mpdb was kinda obscure

ninpnin commented 6 months ago

@BobBorges is this done?

BobBorges commented 6 months ago

@MansMeg