Improve the corpus API

MansMeg commented 1 year ago

Before releasing version 1.0, we need to polish the corpus API to make it more intuitive and make the data easier for users to work with. Currently, the repo has a lot of legacy content. Below is the structure we arrived at after discussions.

First, we split up the corpus into separate components. The components will be the document types:

riksdag_records
riksdag_records-alto
riksdag_records-pdf
riksdag_motions
riksdag_motions-alto
riksdag_motions-pdf ...
riksdag_mpdb (mp database)
rpackage
pylib

and more internal (but still public)

src
papers

Private repos:

supplementary-material
supplementary-material-xyz (maybe just use "sm" as an abbreviation for supplementary material?)

In the data repositories, the folder structure will be the same:

/data/... -> the data
/test/... -> data integrity tests of the specific data
/test/data/... -> data used by data integrity tests
/quality_estimation/... -> scripts used for quality estimation
/quality_estimation/data/... -> data used for quality estimation
README.md
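A shared layout like this could be checked automatically in each data repository. The following is only a minimal sketch of that idea: `EXPECTED_LAYOUT` and `missing_entries` are hypothetical names, not part of any existing tooling, and the entries are taken directly from the structure listed above.

```python
from pathlib import Path

# Expected layout, copied from the proposal above.
# Entries ending in "/" are directories, all others are files.
EXPECTED_LAYOUT = [
    "data/",
    "test/",
    "test/data/",
    "quality_estimation/",
    "quality_estimation/data/",
    "README.md",
]


def missing_entries(repo_root):
    """Return the expected entries that are absent under repo_root."""
    root = Path(repo_root)
    missing = []
    for entry in EXPECTED_LAYOUT:
        path = root / entry
        # A trailing slash marks a directory; anything else must be a file.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Calling `missing_entries(".")` from the root of a data repository would return an empty list when the layout is complete.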

Reasoning:

The R package and the Python library are separate repositories, and documentation on how to use the corpus could eventually point to these repos for in-depth details on how to use the data. E.g. R uses vignettes, Python uses examples.

Many users might be interested in just one or two of the repos. At the same time the repos might be very large. So we should simplify specialized use.
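To illustrate what specialized use might look like, here is a rough sketch of picking out only the components for one document type. It assumes the component names proposed above are kept as-is and that the `-alto` and `-pdf` suffixes mark format variants of a base component. `select_components` is a hypothetical helper, not an existing function in the R package or Python library.

```python
# Component names as proposed in this issue; the final names are still under discussion.
COMPONENTS = [
    "riksdag_records",
    "riksdag_records-alto",
    "riksdag_records-pdf",
    "riksdag_motions",
    "riksdag_motions-alto",
    "riksdag_motions-pdf",
    "riksdag_mpdb",
]


def select_components(doc_type, formats=("",)):
    """Pick the components for one document type.

    formats holds suffixes: "" for the base component,
    "alto" or "pdf" for the assumed format variants.
    """
    wanted = {
        f"riksdag_{doc_type}" + (f"-{fmt}" if fmt else "")
        for fmt in formats
    }
    # Preserve the order in which the components are listed.
    return [name for name in COMPONENTS if name in wanted]
```

For example, `select_components("motions", ("", "pdf"))` would select only the base motions component and its PDF variant, leaving the other repos untouched.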

Supplementary material should only be used temporarily and should not be part of the API. We should use different repos for different types of training data rather than putting everything in one repo.

Some additional thoughts:

[x] Look up the FAIR dataset information and see if we need to add anything to the README. Also, there are some structures that make it easier for robots to index our dataset.

salgo60 commented 1 year ago

Please let me know if we can test it and give feedback

[ ] We would like to see a copyright statement. If Wikidata is to use the data, we need CC-0; see Wikidata:licensing (not sure we need all the data). I think we should link to most of your data (see the POC done in 2022, where we add an extra tab), but I think some data would gain from a CC-0 license so that we can reuse it.
- I am now focusing on IFK Göteborg player data (see repo salgo60/ifkdb ;-)) and suggested today that they could perhaps release a smaller subset of CC-0 data as a dataset, like we could have for Swedish PM people...

salgo60 commented 1 year ago

Feedback from the Wikidata Telegram group about CC-0 licenses, from Egon:

yes, DrugBank does this: ID mappings are CCZero, rest has different license: https://go.drugbank.com/data_packages and CC0 part here: https://go.drugbank.com/releases/latest#open-data

MansMeg commented 8 months ago

Also see #384

BobBorges commented 7 months ago

Should mpdb rather be politicians-db or similar due to the inclusion of ministers?

MansMeg commented 7 months ago

That's a good point. I'm not sure about the exact names. "Politicians" sounds too generic to me, since it's only members of parliament, guests speaking in parliament and ministers. Maybe we could check with political scientists?

BobBorges commented 7 months ago

too generic to me, since it's only members of parliament, guests speaking in parliament and ministers.

riksdag_politicians

MansMeg commented 7 months ago

Still kind of generic to me? Maybe send a quick email to Jan, Josefina and Cecilia to ask if they have a good suggestion?

ninpnin commented 7 months ago

Being more specific will make it less clear for the end users. Even mpdb was kinda obscure

ninpnin commented 6 months ago

@BobBorges is this done?

BobBorges commented 6 months ago

@MansMeg

welfare-state-analytics / riksdagen-corpus

Improve the corpus API #329