welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

adding new metadata file -- baseline number of MPs per parliament year/chamber/session #373

Closed BobBorges closed 11 months ago

BobBorges commented 11 months ago

I'll use this file in several planned unit tests.

Columns:

There isn't a source for every year -- years with no source currently have no number listed. I have been inferring the missing numbers: if no authoritative source mentions a change, a missing number is taken to be the same as the previous year's.
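The inference rule above could be sketched roughly as follows. This is only an illustration: the row layout (year, chamber, n_mps), the chamber label "fk", and the numbers are hypothetical placeholders, not the actual columns or data in the file.

```python
from typing import Optional

# Hypothetical rows: (year, chamber, n_mps). None marks a year with no source.
# The values are placeholders for illustration, not real figures.
ROWS = [
    (1900, "fk", 150),
    (1901, "fk", None),
    (1902, "fk", None),
    (1903, "fk", 152),
]


def baseline_mps(rows, year: int, chamber: str) -> Optional[int]:
    """Return the number listed for year/chamber, or, if none is listed,
    the most recent earlier listed number for the same chamber."""
    best_year, best = None, None
    for y, c, n in rows:
        # Skip other chambers, unlisted years, and years after the target.
        if c != chamber or n is None or y > year:
            continue
        if best_year is None or y > best_year:
            best_year, best = y, n
    return best
```

With these placeholder rows, a lookup for 1902 falls back to the 1900 value, and a lookup for a year before the first listed number returns None.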

MansMeg commented 11 months ago

This looks good. But we should not keep this in the metadata, since it is meant for testing. I.e. put it in the quality assessment folder instead.

Long term, we should maybe move this type of unit test data to the test folder instead.

BobBorges commented 11 months ago

we should not keep this in the metadata

fair enough

long term

Let's do it now if it's how it should be. How about test/input/ for this type of file? @MansMeg

MansMeg commented 11 months ago

The long-term solution is something we can discuss tomorrow. The main things to consider:

1) It should not be part of the API of the corpus (i.e. not part of the corpus folder), since it is not intended for ordinary users.

2) It should be intuitive, i.e. I should be able to find it easily if I know what I'm looking for. So no generic folder names such as "input". It should probably be in a folder structure like "data_integrity_tests_data" or similar (but a better name =) ).

3) It should be part of the general corpus repository to simplify continuous integration. At least for now.

The only thing I think is important for now is not to put it in the corpus folder (i.e. point 1).

@ninpnin , any thoughts?

BobBorges commented 11 months ago

moved file to corpus/quality_assessment/ in #4ceb99c

MansMeg commented 11 months ago

I think it is better if you add the inferred numbers to the file. I think you are right that it is correct to infer as you do, but it might be better to have a complete dataset. Otherwise, we need to do this inference every time someone uses the file. What do you think?

BobBorges commented 11 months ago

coding with a complete file will definitely be easier -- I just didn't want to add information that we can't tie to a source.

MansMeg commented 11 months ago

Ok. I have no strong opinion here. But when you have the source as a column, you could just set "inferred" there to indicate that it is not from a source. Do as you like.
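A minimal sketch of that idea, assuming a CSV with hypothetical column names (year, chamber, n_mps, source) and placeholder values; the real file's columns and numbers may differ. Missing numbers are carried forward once, per chamber, and the source column is set to "inferred" for those rows.

```python
import csv
import io

# Placeholder data; column names, chamber label, and source labels are assumptions.
RAW = """year,chamber,n_mps,source
1900,fk,150,placeholder-source-a
1901,fk,,
1902,fk,,
1903,fk,152,placeholder-source-b
"""


def fill_inferred(rows):
    """Carry the last listed n_mps forward per chamber and mark filled
    rows with source = "inferred". Rows with no earlier value stay empty."""
    last = {}  # chamber -> most recent listed n_mps
    out = []
    for row in sorted(rows, key=lambda r: (r["chamber"], int(r["year"]))):
        row = dict(row)  # copy so the input rows are not modified
        chamber = row["chamber"]
        if row["n_mps"]:
            last[chamber] = row["n_mps"]
        elif chamber in last:
            row["n_mps"] = last[chamber]
            row["source"] = "inferred"
        out.append(row)
    return out


filled = fill_inferred(list(csv.DictReader(io.StringIO(RAW))))
```

Rows that came from a source keep their original source value, so a test can still tell sourced numbers from inferred ones.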

ninpnin commented 11 months ago

LGTM