welfare-state-analytics / riksdagen-corpus-old

Preprocess the proceedings of the Swedish parliament
https://welfare-state-analytics.github.io/riksdagen-corpus/riksdagen_corpus/

Set up a test suite and GitHub Actions for CI #31

Closed MansMeg closed 3 years ago

MansMeg commented 3 years ago

When we update or change the corpus, we should run some quality checks automatically using GitHub Actions. The things I can come up with are:

ninpnin commented 3 years ago

Current state of CI:

  1. Validate the Parla-Clarin example file against the Parla-Clarin schema
  2. Validate a one-protocol-long Parla-Clarin corpus generated by our code against the Parla-Clarin schema

MansMeg commented 3 years ago

Cool. Maybe easier if you fix the initial CI @ninpnin ?

ninpnin commented 3 years ago

Considering test case 2 by Måns, the proportions of missing entries are:

  name        0.0
  party       0.008336491095111784
  district    0.03306176582038651
  chamber     0.0
  start       0.0
  end         0.0
  occupation  0.7755778704054566
  gender      0.18416066691928762
  id          0.0

Where should we set the bar?
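
For reference, proportions like these could be computed roughly as follows. This is only a sketch using the standard library: the function name `missing_proportions` and the sample rows are made up for illustration and are not taken from the repository.

```python
import csv
import io


def missing_proportions(rows, columns):
    """Return {column: share of rows whose value is empty or absent}."""
    rows = list(rows)
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if not (row.get(col) or "").strip()) / total
        for col in columns
    }


# Hypothetical sample, NOT real corpus metadata.
SAMPLE = """id,name,party,gender
1,Anna Andersson,S,woman
2,Bengt Berg,,
3,Carl Carlsson,M,man
4,Dora Dahl,C,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
proportions = missing_proportions(rows, ["id", "name", "party", "gender"])
for column, share in proportions.items():
    print(column, share)
```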

MansMeg commented 3 years ago

Good point! Maybe only check that there are no missing values for the variables we know should be full. I.e. id and name?

Also, maybe check that there is no increase in the absolute number of missing values? I.e., so we do not happen to remove party or similar?

I think that should be sufficient for now. Then we can add tests later on.
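
A possible shape for the second check, so that a field such as party is not accidentally dropped: count the missing values per column before and after a change and flag every column that got worse. This is a sketch under assumptions; `count_missing`, `regressions`, and the two small row lists are hypothetical, and how the baseline is stored is left open.

```python
def count_missing(rows, columns):
    """Return {column: number of rows whose value is empty or absent}."""
    return {
        col: sum(1 for row in rows if not (row.get(col) or "").strip())
        for col in columns
    }


def regressions(baseline_counts, current_counts):
    """Return {column: (before, after)} for columns with more missing values than before."""
    return {
        col: (baseline_counts.get(col, 0), after)
        for col, after in current_counts.items()
        if after > baseline_counts.get(col, 0)
    }


# Hypothetical metadata before and after a change; party was lost for id 1.
BASELINE_ROWS = [{"id": "1", "party": "S"}, {"id": "2", "party": ""}]
CURRENT_ROWS = [{"id": "1", "party": ""}, {"id": "2", "party": ""}]

columns = ["id", "party"]
worse = regressions(
    count_missing(BASELINE_ROWS, columns),
    count_missing(CURRENT_ROWS, columns),
)
print(worse)
```

A CI test would then simply assert that `worse` is empty.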

MansMeg commented 3 years ago

Add a test suite that checks that MPs only have speeches during the period when they are actually active in parliament.
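
One way such a check could look, assuming each speech can be reduced to an MP id plus a protocol date and that the metadata gives one or more (start, end) periods per MP. The function name `inactive_speakers`, the ids, and the dates are all hypothetical.

```python
from datetime import date


def inactive_speakers(speeches, terms):
    """Return the (mp_id, day) pairs where the MP was not active on that day.

    speeches: iterable of (mp_id, date)
    terms: {mp_id: [(start_date, end_date), ...]}, both ends inclusive
    """
    problems = []
    for mp_id, day in speeches:
        periods = terms.get(mp_id, [])
        if not any(start <= day <= end for start, end in periods):
            problems.append((mp_id, day))
    return problems


# Hypothetical data for illustration only.
TERMS = {
    "mp-a": [(date(1950, 1, 1), date(1958, 12, 31))],
    "mp-b": [(date(1960, 1, 1), date(1964, 12, 31))],
}
SPEECHES = [
    ("mp-a", date(1955, 3, 2)),   # inside the active period
    ("mp-b", date(1959, 5, 4)),   # before the active period starts
    ("mp-c", date(1955, 3, 2)),   # no known period at all
]

wrongly_tagged = inactive_speakers(SPEECHES, TERMS)
print(wrongly_tagged)
```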

ninpnin commented 3 years ago

I added Parla-Clarin validation. Now, for a pull request, all changed and added XML files are validated against the Parla-Clarin schema. See #52

ninpnin commented 3 years ago

I added a test that checks that the name, party, district, chamber, and id columns have over 95% of the entries present.

MansMeg commented 3 years ago

Great! I think ID and Name need to be 100%?

ninpnin commented 3 years ago

Fixed.
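
As an illustration of the thresholds discussed here (id and name fully present, party, district, and chamber at least 95% present), a check could be sketched like this. The names `MIN_PRESENT` and `failing_columns` are assumptions, not the repository's actual test code; the proportions are the ones reported earlier in this thread.

```python
# Minimum share of present (non-missing) entries per column.
# Whether 95% is meant as a strict or inclusive bound is an assumption here.
MIN_PRESENT = {
    "id": 1.0,
    "name": 1.0,
    "party": 0.95,
    "district": 0.95,
    "chamber": 0.95,
}


def failing_columns(missing, min_present=MIN_PRESENT):
    """Return the sorted columns whose present share falls below the minimum.

    missing: {column: proportion of missing entries}; a column absent from
    the dict is treated as entirely missing.
    """
    return sorted(
        col
        for col, minimum in min_present.items()
        if 1.0 - missing.get(col, 1.0) < minimum
    )


# Missing proportions as reported earlier in this thread.
REPORTED = {
    "name": 0.0,
    "party": 0.008336491095111784,
    "district": 0.03306176582038651,
    "chamber": 0.0,
    "id": 0.0,
}

print(failing_columns(REPORTED))
```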

MansMeg commented 3 years ago

Great! I updated the comment to capture the issues we listed here so we know when the issue is done. One of your tests, I didn't really understand.

ninpnin commented 3 years ago

Added a test that checks the validity of one protocol per type (ak, fk, ek, digital originals)

ninpnin commented 3 years ago

The last test case would necessitate running it on all protocols, right? That sounds a bit unattainable. EDIT: maybe pick 5-10 files and check that the tagged MPs in those files were active at the time.

MansMeg commented 3 years ago

Why not run on all protocols? How much time would such a test take?

One solution would be to run those tests only in a test branch and the main branch if they take a long time? I mean, it doesn't matter to us if it runs for an hour?

ninpnin commented 3 years ago

Good news: I overestimated the time this test would take; it's actually only a minute or so. Bad news: there are some 30 protocols with wrongly tagged MPs.

MansMeg commented 3 years ago

That's good news! Then we have found a bug!

ninpnin commented 3 years ago

Fair enough!

All of the planned CI tests are now implemented, by the way.