MansMeg closed this issue 3 years ago
Current state of CI:
Cool. Maybe easier if you fix the initial CI @ninpnin ?
Considering test case 2 by Måns, the proportions of missing entries are:
name        0.0
party       0.008336491095111784
district    0.03306176582038651
chamber     0.0
start       0.0
end         0.0
occupation  0.7755778704054566
gender      0.18416066691928762
id          0.0
Where should we set the bar?
Good point! Maybe only check that there are no missing values for the variables we know should be complete, i.e. id and name?
Also, maybe check that there is no reduction in the absolute number of non-missing values, i.e. so that we do not accidentally remove party or similar?
I think that should be sufficient for now. Then we can add tests later on.
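A minimal sketch of the second check suggested above, assuming the metadata is available as a list of dictionaries before and after a change. The function names and the sample rows are hypothetical, not taken from the repository.

```python
def count_missing(rows, columns):
    """Count empty or absent values per column."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                counts[col] += 1
    return counts


def missing_regressions(old_rows, new_rows, columns):
    """Return the columns whose missing count grew, as (before, after)."""
    before = count_missing(old_rows, columns)
    after = count_missing(new_rows, columns)
    return {c: (before[c], after[c]) for c in columns if after[c] > before[c]}


# Hypothetical data: the change accidentally drops one party value.
old = [{"id": "a1", "name": "Anna", "party": "S"},
       {"id": "a2", "name": "Bo", "party": ""}]
new = [{"id": "a1", "name": "Anna", "party": ""},
       {"id": "a2", "name": "Bo", "party": ""}]

regressions = missing_regressions(old, new, ["id", "name", "party"])
print(regressions)  # {'party': (1, 2)}
```

In CI, the old rows could come from the main branch and the new rows from the pull request, with the job failing whenever the returned dictionary is non-empty.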
Add a test suite that checks that MPs only have speeches during the period when they are actually active in parliament.
I added Parla-Clarin validation. Now, for a pull request, all changed and added XML files are tested against the Parla-Clarin schema. See #52
I added a test that checks that the name, party, district, chamber, and id columns have over 95% of the entries present.
Great! I think ID and Name need to be 100%?
Fixed.
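For reference, a coverage test of this kind could look roughly like the sketch below. It assumes the metadata is a CSV file with the column names listed above; the thresholds dictionary, the function names, and the sample data are illustrative assumptions, not the actual test in the repository.

```python
import csv
import io

# Assumed thresholds: id and name must be complete, the rest at least 95%.
THRESHOLDS = {"id": 1.0, "name": 1.0, "party": 0.95,
              "district": 0.95, "chamber": 0.95}


def present_share(rows, column):
    """Share of rows with a non-empty value in the given column."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if (row.get(column) or "").strip())
    return present / len(rows)


def check_coverage(rows, thresholds):
    """Return (column, share) for every column below its threshold."""
    failures = []
    for column, minimum in thresholds.items():
        share = present_share(rows, column)
        if share < minimum:
            failures.append((column, share))
    return failures


# Hypothetical sample: the second row has no party.
sample = """id,name,party,district,chamber
a1,Anna,S,Uppsala,ak
a2,Bo,,Lund,fk
"""
rows = list(csv.DictReader(io.StringIO(sample)))
failures = check_coverage(rows, THRESHOLDS)
print(failures)  # [('party', 0.5)]
```

Keeping the thresholds in one dictionary makes it easy to raise the bar per column later, for example once gender or occupation are filled in more completely.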
Great! I updated the comment to capture the issues we listed here so we know when the issue is done. One of your tests, I didn't really understand.
Added a test that checks the validity of one protocol per type (ak, fk, ek, digital originals)
The last test case would necessitate running it on all protocols, right? That sounds a bit unattainable. EDIT: maybe pick 5-10 files and check that the tagged MPs in those files were active at the time.
Why not run on all protocols? How much time would such a test take?
One solution would be to run those tests only on a test branch and the main branch if they take a long time? I mean, it doesn't matter to us if they run for an hour?
Good news: I actually overestimated the time this test would take; it only takes a minute or so. Bad news: there are some 30 protocols with wrongly tagged MPs.
That's good news! Then we have found a bug!
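The core of such a check can be sketched as follows, assuming each tagged speech has already been reduced to a protocol name, an MP id, and a date, and that each MP has a list of (start, end) terms. All names and data here are hypothetical; the real test reads these from the corpus and the metadata.

```python
from datetime import date


def is_active(terms, day):
    """True if the day falls inside any (start, end) term, inclusive."""
    return any(start <= day <= end for start, end in terms)


def wrongly_tagged(speeches, terms_by_id):
    """Return the speeches whose tagged MP was not active on that date."""
    return [(protocol, mp_id, day)
            for protocol, mp_id, day in speeches
            if not is_active(terms_by_id.get(mp_id, []), day)]


# Hypothetical metadata and tagged speeches.
terms_by_id = {
    "a1": [(date(1950, 1, 1), date(1958, 12, 31))],
    "a2": [(date(1960, 1, 1), date(1964, 12, 31))],
}
speeches = [
    ("prot-1955-ak-3", "a1", date(1955, 3, 1)),   # inside a1's term
    ("prot-1955-ak-3", "a2", date(1955, 3, 1)),   # before a2's term
    ("prot-1962-fk-7", "a3", date(1962, 5, 2)),   # unknown id
]

errors = wrongly_tagged(speeches, terms_by_id)
print(len(errors))  # 2
```

An MP id that is missing from the metadata is treated as wrongly tagged here, which is a design choice; it could equally well be reported as a separate kind of error.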
Fair enough!
All of the planned CI tests are now implemented, by the way.
When we update or change the corpus, we should run some quality checks automatically using GitHub Actions. The things I can come up with are: