welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Better schema testing #395

Closed BobBorges closed 3 months ago

BobBorges commented 9 months ago

Currently the schema unit test runs on one random file per year. We should at least run the schema test locally once on the whole corpus to find files that don't fit the schema, instead of having a unit test that inconsistently passes or fails while we're working on other issues. Then update workflows:
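To illustrate the difference between the two strategies, here is a minimal sketch in Python. It assumes that each protocol file sits in a directory named after its year, and it takes the actual schema check as a caller-supplied function `is_valid`; both the layout and all function names are hypothetical, not the repository's real test code.

```python
import random
from collections import defaultdict
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional


def group_by_year(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Group files by parent directory name (assumed here to be the year)."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        groups[path.parent.name].append(path)
    return dict(groups)


def sample_one_per_year(paths: Iterable[Path], seed: Optional[int] = None) -> List[Path]:
    """Pick one random file per year, like the sampled unit test described above."""
    rng = random.Random(seed)
    groups = group_by_year(paths)
    return [rng.choice(sorted(groups[year])) for year in sorted(groups)]


def find_invalid(paths: Iterable[Path], is_valid: Callable[[Path], bool]) -> List[Path]:
    """Run the check on every given file and return the ones that fail."""
    return [path for path in paths if not is_valid(path)]
```

Calling `find_invalid` on the full file list reports every non-conforming file in one pass, whereas calling it on `sample_one_per_year(...)` only fails when a bad file happens to be drawn, which is why the sampled test passes on some runs and fails on others.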

BobBorges commented 7 months ago

Schema test is currently run on all changed/created files on PR. Maybe that's better than push, since we shouldn't be pushing edited protocols directly to dev or main anyway. What do you say @MansMeg?

BobBorges commented 7 months ago

On push, there's a random sample. We should just add the validation check for all files on release.

ninpnin commented 6 months ago

This on-push testing is already implemented. The problem is that sometimes the diffs are so large that the list of changed files can't be generated on GitHub Actions.

ninpnin commented 6 months ago

See: https://github.com/welfare-state-analytics/riksdagen-corpus/actions/runs/7347998353/workflow#L33

ninpnin commented 6 months ago

Testing the whole corpus against the schema is quicker now that we have the full AK/FK/EK (sub)corpus files. It still takes like 20 minutes per subcorpus, though, so we might want to restrict when we run it.

BobBorges commented 6 months ago

I get the same error if I run the test locally on the same set of files, so I don't think it's a problem with actions.

ninpnin commented 6 months ago

The shell can only take arguments up to 1048576 characters, i.e. roughly 20k files. Is that what's happening this time?
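If the argument length is the culprit, one possible workaround is to split the file list into batches that each stay under a character budget and run the test once per batch. The sketch below is an assumption-laden illustration: `batch_by_length` is a hypothetical helper, and the real limit depends on the system. At the 1048576 characters quoted above, 20k files works out to about 52 characters per path.

```python
from typing import Iterable, Iterator, List


def batch_by_length(paths: Iterable[str], max_chars: int) -> Iterator[List[str]]:
    """Yield batches of paths whose combined length stays within max_chars.

    Each path is counted with one extra character for the separator.
    A single path longer than max_chars is still yielded, alone in its batch.
    """
    batch: List[str] = []
    used = 0
    for path in paths:
        cost = len(path) + 1  # +1 for the separator between arguments
        if batch and used + cost > max_chars:
            yield batch
            batch, used = [], 0
        batch.append(path)
        used += cost
    if batch:
        yield batch
```

Each batch can then be passed to a separate invocation of the test command. Writing the list to a file and having the test read it from there would avoid the command-line limit altogether.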