welfare-state-analytics / riksdagen-corpus-old

Preprocess the proceedings of the Swedish parliament
https://welfare-state-analytics.github.io/riksdagen-corpus/riksdagen_corpus/

Adding curation? #42

Closed by MansMeg 3 years ago

MansMeg commented 3 years ago

If I understand this correctly, we only provide information about what we want to curate. But I guess some people might want to change the XML files directly. Is that an acceptable way to do it?

ninpnin commented 3 years ago

Directly changing the XML files would inevitably make the curations sequential / build on top of previous versions, wouldn't it?

MansMeg commented 3 years ago

Yes. But how do we solve this in the best way? Say that I want to fix some errors. The easiest for me is to correct the XML file directly and then open a PR, because I also want to use the same corrected XML file.

Can we extract the corrections from a diff?

ninpnin commented 3 years ago

> Can we extract the corrections from a diff?

Yes, but that would need some restrictions, for example "no edits can add or remove paragraphs". Whether that is trivial or difficult, I'm not 100% sure.
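To make the restriction concrete, here is a minimal sketch of extracting corrections from two versions of a protocol while rejecting any edit that adds or removes paragraphs. It assumes paragraphs are plain `<p>` elements without a namespace; the function name `extract_corrections` and the `tag` parameter are hypothetical, and the real files may use a different element (and a namespaced tag).

```python
import xml.etree.ElementTree as ET


def paragraph_texts(xml_string, tag="p"):
    """Return the whitespace-normalized text of every `tag` element, in document order."""
    root = ET.fromstring(xml_string)
    return [" ".join("".join(elem.itertext()).split()) for elem in root.iter(tag)]


def extract_corrections(original_xml, edited_xml, tag="p"):
    """Return (index, old_text, new_text) for each paragraph whose text changed.

    Raises ValueError if the edit added or removed paragraphs, because the
    corrections could then no longer be matched by position.
    """
    old = paragraph_texts(original_xml, tag)
    new = paragraph_texts(edited_xml, tag)
    if len(old) != len(new):
        raise ValueError(
            f"paragraph count changed from {len(old)} to {len(new)}; "
            "edits may not add or remove paragraphs"
        )
    return [(i, o, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]
```

Each returned tuple is a position-based correction, so corrections from different contributors would not have to build on each other as long as the paragraph structure stays fixed.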

MansMeg commented 3 years ago

Sure. I think we should focus on making it easy for people to curate and add corrections.

ninpnin commented 3 years ago

Another way to allow orthogonality would be splitting the protocols into multiple files, and then building solely on top of those files.

MansMeg commented 3 years ago

Yes. But can that be done in a reasonable way? I guess a protocol is the smallest unit?

ninpnin commented 3 years ago

The division into text areas is very consistent, so that could be used. But then it becomes a lot of files, roughly 3-4x the total page count.

ninpnin commented 3 years ago

I created a test repository for this approach. 1.5M files. Git status, add and commit took maybe 3-5 seconds each.

MansMeg commented 3 years ago

Hmmm. Yes, that seems to be too much. I guess a protocol is the smallest unit. Any other ideas for how to get a good pipeline? The goal is threefold:

  1. Make it easy to contribute
  2. Make it easy to check a contribution (ideally through CI)
  3. Make it traceable to the original files

Maybe the simplest is to go even further: just have code that generates the final files and the Parla-CLARIN files. Then we can trace back using just git, with no additional files at all?
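For goal 2, one possible building block for a CI check is a script that fails when any XML file in the repository no longer parses. The sketch below only tests well-formedness, not schema validity, and the function name `find_malformed_xml` is hypothetical rather than anything from the repository.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed_xml(root_dir):
    """Return a list of (path, error message) for every .xml file that fails to parse."""
    failures = []
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            failures.append((str(path), str(err)))
    return failures


def main(root_dir="."):
    """Print each malformed file and return an exit code (0 = all files parse)."""
    failures = find_malformed_xml(root_dir)
    for path, message in failures:
        print(f"{path}: {message}")
    return 1 if failures else 0
```

A CI job could call `sys.exit(main("."))` from the repository root, so that a contribution which breaks a file is flagged before it is merged.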

ninpnin commented 3 years ago

Implemented protocol-by-protocol.