welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Predict OCR error #403

Open MansMeg opened 10 months ago

MansMeg commented 10 months ago

We want to predict which documents have poorer OCR quality so that we can re-OCR the affected parts of those documents. @MansMeg has submitted this as a data science project and hence has more detailed information.

SchermanJ commented 10 months ago

For what it's worth, I've cleaned up every headline of every motion from 1971 here in Wikidata: https://w.wiki/7u6o

You can compare it with the OCR errors in the Riksdagen scan from the same year and that might give some clues.
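
One way such a comparison could be sketched is a character error rate: the edit distance between a cleaned headline and its OCR'd counterpart, divided by the length of the cleaned headline. The snippet below is only an illustration under assumptions. The function names are hypothetical, the example strings are invented rather than taken from the corpus or from Wikidata, and matching each cleaned headline to its scanned counterpart is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr: str) -> float:
    """Edit distance normalised by the length of the cleaned reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, ocr) / len(reference)


# Invented example pair (not real corpus data): one substituted character.
cleaned = "med anledning av propositionen"
scanned = "med anledninq av propositionen"
print(character_error_rate(cleaned, scanned))
```

Averaging this rate per document for 1971 would give a rough label of which scans are worse, which might then serve as a target when predicting OCR quality for years without cleaned headlines.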

SchermanJ commented 10 months ago

Here are my earlier observations regarding OCR errors: #65

MansMeg commented 10 months ago

That's very valuable!

SchermanJ commented 9 months ago

And now I've done the same for 1972: https://w.wiki/84EP

MansMeg commented 9 months ago

Nice!