welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

OCR Quality control : parliamentary records sample 3 of 6 #456

Closed BobBorges closed 6 months ago

BobBorges commented 8 months ago

description

We need to assess the quality of the OCR. To do that, we need to compare manually transcribed lines from parliamentary records with the OCR output.

the task

In the attached CSV file, you will find links to randomized pages from parliamentary records that have been scanned and OCRed (under the "facs" column). Open that image in a web browser (use your betalab credentials). Under the column "row_to_check", you will find the line number of a sampled line -- your job is to fill in the text exactly as it appears in the image in the "content" column. Take care to enter the text precisely: include all punctuation, diacritics, etc.

randomized_sample_2.csv

todomoldovan commented 8 months ago

GitHub links in the csv throw 404 errors: "The main branch of riksdagen-corpus does not contain the path"

BobBorges commented 8 months ago

Oops, indeed. Since generating these files, we have renamed the files by zero-padding the record numbers

from:

prot-198081--17.xml

to:

prot-198081--017.xml

Is it manageable to add the zeros as you go?
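If adding the zeros by hand gets tedious, the rename could be applied to the links in one pass. This is only a minimal sketch: the helper name `pad_record_number` is made up for illustration, and the three-digit width is an assumption inferred from the single example above.

```python
import re

# Matches the record number between the final "--" and ".xml",
# e.g. the "17" in "prot-198081--17.xml".
_RECORD_RE = re.compile(r"--(\d+)\.xml$")


def pad_record_number(name: str, width: int = 3) -> str:
    """Zero-pad the trailing record number of a protocol file name or link.

    `width=3` is an assumption based on the one example in this thread.
    Names that are already padded, or that do not match, come back unchanged.
    """
    return _RECORD_RE.sub(lambda m: f"--{m.group(1).zfill(width)}.xml", name)


print(pad_record_number("prot-198081--17.xml"))
```

Because `zfill` never shortens a string, running this on an already padded name leaves it as it is, so it should be safe to apply to a column that mixes old and new names.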

ninpnin commented 8 months ago

In the OCR annotation, you shouldn't really be looking at the XML files anyway

todomoldovan commented 7 months ago

I am finding broken betalab links (e.g. https://betalab.kb.se/prot-1962-höst-fk--34/prot_1962_höst_fk__34-033.jp2/_view) due to the csv format. What should I do about these?

BobBorges commented 7 months ago

Replace √∂ with ö and it should work.
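For anyone who would rather fix the links in bulk, here is a rough sketch. It assumes the garbling happened because UTF-8 text was read as Mac Roman somewhere along the way, which is consistent with ö turning into √∂ but is not confirmed in this thread. The function name `repair_link` is hypothetical.

```python
def repair_link(url: str) -> str:
    """Try to undo UTF-8-read-as-Mac-Roman garbling in a link.

    Assumption: the CSV was decoded as Mac Roman at some point. If the
    round trip fails (for example, the link is already correct), fall back
    to the literal replacement suggested above.
    """
    try:
        return url.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return url.replace("√∂", "ö")


broken = "https://betalab.kb.se/prot-1962-h√∂st-fk--34/prot_1962_h√∂st_fk__34-033.jp2/_view"
print(repair_link(broken))
```

Links that are already correct, or that contain only ASCII characters, should pass through unchanged.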

todomoldovan commented 7 months ago

randomized_sample_2.csv