welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

OCR quality control: parliamentary records sample 2 of 6 #455

Closed BobBorges closed 4 months ago

BobBorges commented 5 months ago

description

We need to assess the quality of the OCR. To do that, we need to compare manually transcribed lines from the parliamentary records with the OCR output.
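As a rough illustration of what such a comparison could look like, here is a minimal sketch that scores an OCR line against its manual transcription using character error rate (edit distance divided by the length of the reference). The thread does not say which metric the project uses, so the metric, the function names, and the NFC normalization step are all assumptions.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr: str) -> float:
    """Edit distance from OCR text to the manual reference, per reference character."""
    # Assumption: normalize to NFC so that precomposed and combining
    # diacritics (e.g. two encodings of "å") compare as equal.
    reference = unicodedata.normalize("NFC", reference)
    ocr = unicodedata.normalize("NFC", ocr)
    if not reference:
        return 0.0 if not ocr else 1.0
    return levenshtein(reference, ocr) / len(reference)
```

A score of 0.0 means the OCR line matches the transcription exactly; dropped diacritics or punctuation each count as one error, which is why the transcription needs to be exact.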

the task

In the attached CSV file, you will find links to randomized pages from parliamentary records that have been scanned and OCRed (under the "facs" column). Open that image in a web browser (use your betalab credentials). Under the column "row_to_check", you will find the line number of a sampled line. Your job is to fill in the text exactly as it appears in the image in the "content" column. Take care to enter the text precisely, including any punctuation, diacritics, etc.

randomized_sample_1.csv
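Before handing a sample back, it may help to check that no row was skipped. The sketch below lists the data rows whose "content" field is still empty. The column names come from the task description above; the function name and the sample rows in the usage example are hypothetical, and the real file may contain additional columns.

```python
import csv
import io


def rows_missing_content(csv_text: str) -> list[int]:
    """Return 1-based data-row numbers whose "content" field is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = []
    for index, row in enumerate(reader, start=1):
        # Treat a missing column and whitespace-only text as empty.
        if not (row.get("content") or "").strip():
            missing.append(index)
    return missing


# Hypothetical sample data, for illustration only.
sample = (
    "facs,row_to_check,content\n"
    "https://example.org/page-a.jpg,45,Herr talman!\n"
    "https://example.org/page-b.jpg,12,\n"
)
print(rows_missing_content(sample))
```

For a file on disk, the same function can be called with the file's text, for example `rows_missing_content(open("randomized_sample_1.csv", encoding="utf-8").read())`.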

Lukasforell commented 5 months ago

I can take this one! However, I have a question. “Swerik: Quality dimensions” says to sample three or six lines in the documents. If the “row to check” says 45 and there is only one column, do I paste lines 45, 46, and 47, for example?

BobBorges commented 5 months ago

Hi. It will say 45, 46 and 47 if those are the lines you need to annotate. When it says 45, annotate just that line. The sample order has been shuffled, so you probably won't see lines from the same document next to each other, but each line to annotate is indicated in the sample.

Lukasforell commented 5 months ago

randomized_sample_1_with_content - randomized_sample_1.csv

Here it is with content.