welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

# OCR Quality control : parliamentary records sample 1 of 6 #454

Closed BobBorges closed 8 months ago

BobBorges commented 10 months ago

description

We need to assess the quality of the OCR. To do that, we need to compare manually transcribed lines from the parliamentary records with the OCR output.
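One common way to make such a comparison is the character error rate (CER): the edit distance between the manual transcription and the OCR line, divided by the length of the transcription. The thread does not say which metric the project uses, so the sketch below is only an assumption; the function names are hypothetical. Both strings are normalized to NFC first so that a diacritic typed as a combining mark is not counted as an OCR error.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr: str) -> float:
    """Edit distance divided by the length of the manual transcription (hypothetical helper)."""
    reference = unicodedata.normalize("NFC", reference)
    ocr = unicodedata.normalize("NFC", ocr)
    if not reference:
        raise ValueError("reference transcription is empty")
    return levenshtein(reference, ocr) / len(reference)
```

A CER of 0 means the OCR line matches the transcription exactly; one wrong character in a nine-character line gives 1/9.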

the task

In the attached CSV file, you will find links to randomized pages from parliamentary records that have been scanned and OCRed (under the "facs" column). Open that image in a web browser (use your betalab credentials). Under the column "row_to_check", you will find the line number of a sampled line. Your job is to fill in the text exactly as it appears in the image in the "content" column. Take care to enter the text precisely: include any punctuation, diacritics, etc.

randomized_sample_0.csv

viremn commented 9 months ago

I'll grab this one

viremn commented 9 months ago

Here are the transcribed lines. In some documents the line count is off by roughly one line. I did not note which ones, but instead supplied the content of the appropriate line. One entry for prot-1956--ak--19 is left empty: the spreadsheet asks for line 107, but there are only 97 lines in the document.

randomized_sample_0_content_annotated.csv
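An out-of-range request like this one could be caught before transcription by checking each sampled line number against a manual line count. The sketch below is only an illustration: "row_to_check" is a column named in the task description, while the "protocol" key and the `line_counts` mapping are hypothetical and would have to be supplied by whoever counts the lines.

```python
def find_out_of_range(rows, line_counts):
    """Return (protocol, row_to_check) pairs with no known count or outside 1..count."""
    problems = []
    for row in rows:
        protocol = row["protocol"]            # hypothetical column name
        line_no = int(row["row_to_check"])    # column named in the task description
        count = line_counts.get(protocol)
        if count is None or not 1 <= line_no <= count:
            problems.append((protocol, line_no))
    return problems


# Rows as csv.DictReader would yield them; the values mirror the case reported above.
rows = [
    {"protocol": "prot-1956--ak--19", "row_to_check": "107"},
    {"protocol": "prot-1956--ak--19", "row_to_check": "17"},
]
line_counts = {"prot-1956--ak--19": 97}
print(find_out_of_range(rows, line_counts))  # [('prot-1956--ak--19', 107)]
```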

MansMeg commented 9 months ago

OK. This is not good. We should probably not trust the line counts, then. @viremn Can you sample another line at random?

@BobBorges We should probably ask them to sample lines as well, since there seems to be something wrong.

viremn commented 9 months ago

@MansMeg Sorry, what exactly do you need me to do? Do you mean resampling only for prot-1956--ak--19, line 107? In that case I have done it in the file below. I transcribed line 17 instead.

randomized_sample_0_content_annotated_2.csv

BobBorges commented 9 months ago

Thanks @viremn! @MansMeg I don't think the line numbers or the number of lines were used for anything other than generating a pseudo-random sample of lines on a given page, so we should be good with 17 instead of 107.

MansMeg commented 9 months ago

Just so I understand: did someone else count the number of lines in the PDF? How else could we end up with line 107 when there are only 97 lines?

I think it is easier and better to count the lines manually for now and take a sample size of three. In R we can do it like this:

> num_lines <- 97; sample(1:num_lines, size = 3)

I took an additional sample line for you @viremn : line 27

viremn commented 9 months ago

From what I understand, in the previous step of the quality control process someone counted the lines manually, and then 3 lines from each single-column document and 6 from each double-column document were sampled to be transcribed. I already changed the one which was out of bounds for the document to 17. @MansMeg, what is this line 27 for? We already have 6 sampled lines for that particular document, I believe.

BobBorges commented 9 months ago

Hi @viremn I don't think it matters whether it's line 17 or 27 -- @MansMeg what do you say? Please feel free to claim another one of these samples if you're looking for something to do. There are still three up for grabs.

... and in case R looks weird to you, you can do the same in Python:

>>> from random import sample
>>> n_lines = 97
>>> sample(range(1, n_lines + 1), 3)
[1, 61, 66]

MansMeg commented 9 months ago

It's OK with 17.

MansMeg commented 9 months ago

But please mark this somehow so we know which line it is and how you sampled it.
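One way to make a replacement line traceable is to draw it from a seeded generator and store the seed and method next to the line number. This is a minimal sketch, not something the thread prescribes; the function name and the record fields other than "row_to_check" are hypothetical.

```python
import random


def resample_line(n_lines, exclude, seed):
    """Draw one line in 1..n_lines that is not already sampled, with provenance."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    candidates = [n for n in range(1, n_lines + 1) if n not in exclude]
    if not candidates:
        raise ValueError("no lines left to sample")
    return {
        "row_to_check": rng.choice(candidates),
        "n_lines": n_lines,    # manual line count the draw was based on
        "seed": seed,
        "method": "random.Random(seed).choice over 1..n_lines, excluding sampled lines",
    }


# Example: 97 lines on the page, some lines already sampled (illustrative values).
record = resample_line(n_lines=97, exclude={17, 27}, seed=454)
print(record)
```

Rerunning with the same arguments returns the same record, so anyone can check how the line was chosen.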