BobBorges closed this 8 months ago
I'll grab this one
Here are the transcribed lines. In some documents the line count is off by roughly one line. I did not note which ones, but supplied the content of the appropriate line instead. One entry for prot-1956--ak--19 is left empty: the spreadsheet asks for line 107, but there are only 97 lines in the document.
Ok. This is not good. We should probably not trust the line counts, then. @viremn Can you sample another line at random?
@BobBorges We should probably ask them to sample lines as well, since there seems to be something wrong.
@MansMeg Sorry, what exactly do you need me to do? Do you mean resampling only for prot-1956--ak--19, line 107? In that case I have done it in the file below. I transcribed line 17 instead.
Thanks @viremn! @MansMeg I don't think the line numbers / number of lines are/were used for anything other than generating a pseudo random sample of lines on a given page, so we should be good with 17 instead of 107.
Just so I understand: did someone else count the number of lines in the PDF? How else could we end up with line 107 for a document with 97 lines?
I think it is easier and better to count the lines manually for now and take a sample size of three. In R we can do it like this:
> num_lines <- 97;sample(1:num_lines, size = 3)
I took an additional sample line for you @viremn : line 27
From what I understand, in the previous step of the quality control process someone counted the lines manually, and then 3 lines from each single-column document and 6 from each double-column document were sampled to be transcribed. I already changed the one which was out of bounds for the document to 17. @MansMeg, what is this line 27 for? We already have 6 sampled lines for that particular document, I believe.
Hi @viremn I don't think it matters whether it's line 17 or 27 -- @MansMeg what do you say? Please feel free to claim another one of these samples if you're looking for something to do. There are still three up for grabs.
... and in case R looks weird to you, you can do the same in Python:
>>> from random import sample
>>> n_lines = 97
>>> sample(list(range(1,n_lines+1)), 3)
[1, 61, 66]
It's OK with 17.
But please mark this somehow so we know which line it is and how you sampled it.
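One way to make a resampled line traceable is to draw it with a fixed seed and keep a small record of what was replaced and how. The following is only a sketch: the function names, the record fields, and the seed value are hypothetical and not part of the existing sampling script.

```python
import random


def in_bounds(line_no, n_lines):
    """Return True if a 1-based line number exists on a page with n_lines lines."""
    return 1 <= line_no <= n_lines


def resample_line(doc_id, old_line, n_lines, seed, exclude=()):
    """Draw one replacement line reproducibly and describe how it was drawn.

    All field names in the returned record are illustrative assumptions.
    """
    excluded = set(exclude)
    candidates = [i for i in range(1, n_lines + 1) if i not in excluded]
    if not candidates:
        raise ValueError("no lines left to sample from")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return {
        "doc": doc_id,
        "old_line": old_line,
        "new_line": rng.choice(candidates),
        "n_lines": n_lines,
        "seed": seed,
        "method": "random.Random(seed).choice over 1..n_lines minus exclude",
    }


# Line 107 does not exist in a 97-line document, so it needs replacing.
if not in_bounds(107, 97):
    record = resample_line("prot-1956--ak--19", old_line=107, n_lines=97, seed=1956)
    print(record)
```

Storing the seed and the method next to the new line number lets anyone rerun the draw and get the same line.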
description
We need to assess the quality of the OCR. To do that, we need to compare manually transcribed lines from parliamentary records to the OCR output.
the task
In the attached CSV file, you will find links (under the "facs" column) to randomized pages from parliamentary records that have been scanned and OCRed. Open that image in a web browser (use your betalab credentials). Under the column "row_to_check", you will find the line number of a sampled line. Your job is to fill in the text of that line, exactly as it appears in the image, in the "content" column. Take care to add the text with precision: include any punctuation, diacritics, etc.
randomized_sample_0.csv
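For the comparison step, a common measure is the character error rate (CER): the edit distance between the manual transcription and the OCR line, divided by the length of the transcription. This issue does not say which metric will actually be used, so the sketch below is an assumption, and the function names are hypothetical. It uses only the Python standard library.

```python
import unicodedata


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER of an OCR line (hypothesis) against the manual transcription (reference)."""
    # Normalize so that precomposed and combining diacritics compare as equal.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference transcription is empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Made-up example: the OCR dropped the ring on "å".
print(character_error_rate("Herr talmannen får", "Herr talmannen far"))
```

Because every character counts, a missing diacritic or punctuation mark in the manual transcription would be scored as an OCR error, which is why the transcription needs to match the image exactly.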