BobBorges closed 6 months ago
GitHub links in the csv throw 404 errors: "The main branch of riksdagen-corpus does not contain the path"
Oops, indeed. Since generating these files, we have renamed the files by zero-padding the record numbers
from:
prot-198081--17.xml
to:
prot-198081--017.xml
Is it manageable to add the zeros as you go?
In the OCR annotation, you shouldn't really be looking at the XML files anyway
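If you do want working GitHub links, the renaming above could be applied to the old names with something like the following. This is only a sketch: the helper name `pad_record_number` is made up for illustration, and the width of 3 is inferred from the single example above, not from a documented naming rule.

```python
import re


def pad_record_number(name: str, width: int = 3) -> str:
    """Zero-pad the record number that follows the final '--' in a file name.

    Hypothetical helper: the width of 3 is an assumption based on the
    example prot-198081--17.xml -> prot-198081--017.xml.
    """
    # Only the digits directly before the ".xml" suffix are touched,
    # so names that are already padded come back unchanged.
    return re.sub(
        r"--(\d+)\.xml$",
        lambda m: "--" + m.group(1).zfill(width) + ".xml",
        name,
    )


print(pad_record_number("prot-198081--17.xml"))
```

The same function works on a full link, as long as the link ends in the `.xml` file name.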
I am finding broken betalab links (e.g. https://betalab.kb.se/prot-1962-höst-fk--34/prot_1962_höst_fk__34-033.jp2/_view) due to the csv format. What should I do about these?
Replace ö
with ö
and it should work.
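The two characters above look the same when rendered. One plausible reading (an assumption, not something confirmed in this thread) is that they differ only in Unicode normalization: a precomposed ö (U+00F6) versus an o followed by a combining diaeresis (U+0308). Under that assumption, a small sketch like this produces both forms of a link so you can try each one; the function name `normalization_variants` is made up for illustration.

```python
import unicodedata


def normalization_variants(url: str) -> dict:
    """Return the NFC (precomposed) and NFD (decomposed) forms of a link.

    Which form betalab expects is not stated in the thread, so both
    are returned and can be tried in turn.
    """
    return {form: unicodedata.normalize(form, url) for form in ("NFC", "NFD")}


variants = normalization_variants("prot-1962-ho\u0308st-fk--34")
for form, value in variants.items():
    # Print the code points as well, since the two forms render identically.
    print(form, value, [hex(ord(ch)) for ch in value if ord(ch) > 127])
```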
description
We need to assess the quality of the OCR. To do that, we need to compare manually transcribed lines from parliamentary records with the OCR output.
the task
In the attached CSV file, you will find links to randomized pages from parliamentary records that have been scanned and OCRed (under the "facs" column). Open the linked image in a web browser (use your betalab credentials). Under the column "row_to_check", you will find the line number of a sampled line -- your job is to fill in the text exactly as it appears in the image in the "content" column. Take care to enter the text precisely -- include any punctuation, diacritics, etc.
randomized_sample_2.csv
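To keep track of what is left to annotate, something like the sketch below could list the rows whose "content" cell is still empty. The column names "facs", "row_to_check" and "content" come from the task description; the comma delimiter, the helper name `pending_rows`, and the sample rows are assumptions for illustration only.

```python
import csv
import io


def pending_rows(csv_file):
    """Return (facs, row_to_check) pairs for rows with no transcription yet.

    Assumes a comma-delimited file with a header row containing the
    columns "facs", "row_to_check" and "content".
    """
    reader = csv.DictReader(csv_file)
    return [
        (row["facs"], row["row_to_check"])
        for row in reader
        if not (row.get("content") or "").strip()
    ]


# Made-up sample data standing in for the real CSV file.
sample = io.StringIO(
    "facs,row_to_check,content\n"
    "https://example.org/page-a,3,\n"
    "https://example.org/page-b,7,Herr talman!\n"
)
for facs, line_number in pending_rows(sample):
    print(facs, line_number)
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in; the encoding is also an assumption.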