welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Re-OCR debate records since 1971 #418

Closed fredrik1984 closed 8 months ago

fredrik1984 commented 9 months ago

The digitized debate records from the unicameral era probably need to be re-OCRed. They were digitized in the mid-2000s (the bicameral documents were digitized later, in the mid-2010s). Since around the mid-1990s, all debate records have been born-digital.

ninpnin commented 8 months ago

I think we already use the Tesseract versions. For example, prot-197576--008 points to https://betalab.kb.se/prot-197576--8/prot_197576__8-000.xml, which says

<softwareName>tesseract 4.1.1</softwareName>

MansMeg commented 8 months ago

Great!

Is this the case for all protocols? Then maybe we don't need to do this. This issue might come from Pelle, who looked at the OCR at the parliament, which doesn't use Tesseract.
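
One way to check this for many files would be to read the `<softwareName>` element from each OCR XML file. Below is a minimal sketch in Python, assuming the files are ALTO-style XML with a declared namespace. The sample document and the function names `ocr_software_names` and `uses_tesseract` are hypothetical and not part of the corpus tooling.

```python
import xml.etree.ElementTree as ET
from typing import List


def ocr_software_names(xml_text: str) -> List[str]:
    """Return the text of every <softwareName> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter():
        # With a declared namespace, tags look like "{namespace}softwareName".
        if isinstance(element.tag, str) and element.tag.rsplit("}", 1)[-1] == "softwareName":
            names.append((element.text or "").strip())
    return names


def uses_tesseract(xml_text: str) -> bool:
    """True if any recorded software name starts with 'tesseract' (case-insensitive)."""
    return any(name.lower().startswith("tesseract") for name in ocr_software_names(xml_text))


# Hypothetical, minimal ALTO-like sample; real files carry much more metadata.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Description>
    <OCRProcessing ID="OCR_0">
      <ocrProcessingStep>
        <processingSoftware>
          <softwareName>tesseract 4.1.1</softwareName>
        </processingSoftware>
      </ocrProcessingStep>
    </OCRProcessing>
  </Description>
</alto>"""

print(ocr_software_names(SAMPLE))
print(uses_tesseract(SAMPLE))
```

Running `uses_tesseract` over the XML text of each page would flag any protocol whose OCR metadata names a different engine. How the files are fetched or located is left out here, since that depends on the local setup.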

MansMeg commented 8 months ago

I.e., we might be able to close this issue.

fredrik1984 commented 8 months ago

I think we can close the issue. I opened it because I forgot that we have already re-OCRed the protocols, and that the quality of our protocols in the XML files is significantly better than that of the same protocols on the Riksdag website.

MansMeg commented 8 months ago

Cool!