formerly-missing protocols

welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

formerly-missing protocols #473

Closed BobBorges closed 9 months ago

BobBorges commented 9 months ago

Some code has been fetched from other branches that are still under review, but this PR is mostly about the protocol files.

BobBorges commented 9 months ago

Not posting a sample yet because there is another batch to go in here.

BobBorges commented 9 months ago

Something is wrong.

MansMeg commented 9 months ago

Weren't there also missing protocols in the 1970s?

BobBorges commented 9 months ago

Yes, they're coming...

ninpnin commented 9 months ago

What's up with the modified protocol files?

BobBorges commented 9 months ago

They were new yesterday; I modified them today. The exception is a couple of those 1924 protocols: an ak protocol had an fk file name and the ak file was missing, so new files were added and the affected files were renamed correctly.

BobBorges commented 9 months ago

The metadata unit test failure is expected at this point. I will figure out the next-prev issue.

BobBorges commented 9 months ago

@ninpnin feat: make div ID deterministic

Should these div IDs be reset for all documents that used the script before your edit?

ninpnin commented 9 months ago

If there are any unpushed new IDs, I would regenerate them. If not, we don't gain anything by changing them again.
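
A minimal sketch of what a deterministic ID can look like, for illustration only. The function name `deterministic_div_id`, the `i-` prefix, and the choice of hashing the protocol ID together with the div's position are all assumptions, not the scheme used by the script discussed here. The point is only that the same input always yields the same ID, so rerunning the script does not change IDs that were already pushed.

```python
import hashlib


def deterministic_div_id(protocol_id: str, index: int, length: int = 16) -> str:
    """Return a stable ID for the index-th div of a protocol.

    Hypothetical scheme: hash the protocol ID and the div position,
    then keep the first `length` hex characters of the digest.
    """
    seed = f"{protocol_id}:{index}".encode("utf-8")
    digest = hashlib.sha256(seed).hexdigest()
    return f"i-{digest[:length]}"


# Rerunning with the same inputs gives the same IDs every time.
first_run = [deterministic_div_id("prot-1975--076", i) for i in range(3)]
second_run = [deterministic_div_id("prot-1975--076", i) for i in range(3)]
```

With a scheme like this, regenerating is only needed for files whose IDs came from the earlier, non-deterministic version and have not been pushed yet.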

BobBorges commented 9 months ago

The unit tests should pass now. @ninpnin here are 5 of 100 newly added files -- pls take a look at them.

corpus/protocols/1975/prot-1975--076.xml
corpus/protocols/197677/prot-197677--043.xml
corpus/protocols/197677/prot-197677--050.xml
corpus/protocols/197677/prot-197677--058.xml
corpus/protocols/199495/prot-199495--123.xml

BobBorges commented 9 months ago

The unit test fails because:

The 1975 protocols were re-OCRed in their entirety.
I only pushed the new protocols.
The fresh OCR (or something in the pipeline) skipped a whole chunk of a page that happened to contain a date (see image).
The dates for the unit test were scraped and pushed based on my local working dir, which doesn't match origin.

@MansMeg @ninpnin Any input on handling this? I could try rerunning OCR -> pipeline -> etc. on the problem protocol to see if that solves the problem, but the fact that the OCR software just lost a whole quarter of a page is a bit concerning. Alternatively, I could restore the old versions of the protocols that aren't new, but the re-OCR was supposed to improve the overall quality.
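
One way to find other pages where the re-OCR lost text might be to compare word counts page by page between the old and new output. This is a rough sketch under assumptions: `pages_with_lost_text` is a hypothetical helper, it expects the page texts to be extracted already and aligned in the same order, and the 0.75 threshold is arbitrary.

```python
def pages_with_lost_text(old_pages, new_pages, min_ratio=0.75):
    """Flag pages whose new OCR text is much shorter than the old text.

    old_pages / new_pages: lists of page text strings in the same order.
    Returns a list of (page_number, old_word_count, new_word_count).
    """
    flagged = []
    for page_no, (old, new) in enumerate(zip(old_pages, new_pages), start=1):
        old_words = len(old.split())
        new_words = len(new.split())
        # Skip empty old pages to avoid dividing by zero.
        if old_words and new_words / old_words < min_ratio:
            flagged.append((page_no, old_words, new_words))
    return flagged


old = ["a b c d e f g h", "one two three four"]
new = ["a b c d e", "one two three four"]
suspicious = pages_with_lost_text(old, new)  # page 1 drops from 8 to 5 words
```

A word-count ratio will not catch every problem (a lost date is only a few words), but it should surface a missing quarter of a page.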

BobBorges commented 9 months ago

Rerunning OCR without preprocessing (deskew, grayscale, thresholding) produces an ALTO file that contains the missing date in question :|

ninpnin commented 9 months ago

Do we need to do the preprocessing? I thought tesseract does pretty well on its own

ninpnin commented 9 months ago

Or were they OCRd by KB?

BobBorges commented 9 months ago

No, we don't need preprocessing, I suppose -- I thought these steps could only improve the result, but apparently not.

BobBorges commented 9 months ago

The OCR issue has been solved on my end. The "deskew" step, meant to straighten up crooked scans, was turning some of the images 90°, essentially cropping the top/bottom of the document. The fact that tesseract still managed to read the vertically oriented text suggests deskewing is not a necessary step, so I am omitting it from the workflow and rerunning OCR for the documents I have been working on.
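
For reference, a quarter turn like the one described above can be spotted by comparing page dimensions before and after deskewing: a portrait page that comes out landscape has been rotated. The sketch below is an assumption-laden illustration, not the workflow used here; `rotated_pages` and the page names are hypothetical, and the sizes are passed in as plain `(width, height)` tuples.

```python
def rotated_pages(sizes_before, sizes_after):
    """Return names of pages whose orientation flipped during deskewing.

    sizes_before / sizes_after: dicts mapping page name -> (width, height).
    """
    def is_portrait(size):
        width, height = size
        return height >= width

    return sorted(
        name
        for name, before in sizes_before.items()
        if name in sizes_after
        and is_portrait(before) != is_portrait(sizes_after[name])
    )


before = {"page-001": (2000, 3000), "page-002": (2000, 3000)}
after = {"page-001": (3000, 2000), "page-002": (2010, 2990)}
flipped = rotated_pages(before, after)  # only page-001 changed orientation
```

The sizes could be collected with any image library, for example via the `size` attribute of an image opened with Pillow.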

MansMeg commented 9 months ago

Great! Could you check which pages were turned 90 degrees?

BobBorges commented 9 months ago

I think it would be easier to just re-do everything -- it's like 3 mins of active work to ocr->pipeline->rest of curation per year of protocols. The missing ones are concentrated in the late 1970s and early 1990s.

MansMeg commented 9 months ago

Ok. Sounds good to me.