BobBorges closed 9 months ago
Not posting a sample yet because there is another batch to go in here.
Something is wrong.
Weren't there also missing protocols in the 1970s?
Yes, they're coming...
What's up with the modified protocol files?
They were new yesterday; I modified them today. The exception is a couple of those 1924 protocols: there was an ak protocol with an fk file name, and the ak file was missing -- new files added, affected files renamed correctly.
The metadata unit test failure is expected at this point. I will figure out the next-prev issue.
@ninpnin feat: make div ID deterministic
Should these div IDs be reset for all documents that used the script before your edit?
If there are any unpushed new IDs, I would regenerate them. If not, we don't gain anything by changing them again.
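For readers unfamiliar with the idea, here is a minimal sketch of what "deterministic" could mean for div IDs: derive the ID from stable inputs with a hash, so rerunning the script on an unchanged document yields the same IDs. The function name, the seed format, and the `i-` prefix are all assumptions for illustration; the actual script in the repository may work differently.

```python
import hashlib


def deterministic_div_id(protocol_id: str, div_index: int, length: int = 16) -> str:
    """Derive a stable ID from a protocol ID and the div's position in it.

    Hypothetical helper: the seed format and the "i-" prefix are assumptions,
    not the project's actual scheme.
    """
    seed = f"{protocol_id}:{div_index}".encode("utf-8")
    digest = hashlib.sha256(seed).hexdigest()
    # Truncate the hex digest to keep the ID short; the same inputs always
    # produce the same output, unlike a random UUID.
    return f"i-{digest[:length]}"


first = deterministic_div_id("prot-1975--076", 0)
again = deterministic_div_id("prot-1975--076", 0)
other = deterministic_div_id("prot-1975--076", 1)
```

With a scheme like this, IDs only change when the inputs change, which is why regenerating already pushed IDs would gain nothing.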
The unit tests should pass now. @ninpnin here are 5 of the 100 newly added files -- please take a look at them.
corpus/protocols/1975/prot-1975--076.xml
corpus/protocols/197677/prot-197677--043.xml
corpus/protocols/197677/prot-197677--050.xml
corpus/protocols/197677/prot-197677--058.xml
corpus/protocols/199495/prot-199495--123.xml
The unit test fails because:
@MansMeg @ninpnin Any input on handling this? I could try to re-OCR the problem protocol and rerun the pipeline etc. to see if that solves the problem, but the fact that the OCR software just lost a whole quarter of a page is a bit concerning. Alternatively, I could restore the old versions of the protocols that aren't new, but re-OCR was supposed to improve the overall quality.
Re-running OCR without preprocessing (deskew, grayscale, thresholding) spits out an ALTO file that contains the missing date in question :|
Do we need to do the preprocessing? I thought tesseract does pretty well on its own
Or were they OCRd by KB?
No, we don't need preprocessing, I suppose -- I thought these steps could only improve the result, but apparently not.
The OCR issue has been solved on my end. The "deskew" step, meant to straighten up crooked scans, was turning some of the images 90°, essentially cropping the top/bottom of the document. The fact that tesseract still managed to read the vertically oriented text suggests deskewing is not a necessary step, so I am omitting it from the workflow and rerunning OCR for the documents I have been working on.
Great! Could you check which pages were turned 90 degrees?
I think it would be easier to just redo everything -- it's about 3 minutes of active work to OCR -> pipeline -> rest of curation per year of protocols. The missing ones are concentrated in the late 1970s and early 1990s.
Ok. Sounds good to me.
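If the deskew step were ever reintroduced, one possible guard is to treat any estimated skew angle far from zero as suspicious, since a crooked scan is usually off by only a few degrees. The sketch below assumes the estimated angle per page is available (for example, logged by whatever deskew tool is used); the function names, the page keys, and the 15-degree threshold are hypothetical.

```python
def normalize_angle(angle_deg: float) -> float:
    """Map an angle in degrees into the interval (-180, 180]."""
    a = angle_deg % 360.0
    if a > 180.0:
        a -= 360.0
    return a


def suspicious_pages(angles_by_page: dict, max_skew_deg: float = 15.0) -> list:
    """Return the pages whose estimated skew is too large to be a real skew.

    `angles_by_page` maps a page identifier to the angle (in degrees) that
    the deskew step wanted to apply. The threshold is an assumed value.
    """
    return sorted(
        page
        for page, angle in angles_by_page.items()
        if abs(normalize_angle(angle)) > max_skew_deg
    )


# Hypothetical example input: two pages with plausible skew, two rotated.
angles = {"page-001": 0.8, "page-002": 90.0, "page-003": -1.5, "page-004": 270.0}
flagged = suspicious_pages(angles)
```

Pages flagged this way could be passed to OCR unrotated instead of being deskewed.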
Some code has been fetched from other branches that are still under review, but this PR is mostly about the protocol files.