Closed mikeapp closed 3 years ago
We did not find OCR/full text on the Fedora server, and no search results return OCR. The schema.xml indicates that text_rus, text_cze, text_gre, and text_noneng were the field types for full text. Solr queries on these fields return 0 results.
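For anyone who wants to repeat the check, here is a minimal sketch of how the per-field counts could be gathered. It assumes a standard Solr `select` handler that returns JSON; the Solr URL passed in is a placeholder, and the helper names (`count_query_url`, `num_found`, `fetch_counts`) are made up for illustration. Only the four field names come from schema.xml.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Field names taken from the schema.xml comment mentioned above.
FULLTEXT_FIELDS = ["text_rus", "text_cze", "text_gre", "text_noneng"]


def count_query_url(solr_select_url, field):
    """Build a select URL that counts documents with any value in `field`.

    rows=0 asks Solr for the hit count only, without returning documents.
    """
    params = {"q": f"{field}:[* TO *]", "rows": 0, "wt": "json"}
    return f"{solr_select_url}?{urlencode(params)}"


def num_found(response_json):
    """Pull the hit count out of a parsed Solr JSON response."""
    return response_json["response"]["numFound"]


def fetch_counts(solr_select_url):
    """Query each full-text field and return {field: hit count}.

    `solr_select_url` is an assumption, e.g. the core's /select endpoint.
    """
    counts = {}
    for field in FULLTEXT_FIELDS:
        with urlopen(count_query_url(solr_select_url, field)) as resp:
            counts[field] = num_found(json.load(resp))
    return counts
```

If every field comes back with a count of 0, that matches what we saw: nothing was ever indexed into those fields, or the index was not carried over.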
We reached out to Marena and Eric James, who left the fulltext comment in schema.xml. Here is his response: "I believe there was OCR, and it was in an indexed Fedora3 datastream. The full text might have been embedded in the PDF too (also a Fedora3 datastream), I don’t totally remember." Updated response from Eric: he did not see the OCR datastream: http://jss.library.yale.edu/fedora/objects/slavicbooks:282283/datastreams
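The datastream check Eric describes could be scripted roughly as below. This is a sketch under assumptions: it expects the XML form of the datastream listing (in Fedora 3 this is normally requested by adding `?format=xml` to the datastreams URL) and assumes each entry is a `datastream` element with `dsid`, `label`, and `mimeType` attributes. The heuristics for what counts as a full-text candidate, and the function names, are my own and not anything from the JSS code.

```python
import xml.etree.ElementTree as ET


def list_datastreams(xml_text):
    """Parse a datastream listing into a list of dicts.

    Namespace-agnostic: matches any element whose local name is 'datastream'.
    """
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iter():
        local_name = el.tag.rsplit("}", 1)[-1]  # drop '{namespace}' prefix
        if local_name == "datastream":
            found.append({
                "dsid": el.get("dsid", ""),
                "label": el.get("label", ""),
                "mimeType": el.get("mimeType", ""),
            })
    return found


def fulltext_candidates(datastreams):
    """Return datastreams that look like they might hold OCR or full text.

    Heuristic (an assumption): plain-text MIME type, or 'ocr' / 'fulltext'
    appearing in the datastream ID or label.
    """
    hints = ("ocr", "fulltext", "full text")
    matches = []
    for ds in datastreams:
        name = f"{ds['dsid']} {ds['label']}".lower()
        if ds["mimeType"] == "text/plain" or any(h in name for h in hints):
            matches.append(ds)
    return matches
```

Running `fulltext_candidates(list_datastreams(xml))` over a sample of objects would tell us quickly whether any object carries a text datastream, rather than checking one PID by hand.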
Marena forwarded a link to an article about the process of digitizing JSS: https://www.tandfonline.com/doi/full/10.1080/15228886.2012.706213 Here's what the article says about OCR:
"The three vendors delivered digital equivalents by shipping external drives or using a secure file-transfer system. These files were then checked by the project team to ensure that vendors fulfilled the requirements defined in the technical specifications. Quality control covered image processing (i.e., page splitting, canvas-size consistency, deskewing, despeckling), as well as the accuracy of the optical character recognition (OCR), which affects PDF searchability."
What we've discovered is that the PDF is being used for full-text search in JSS. There is no separate datastream in Fedora that stores the text. Eric James (one of the original developers of the interface) also confirmed that he didn't see one in Fedora, although he originally thought there was.
I see now, there's a PDF per page. Let me consider next steps.
Story
We may want to bring full text content over from JSS, if we can find it. Are there files on disk, or full-text data streams in the Fedora objects?
Acceptance