yalelibrary / YUL-DC

Preliminary issue tracking for Yale University Libraries Digital Collections project

SPIKE: Look for JSS full text files #1387

Closed mikeapp closed 3 years ago

mikeapp commented 3 years ago

Story

We may want to bring full text content over from JSS, if we can find it. Are there files on disk, or full-text data streams in the Fedora objects?

Acceptance

MaggieZhaoYale commented 3 years ago

We did not find OCR or full text on the Fedora server, and no search results return OCR content. The schema.xml indicates that text_rus, text_cze, text_gre, and text_noneng were the field types for full text. Solr queries on these fields return 0 results.
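For anyone repeating the Solr check, here is a minimal sketch of a per-field count query. It assumes the field type names above are also usable as field names, and the base URL and core name in the usage note are hypothetical placeholders, not the real JSS Solr endpoint. The helpers only build the URL and parse a response body, so fetching is left to whatever HTTP client is at hand.

```python
import json
from urllib.parse import urlencode

# Names taken from schema.xml as reported above; treating them as
# queryable field names is an assumption.
FULLTEXT_FIELDS = ["text_rus", "text_cze", "text_gre", "text_noneng"]


def build_count_url(solr_base, field):
    """Build a select URL that asks only for the number of docs with any value in `field`."""
    params = {
        "q": f"{field}:[* TO *]",  # matches documents that have a value in the field
        "rows": 0,                  # we only need the count, not the documents
        "wt": "json",
    }
    return f"{solr_base}/select?{urlencode(params)}"


def num_found(response_body):
    """Extract the hit count from a Solr JSON response body."""
    return json.loads(response_body)["response"]["numFound"]
```

Usage would look like `build_count_url("http://localhost:8983/solr/jss", "text_rus")` for each entry in `FULLTEXT_FIELDS`, then passing each response body to `num_found`. A count of 0 for every field matches what we saw.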

We reached out to Marena and to Eric James, who left the full-text comment in schema.xml. Here is his response: "I believe there was OCR, and it was in an indexed Fedora3 datastream. The full text might have been embedded in the PDF too (also a Fedora3 datastream), I don’t totally remember." Updated response from Eric: he did not see an OCR datastream at http://jss.library.yale.edu/fedora/objects/slavicbooks:282283/datastreams
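The datastream check can be repeated across more objects with something like the sketch below. It assumes the Fedora 3 datastreams listing is requested as XML and that each entry is a `datastream` element carrying a `dsid` attribute. The sample listing and the "OCR"/"TEXT" name hints are hypothetical and are not taken from the actual JSS objects.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing, shaped like a Fedora 3 datastreams response.
SAMPLE_LISTING = """\
<objectDatastreams xmlns="http://www.fedora.info/definitions/1/0/access/" pid="example:1">
  <datastream dsid="DC" label="Dublin Core" mimeType="text/xml"/>
  <datastream dsid="PDF" label="Page PDF" mimeType="application/pdf"/>
  <datastream dsid="JP2" label="Page image" mimeType="image/jp2"/>
</objectDatastreams>
"""

# Assumed naming hints for a full-text datastream; adjust as needed.
FULLTEXT_HINTS = ("OCR", "TEXT")


def datastream_ids(listing_xml):
    """Return the dsid of every datastream element, ignoring the XML namespace."""
    root = ET.fromstring(listing_xml)
    return [
        el.get("dsid")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "datastream" and el.get("dsid")
    ]


def fulltext_candidates(listing_xml):
    """Return dsids whose name suggests a separate OCR or full-text datastream."""
    return [
        dsid
        for dsid in datastream_ids(listing_xml)
        if any(hint in dsid.upper() for hint in FULLTEXT_HINTS)
    ]
```

An empty result from `fulltext_candidates` for an object would be consistent with Eric's updated response, though a name-based check like this can miss a datastream that uses an unexpected ID.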

Marena forwarded a link to an article about the process of digitizing JSS: https://www.tandfonline.com/doi/full/10.1080/15228886.2012.706213 Here's what the article says about OCR:

"The three vendors delivered digital equivalents by shipping external drives or using a secure file-transfer system. These files were then checked by the project team to ensure that vendors fulfilled the requirements defined in the technical specifications. Quality control covered image processing (i.e., page splitting, canvas-size consistency, deskewing, despeckling), as well as the accuracy of the optical character recognition (OCR), which affects PDF searchability."

kcompsci commented 3 years ago

What we've discovered is that JSS uses the PDFs themselves for full-text search. There is no separate datastream in Fedora that stores the text. Eric James (one of the original developers of the interface) also confirmed that he didn't see one in Fedora, although he originally thought there was.

mikeapp commented 3 years ago

I see now, there's a PDF per page. Let me consider next steps.