Open dstillman opened 6 years ago
So a PDF must be recognized when:
1) The user enters a PDF URL
2) The user uploads a PDF file
For 1), we have to take over the URL if it turns out to be a PDF URL and, instead of processing it with translators, upload it to S3 and trigger further processing. Currently, if Zotero.HTTP.request is set to return a document, it treats every response as a document, regardless of whether it is PDF or HTML. That's the first thing we need to fix. I think we should check the response's Content-Type and process the content with JSDOM only if it is text/html. Otherwise the request function should just return raw data.
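A minimal sketch of that check, assuming the decision is made from the Content-Type header with a fallback to the PDF signature bytes. `classifyResponse` is a hypothetical helper, not an existing part of Zotero.HTTP:

```javascript
// Hypothetical helper: decide how a fetched response body should be handled.
// Returns 'html' (parse with JSDOM), 'pdf' (upload for recognition) or 'raw'.
function classifyResponse(contentType, body) {
  // Drop parameters such as "; charset=utf-8" and normalize case
  const mediaType = (contentType || '').split(';')[0].trim().toLowerCase();
  if (mediaType === 'text/html') {
    return 'html';
  }
  // Assumption: some servers mislabel PDFs (e.g. application/octet-stream),
  // so also look for the "%PDF-" signature at the start of the body
  const looksLikePdf = Buffer.isBuffer(body)
    && body.subarray(0, 5).toString('latin1') === '%PDF-';
  if (mediaType === 'application/pdf' || looksLikePdf) {
    return 'pdf';
  }
  return 'raw';
}
```

Only the 'html' case would go through JSDOM; everything else stays as raw bytes for the caller to handle.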
Then we'll need to update processDocuments and the functions that call it. There should be a condition that checks whether Zotero.HTTP.request returned a document, which should be passed to translators, or a PDF file, which should be uploaded to S3.
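The branching could look roughly like this. `routeResponse`, `translate`, and `uploadPdf` are hypothetical names standing in for the translator path and the S3 upload path, and the sketch assumes the request function returns an object with a `type` field, which is not how Zotero.HTTP.request behaves today:

```javascript
// Hypothetical dispatcher: send HTML documents to translators and PDFs to upload.
// `result` is assumed to look like { type: 'html' | 'pdf' | ..., data: ... }.
function routeResponse(result, handlers) {
  switch (result.type) {
    case 'html':
      // Parsed document: continue with the normal translator flow
      return handlers.translate(result.data);
    case 'pdf':
      // Raw PDF bytes: upload to S3 and trigger recognition
      return handlers.uploadPdf(result.data);
    default:
      throw new Error(`Unsupported response type: ${result.type}`);
  }
}
```

Both handlers may return promises; the dispatcher just passes their return value through.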
Next, we should limit download size, but with request.js we can only do that by manually listening on the stream and counting bytes. I think the file should be limited to 50MB.
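A sketch of counting bytes on the stream, assuming the response is consumed as a Node.js Readable that emits Buffer chunks. `makeByteCounter` and `collectLimited` are hypothetical names, and 50MB is taken here as 50 * 1024 * 1024 bytes:

```javascript
// Assumed limit from the discussion above: 50MB
const MAX_PDF_BYTES = 50 * 1024 * 1024;

// Returns a function that adds up chunk sizes and throws once the limit is exceeded
function makeByteCounter(maxBytes) {
  let total = 0;
  return function count(chunk) {
    total += chunk.length;
    if (total > maxBytes) {
      const err = new Error(`Response exceeds ${maxBytes} bytes`);
      err.code = 'E_TOO_LARGE'; // hypothetical error code
      throw err;
    }
    return total;
  };
}

// Buffers a readable stream into memory, aborting as soon as it grows too large
function collectLimited(stream, maxBytes = MAX_PDF_BYTES) {
  return new Promise((resolve, reject) => {
    const count = makeByteCounter(maxBytes);
    const chunks = [];
    stream.on('data', (chunk) => {
      try {
        count(chunk);
        chunks.push(chunk);
      }
      catch (err) {
        stream.destroy(); // stop reading; no more data is wanted
        reject(err);
      }
    });
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });
}
```

Aborting on the byte count rather than trusting Content-Length matters because that header can be missing or inaccurate.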
Now for option 2), the client should first get a signed URL from t-s, then upload the file, and then query t-s again.
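From the client's side, that three-step flow might be sketched as follows. Every endpoint path and field name here (`/upload-url`, `/recognize`, `uploadUrl`, `fileId`) is an assumption made for illustration; none of them exist in translation-server. `http` is an injected fetch-like function so the sketch does not depend on a particular HTTP library:

```javascript
// Sketch of the upload flow; all endpoint paths and field names are assumptions.
// `http` is a function (url, options) => Promise<{ status, json() }>,
// for example the global fetch in Node.js 18+ or in a browser.
async function recognizeUploadedPdf(http, serverUrl, pdfBytes) {
  // 1. Ask translation-server for a signed upload URL (hypothetical endpoint)
  const signedRes = await http(`${serverUrl}/upload-url`, { method: 'POST' });
  const { uploadUrl, fileId } = await signedRes.json();

  // 2. Upload the PDF directly to storage using the signed URL
  const putRes = await http(uploadUrl, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/pdf' },
    body: pdfBytes
  });
  if (putRes.status < 200 || putRes.status >= 300) {
    throw new Error(`Upload failed with status ${putRes.status}`);
  }

  // 3. Query translation-server again so it can process the uploaded file
  const metaRes = await http(`${serverUrl}/recognize`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ fileId })
  });
  return metaRes.json();
}
```

With a signed URL the PDF bytes go straight to storage, so the server itself never has to receive the file body from the client.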
Hey, is there any progress on this? I believe this feature makes sense.
FYI, a notable software library to extract metadata from PDFs is grobid: https://github.com/kermitt2/grobid
The https://github.com/zotero/recognizer-server repo is not publicly available, apparently because it isn't self-contained: https://forums.zotero.org/discussion/80101/zotero-service-for-metadata-extraction. What external APIs does the service rely on? Stuff like AWS/GCP/Azure OCR services? Then we could figure out how to make it modular so users could use open source alternatives locally.
Tentative plan:
1) When downloading a URL, either make a HEAD request first to see if the URL is a PDF or, if possible, gracefully handle PDF downloads in Zotero.HTTP.request() with a maximum download size.
2) Add another endpoint that accepts PDF data.
3) Once we have the PDF data, upload that to a new recognizer-server endpoint.
4) recognizer-server might send the PDF data to a Lambda for pdftotext processing, or it might be in Lambda itself if we move the DB from SQLite to MySQL
5) translation-server gets back identifiers from recognizer-server, runs translation on them, and returns metadata
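For the HEAD-request variant of step 1, a rough sketch could look like this. `probeUrl` is a hypothetical helper, `http` is an injected fetch-like function, and the response is assumed to expose headers through a `headers.get(name)` method as fetch responses do:

```javascript
// Hypothetical pre-check: inspect headers before downloading the body.
// `http` is a function (url, options) => Promise<{ headers: { get(name) } }>.
async function probeUrl(http, url, maxBytes) {
  const res = await http(url, { method: 'HEAD' });
  const mediaType = (res.headers.get('content-type') || '')
    .split(';')[0].trim().toLowerCase();
  // Number(null) is 0, so a missing Content-Length never counts as too large
  const length = Number(res.headers.get('content-length'));
  return {
    isPdf: mediaType === 'application/pdf',
    // Content-Length can be absent or wrong, so the actual download
    // would still need its own byte limit
    tooLarge: Number.isFinite(length) && length > maxBytes
  };
}
```

Not every server answers HEAD requests accurately, which is one reason the plan keeps the alternative of handling PDFs during the download itself.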