PDF recognition

dstillman commented 6 years ago

Tentative plan:

1) When downloading a URL, either make a HEAD request first to see if the URL is a PDF or, if possible, gracefully handle PDF downloads in Zotero.HTTP.request() with a maximum download size.

2) Add another endpoint that accepts PDF data.

3) Once we have the PDF data, upload that to a new recognizer-server endpoint.

4) recognizer-server might send the PDF data to a Lambda for pdftotext processing, or it might run in Lambda itself if we move the DB from SQLite to MySQL.

5) translation-server gets back identifiers from recognizer-server, runs translation on them, and returns metadata.
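As a rough illustration of the HEAD-request idea in step 1, the check could be reduced to a small pure function over the response headers. This is only a sketch: `inspectHeadHeaders` and `MAX_PDF_BYTES` are hypothetical names, not part of translation-server, and the 50 MB value is borrowed from the limit suggested later in this thread.

```javascript
// Hypothetical helper, not an existing translation-server function.
// Assumed limit: 50 MB, the figure suggested later in this thread.
const MAX_PDF_BYTES = 50 * 1024 * 1024;

function inspectHeadHeaders(headers) {
  // Header names are case-insensitive, so normalize the keys first.
  const normalized = {};
  for (const [name, value] of Object.entries(headers)) {
    normalized[name.toLowerCase()] = String(value);
  }

  // Drop parameters such as "; charset=..." before comparing the media type.
  const mediaType = (normalized['content-type'] || '')
    .split(';')[0]
    .trim()
    .toLowerCase();

  // Content-Length is optional; treat a missing or invalid value as unknown.
  const length = Number.parseInt(normalized['content-length'], 10);
  const size = Number.isNaN(length) ? null : length;

  return {
    isPdf: mediaType === 'application/pdf',
    size,
    tooLarge: size !== null && size > MAX_PDF_BYTES,
  };
}
```

Servers do not always answer HEAD requests accurately, and Content-Length may be absent, so a check like this would probably still need the download-time size limit as a backstop.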

mrtcode commented 5 years ago

So a PDF must be recognized when:

1) User enters a PDF URL

2) User uploads a PDF file

For 1), we have to take over the URL if it turns out to be a PDF URL and, instead of processing it with translators, upload it to S3 and trigger further processing. Currently, if Zotero.HTTP.request is set to return a document, it treats every file as a document, regardless of whether it is PDF or HTML. That's the first thing we need to fix. I think we should check whether the response's Content-Type is text/html, and only then process the content with JSDOM. Otherwise the request function should just return the raw data.

Then we'll need to update processDocuments and the functions that call it. There should be a condition that checks whether Zotero.HTTP.request returned a document, which should be passed to translators, or a PDF file, which should be uploaded to S3.
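The content-type check and the caller-side branch described above might look roughly like the following. This is a sketch under assumptions: `wrapResponse` and `route` are made-up names, and `parseHtml` is a placeholder for the JSDOM step, passed in so the example does not depend on JSDOM itself.

```javascript
// Sketch only: parseHtml stands in for the JSDOM parsing step and is
// supplied by the caller. None of these names exist in translation-server.
function wrapResponse(contentType, body, parseHtml) {
  // Ignore parameters such as "; charset=utf-8".
  const mediaType = (contentType || '').split(';')[0].trim().toLowerCase();

  if (mediaType === 'text/html') {
    // Only HTML is turned into a document.
    return { kind: 'document', document: parseHtml(body) };
  }
  // Everything else, PDF included, is returned as raw data.
  return { kind: 'raw', mediaType, data: body };
}

// Caller-side branch, in the spirit of the processDocuments change:
// documents go to translators, PDFs go to the S3 upload path.
function route(result) {
  if (result.kind === 'document') return 'translate';
  if (result.mediaType === 'application/pdf') return 'upload-to-s3';
  return 'unsupported';
}
```

Returning a tagged object keeps the decision in one place, so callers branch on `kind` and never have to inspect the body themselves.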

Next, we should limit the download size, but with request.js we can only do that by manually listening on the stream and counting bytes. I think files should be limited to 50MB.
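A minimal sketch of the byte counting, assuming chunks arrive as Buffers (so `chunk.length` is a byte count). `createByteCounter` is a hypothetical helper, and the wiring into request.js is shown only as a comment.

```javascript
// Assumed limit, taken from the 50MB suggestion above.
const MAX_DOWNLOAD_BYTES = 50 * 1024 * 1024;

// Hypothetical helper: returns a function to call for every received chunk.
// It returns the running total and throws once the limit is exceeded.
function createByteCounter(maxBytes = MAX_DOWNLOAD_BYTES) {
  let received = 0;
  return function onChunk(chunk) {
    received += chunk.length; // Buffer length is measured in bytes
    if (received > maxBytes) {
      throw new Error(`Download exceeds ${maxBytes} bytes`);
    }
    return received;
  };
}

// Possible wiring with request.js (not executed here):
//
//   const count = createByteCounter();
//   const req = request.get(url);
//   req.on('data', (chunk) => {
//     try { count(chunk); } catch (e) { req.abort(); /* report e */ }
//   });
```

Counting during the download is still needed when a Content-Length header is present, since the header can be missing or wrong.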

Now for option 2), the client should first get a signed URL from t-s, then upload the file, and then query t-s again.

theFool32 commented 4 years ago

Hey, I wonder whether there has been any progress on this? I believe this feature does make sense.

monperrus commented 4 years ago

FYI, a notable software library to extract metadata from PDFs is grobid: https://github.com/kermitt2/grobid

alexkreidler commented 1 year ago

The https://github.com/zotero/recognizer-server repo is not publicly available, apparently because it isn't self-contained: https://forums.zotero.org/discussion/80101/zotero-service-for-metadata-extraction. What external APIs does the service rely on? Stuff like AWS/GCP/Azure OCR services? Then we could figure out how to make it modular so users could use open source alternatives locally.

zotero / translation-server

PDF recognition #38