Add PDF handling - Githubissues

mrtcode commented 6 years ago

Solves #38. Recognizes metadata when a PDF file is uploaded or a PDF URL is passed to the /web endpoint. For this to work, the recognizer-server lambda branch has to be deployed.
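For illustration, passing a PDF URL to /web could look roughly like the sketch below. It assumes a translation-server listening at a base URL you supply (http://127.0.0.1:1969 is the address used in the translation-server README) and sends the target URL as a plain-text POST body, which is how /web is normally called. The helper names `buildWebRequest` and `recognizeFromUrl` are hypothetical and not part of this PR.

```javascript
// Hypothetical helper: builds the fetch() arguments for a /web request.
// baseUrl is an assumption; point it at your own translation-server.
function buildWebRequest(baseUrl, targetUrl) {
    return {
        url: new URL('/web', baseUrl).toString(),
        options: {
            method: 'POST',
            headers: {'Content-Type': 'text/plain'},
            body: targetUrl
        }
    };
}

// Sends the request and returns the parsed JSON response.
// Any non-2xx status is treated as a failure in this sketch.
async function recognizeFromUrl(baseUrl, targetUrl) {
    const {url, options} = buildWebRequest(baseUrl, targetUrl);
    const response = await fetch(url, options);
    if (!response.ok) {
        throw new Error(`/web returned HTTP ${response.status}`);
    }
    return response.json();
}
```

Usage would be along the lines of `recognizeFromUrl('http://127.0.0.1:1969', 'https://example.org/paper.pdf')`, where the PDF URL is a placeholder.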

mrtcode commented 6 years ago

A small demo that uploads a PDF file and gets the recognized data. REPLACE_WITH_API_GATEWAY_URL has to be replaced with the real URL. CORS has to be configured for API Gateway, or disabled in the browser when testing for development purposes.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Upload</title>
</head>
<body>

<form enctype="multipart/form-data" method="post" name="fileinfo">
    <input id="file1" type="file" name="file" required/>
    <input type="button" value="Upload" onclick="upload()"/>
</form>

<script>
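    // Demo flow: request upload parameters, upload the PDF,
    // then ask the server to process the uploaded file.
    async function upload() {
        let fileInput = document.getElementById('file1');

        // Step 1: get the upload URL and the form fields to send with the file
        let response = await fetch('https://REPLACE_WITH_API_GATEWAY_URL/recognize/getUploadParams', {method: 'POST'});
        let data = await response.json();

        let url = data.data.url;
        let fields = data.data.fields;

        // Step 2: build the upload form; the file is appended after the other fields
        const formData = new FormData();
        for (const key in fields) {
            formData.append(key, fields[key]);
        }

        formData.append("file", fileInput.files[0]);

        response = await fetch(url, {method: 'POST', body: formData});

        console.log(response);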

        // Step 3: trigger recognition of the uploaded file
        response = await fetch('https://REPLACE_WITH_API_GATEWAY_URL/recognize/process', {
            method: 'POST',
            body: data.uploadID
        });

        console.log(response);
    }
</script>

</body>
</html>
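
The demo above logs the responses without checking their status. A minimal sketch of the upload step with status checks is below. The helper names (`orderedUploadEntries`, `postOrThrow`, `uploadToS3`) are hypothetical. It assumes `url` and `fields` have the shape returned by getUploadParams in the demo and that the upload target is an S3 POST form, where the file is expected to be the last field.

```javascript
// Returns [name, value] pairs with the file last, because S3 POST uploads
// ignore any form fields that come after the file field.
function orderedUploadEntries(fields, file) {
    const entries = Object.entries(fields).filter(([name]) => name !== 'file');
    entries.push(['file', file]);
    return entries;
}

// POSTs to url and throws if the response status is not 2xx.
// `step` is only used to label the error message.
async function postOrThrow(url, init, step) {
    const response = await fetch(url, {method: 'POST', ...init});
    if (!response.ok) {
        throw new Error(`${step} failed with HTTP ${response.status}`);
    }
    return response;
}

// Uploads the file with the given form fields and fails loudly on errors.
async function uploadToS3(url, fields, file) {
    const formData = new FormData();
    for (const [name, value] of orderedUploadEntries(fields, file)) {
        formData.append(name, value);
    }
    return postOrThrow(url, {body: formData}, 'S3 upload');
}
```

The same `postOrThrow` helper could wrap the getUploadParams and process calls, so a failed step stops the flow instead of silently continuing.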

dhimmel commented 5 years ago

@mrtcode this sounds like an important feature. If I understand correctly, this would be able to extract citation metadata for PDF URLs such as https://openreview.net/pdf?id=BkeCW-q6aQ.

I don't understand how this functionality relates to the s3Upload addition to the config file. Will users be required to integrate with s3 to use this feature?

mrtcode commented 5 years ago

@dhimmel this feature provides the same PDF recognition functionality that the Zotero client has now. So if the client recognizes a given PDF, translation-server should be able to as well.

The s3Upload bucket is necessary for two reasons. Firstly, it is used to upload PDF files directly from the web browser; secondly it's an architectural choice to pass files to recognizer-server Lambda function. So yes, S3 is necessary for this feature to work on translation-server.

dhimmel commented 5 years ago

Thanks for the info.

> secondly it's an architectural choice to pass files to recognizer-server Lambda function

We're setting up a translation-server for Manubot users as per https://github.com/greenelab/manubot/issues/82. We'd prefer to have PDF recognition occur on the same server to minimize the setup complexity. What's the reasoning behind having recognizer-server operate on AWS Lambda versus wherever is running the translation-server?

dstillman commented 5 years ago

We're trying to use Lambda for translation-server too.

dstillman commented 5 years ago

(And Lambda has an input size limit, so we can't pass the file straight to recognizer-server (or to translation-server for direct PDF uploading).)
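
To make the size constraint concrete, here is a rough sketch of the check involved. It assumes the 6 MB synchronous invocation payload quota from the AWS Lambda documentation and that a binary file would be base64-encoded inside the request. Both the constant and the function names are illustrative, not taken from this codebase, and the exact quota should be checked against current AWS documentation.

```javascript
// Assumed quota for a synchronous Lambda invocation payload (verify against AWS docs).
const ASSUMED_SYNC_PAYLOAD_LIMIT = 6 * 1024 * 1024;

// Length in characters of the padded base64 encoding of byteLength bytes.
function base64Length(byteLength) {
    return 4 * Math.ceil(byteLength / 3);
}

// True if a file of byteLength bytes, once base64-encoded, fits under the limit.
// Ignores the extra bytes taken by the surrounding request envelope.
function fitsInlinePayload(byteLength, limit = ASSUMED_SYNC_PAYLOAD_LIMIT) {
    return base64Length(byteLength) <= limit;
}
```

Under these assumptions a 5 MB PDF already fails the check after encoding, which is the kind of case that uploading to S3 first avoids.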

dhimmel commented 5 years ago

> We're trying to use Lambda for translation-server too.

Ah I didn't realize that AWS Lambda / s3 was going to become a dependency for full functionality of the codebase.

Would it be possible for there to be an option to run recognizer-server locally? Requiring integration with AWS places a pretty high bar to entry and binds the codebase to a proprietary service.

Currently, we're running translation-server on a Google Cloud instance, but would also like to get individual Manubot instances (i.e. end users) running their own translation-servers behind the scenes.

dstillman commented 5 years ago

> I didn't realize that AWS Lambda / s3 was going to become a dependency for full functionality of the codebase

It isn't. translation-server itself still runs fine on Node, and will continue to do so.

recognizer-server requires a database with ongoing maintenance, so it adds a lot more complexity than translation-server on its own (and certainly wouldn't work for an end-user install). PDF recognition has never been a part of translation-server up to now, and like text search (a.k.a. identifier-search), which also requires a separate database, this just isn't something that we're considering part of public functionality at this time.

As in the Zotero client, the recommended way to save is to save from article pages, not from PDFs directly. For your example PDF, that would mean saving from https://openreview.net/forum?id=BkeCW-q6aQ instead of the PDF URL.
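
For a caller that only has the PDF URL, the rewrite for this particular example can be sketched as below. The function name `openReviewLandingPage` is hypothetical, and the rule covers only the OpenReview URL pattern shown above. Other sites would need their own mapping, and some PDFs have no landing page at all.

```javascript
// Maps https://openreview.net/pdf?id=... to https://openreview.net/forum?id=...
// Any other URL is returned as parsed, without changes to host or path.
function openReviewLandingPage(urlString) {
    const url = new URL(urlString);
    const isOpenReviewPdf = url.hostname === 'openreview.net'
        && url.pathname === '/pdf'
        && url.searchParams.has('id');
    if (isOpenReviewPdf) {
        url.pathname = '/forum';
    }
    return url.toString();
}
```

The returned URL is what would then be passed to /web instead of the PDF link.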

dhimmel commented 5 years ago

> translation-server itself still runs fine on Node, and will continue to do so.

Great. Didn't realize the PDF recognition had such complexity (i.e. the separate database requirement). Perhaps at some point in the future, we can take a second look to see if a simplified PDF recognition workflow is possible.

> the recommended way to save is to save from article pages, not from PDFs directly

We will make this recommendation to Manubot users as well, although some highly cited PDFs, for example https://bitcoin.org/bitcoin.pdf, have no HTML page substitute.

mrtcode commented 5 years ago

This branch is now merged with the changes made by #69. I also changed the content-type filtering logic and added some minor fixes.
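
The thread does not show the filtering logic itself. As a purely illustrative sketch, and not the code from this PR, a check that decides whether a response should be routed to PDF recognition might look like this. The function name `isPdfContentType` is hypothetical.

```javascript
// Returns true when a Content-Type header value denotes a PDF.
// Parameters such as "; charset=binary" and letter case are ignored.
function isPdfContentType(headerValue) {
    if (typeof headerValue !== 'string') {
        return false;
    }
    const mediaType = headerValue.split(';')[0].trim().toLowerCase();
    return mediaType === 'application/pdf';
}
```

A real implementation may also need to handle servers that send PDFs with a generic type such as application/octet-stream, which this sketch deliberately does not cover.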

monperrus commented 4 years ago

PR dead or resuscitable?

zotero / translation-server

Add PDF handling #59