Open mrtcode opened 6 years ago
A small demo that uploads a PDF file and gets the recognized data. REPLACE_WITH_API_GATEWAY_URL has to be replaced with the real URL. CORS has to be configured for API Gateway, or disabled in the browser when testing for development purposes.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Upload</title>
</head>
<body>
<form enctype="multipart/form-data" method="post" name="fileinfo">
<input id="file1" type="file" name="file" required/>
<input type="button" value="Upload" onclick="upload()"/>
</form>
<script>
async function upload() {
    const fileInput = document.getElementById('file1');
    // Step 1: request the upload URL and form fields for the direct upload
    let response = await fetch('https://REPLACE_WITH_API_GATEWAY_URL/recognize/getUploadParams', {method: 'POST'});
    const data = await response.json();
    const url = data.data.url;
    const fields = data.data.fields;
    // Step 2: send the returned fields first, then the file itself
    const formData = new FormData();
    for (const key in fields) {
        formData.append(key, fields[key]);
    }
    formData.append("file", fileInput.files[0]);
    response = await fetch(url, {method: 'POST', body: formData});
    console.log(response);
    // Step 3: ask the recognizer to process the uploaded file
    response = await fetch('https://REPLACE_WITH_API_GATEWAY_URL/recognize/process', {
        method: 'POST',
        body: data.uploadID
    });
    console.log(response);
}
</script>
</body>
</html>
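The demo above appends the returned fields first and the file last. Assuming the upload target is an S3 presigned POST (which the `url` plus `fields` response suggests, but the thread does not confirm), that order matters, because S3 ignores form fields that come after the file part. A minimal sketch of that ordering, where `buildUploadEntries` and `toFormData` are hypothetical helper names and not part of the PR:

```javascript
// Hypothetical helper: order the form entries for a presigned POST upload.
// Assumption: `fields` is the plain object returned by getUploadParams.
function buildUploadEntries(fields, file) {
  // Keep the server-provided fields in their original order...
  const entries = Object.entries(fields);
  // ...and always put the file part last.
  entries.push(['file', file]);
  return entries;
}

// Hypothetical helper: turn the ordered entries into a FormData body.
// Requires an environment with a global FormData (browsers, Node 18+).
function toFormData(entries) {
  const formData = new FormData();
  for (const [key, value] of entries) {
    formData.append(key, value);
  }
  return formData;
}
```

In the demo, this would be used as `fetch(url, {method: 'POST', body: toFormData(buildUploadEntries(fields, fileInput.files[0]))})`.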
@mrtcode this sounds like an important feature. If I understand correctly, this would be able to extract citation metadata for PDF URLs such as https://openreview.net/pdf?id=BkeCW-q6aQ.
I don't understand how this functionality relates to the s3Upload addition to the config file? Will users be required to integrate with S3 to use this feature?
@dhimmel this feature provides the same PDF recognition functionality that we currently have in the Zotero client. So if the client recognizes this PDF, translation-server should be able to as well.
The s3Upload bucket is necessary for two reasons. Firstly, it is used to upload PDF files directly from the web browser; secondly, it's an architectural choice to pass files to the recognizer-server Lambda function. So, yes, for this feature to work on translation-server, S3 is necessary.
Thanks for the info.
secondly it's an architectural choice to pass files to recognizer-server Lambda function
We're setting up a translation-server for Manubot users as per https://github.com/greenelab/manubot/issues/82. We'd prefer to have PDF recognition occur on the same server to minimize the setup complexity. What's the reasoning behind having recognizer-server operate on AWS Lambda versus wherever is running the translation-server?
We're trying to use Lambda for translation-server too.
(And Lambda has an input size limit, so we can't pass the file straight to recognizer-server (or to translation-server for direct PDF uploading).)
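To make the size constraint concrete, here is a rough sketch of the check a client could run before deciding whether a PDF can be sent inline or has to go through S3. The 6 MB figure is the documented limit for synchronous Lambda invocation payloads at the time of writing, and the base64 step assumes binary data is sent base64-encoded; neither detail is stated in this thread, and the function names are illustrative only.

```javascript
// Assumption: 6 MB limit for synchronous Lambda invocation payloads.
const ASSUMED_LAMBDA_PAYLOAD_LIMIT = 6 * 1024 * 1024;

// Length of the base64 encoding (with padding) of `byteLength` raw bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// True if a file of `byteLength` bytes, once base64-encoded, still fits
// under the assumed payload limit. Request overhead is ignored here.
function fitsInLambdaPayload(byteLength, limit = ASSUMED_LAMBDA_PAYLOAD_LIMIT) {
  return base64Length(byteLength) <= limit;
}
```

Under these assumptions, a 5 MiB PDF already exceeds the limit after encoding, which is consistent with routing uploads through a bucket instead.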
We're trying to use Lambda for translation-server too.
Ah I didn't realize that AWS Lambda / s3 was going to become a dependency for full functionality of the codebase.
Would it be possible to have an option to run recognizer-server locally? Requiring integration with AWS places a pretty high bar to entry and binds the codebase to a proprietary service.
Currently, we're running translation-server on a Google Cloud instance, but would also like to get individual Manubot instances (i.e. end users) running their own translation-servers behind the scenes.
I didn't realize that AWS Lambda / s3 was going to become a dependency for full functionality of the codebase
It isn't. translation-server itself still runs fine on Node, and will continue to do so.
recognizer-server requires a database with ongoing maintenance, so it adds a lot more complexity than translation-server on its own (and certainly wouldn't work for an end-user install). PDF recognition has never been a part of translation-server up to now, and like text search (a.k.a. identifier-search), which also requires a separate database, this just isn't something that we're considering part of public functionality at this time.
As in the Zotero client, the recommended way to save is to save from article pages, not from PDFs directly. For your example PDF, that would mean saving from https://openreview.net/forum?id=BkeCW-q6aQ instead of the PDF URL.
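For the OpenReview example specifically, the PDF URL and the article page differ only in the path, so a client could rewrite one into the other before calling translation-server. This is a sketch for that single site, based only on the two URLs quoted in this thread; `openReviewArticleUrl` is a hypothetical helper, and other sites would need their own rules.

```javascript
// Hypothetical helper: map an OpenReview PDF URL to its article (forum) page.
// Returns null for anything that is not an openreview.net /pdf URL.
function openReviewArticleUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return null; // not a parseable absolute URL
  }
  if (url.hostname !== 'openreview.net' || url.pathname !== '/pdf') {
    return null;
  }
  // Keep the query string (?id=...) and swap only the path.
  url.pathname = '/forum';
  return url.href;
}
```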
translation-server itself still runs fine on Node, and will continue to do so.
Great. Didn't realize the PDF recognition had such complexity (i.e. the separate database requirement). Perhaps at some point in the future, we can take a second look to see if a simplified PDF recognition workflow is possible.
the recommended way to save is to save from article pages, not from PDFs directly
We will make this recommendation to Manubot users as well. Although there are some PDFs, for example https://bitcoin.org/bitcoin.pdf, that are highly cited with no HTML page substitute.
This is now merged with the changes made by #69. I also changed the content-type filtering logic and added some minor fixes.
PR dead or resuscitable?
Solves #38. Recognizes metadata when a PDF file is uploaded or a PDF URL is passed to the /web endpoint. For this to work, the recognizer-server lambda branch has to be deployed.
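As a rough illustration of the URL case, the request below posts a PDF URL to /web as plain text, which is how translation-server's /web endpoint accepts URLs in general. The base address `http://127.0.0.1:1969` and the helper names are assumptions for the sketch, and whether a PDF URL actually yields metadata depends on this PR and a deployed recognizer-server.

```javascript
// Hypothetical helper: build the /web request for a given URL.
// Assumption: translation-server is reachable at `serverBase`.
function buildWebRequest(pdfUrl, serverBase = 'http://127.0.0.1:1969') {
  return {
    url: new URL('/web', serverBase).href,
    options: {
      method: 'POST',
      headers: {'Content-Type': 'text/plain'},
      body: pdfUrl, // the URL itself is the request body
    },
  };
}

// Hypothetical helper: send the request and return the parsed JSON items.
// Requires a global fetch (browsers, Node 18+).
async function translatePdfUrl(pdfUrl, serverBase) {
  const {url, options} = buildWebRequest(pdfUrl, serverBase);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`translation-server returned ${response.status}`);
  }
  return response.json();
}
```

For example, `translatePdfUrl('https://bitcoin.org/bitcoin.pdf')` would send the PDF URL from the discussion above to a locally running server.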