dstillman opened this issue 5 years ago (status: Open)
Can we wait for #59, or do we want this for the current translation-server version?
So I think all URLs should be tried, not only PDFs.
I.e. this doesn't work because it's returning a JSON content type.
1) If it's an HTML or XML content type, it already goes through the translation architecture; otherwise:
2) Create an empty document
3) Do a separate translation
4) If successful, return the translated metadata
5) If it's a PDF, upload and process it
6) If not a PDF, return invalid content type error
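The steps above can be sketched roughly as follows. This is only an illustration in Python, not the translation-server code: `handle_url`, `InvalidContentType`, and the three callbacks (`translate_document`, `translate_url_only`, `recognize_pdf`) are hypothetical names standing in for the existing translation path, a URL-only translation against an empty document, and PDF upload/recognition.

```python
from typing import Callable, Optional

# Content types that already go through the normal translation path (step 1).
HTML_XML_TYPES = {"text/html", "application/xhtml+xml", "application/xml", "text/xml"}


class InvalidContentType(Exception):
    """Raised when no URL-based translator matches and the body is not a PDF."""


def base_type(content_type: str) -> str:
    # Drop parameters such as "; charset=utf-8" and normalise case.
    return content_type.split(";", 1)[0].strip().lower()


def handle_url(
    url: str,
    content_type: str,
    translate_document: Callable[[str], list],            # hypothetical: normal HTML/XML path
    translate_url_only: Callable[[str], Optional[list]],  # hypothetical: empty-document translation
    recognize_pdf: Callable[[str], list],                 # hypothetical: PDF upload and processing
) -> list:
    kind = base_type(content_type)

    # Step 1: HTML/XML already goes through the translation architecture.
    if kind in HTML_XML_TYPES:
        return translate_document(url)

    # Steps 2-4: translate against an empty document, using only the URL.
    items = translate_url_only(url)
    if items:
        return items

    # Step 5: fall back to PDF processing when the response is a PDF.
    if kind == "application/pdf":
        return recognize_pdf(url)

    # Step 6: anything else is an invalid content type.
    raise InvalidContentType(kind)
```

With this ordering, a URL that returns JSON still gets a chance at URL-based translation before the invalid content type error is raised.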
And we don't want to translate URLs that return an HTTP error code?
This can wait for #59 if that's easier.
> And we don't want to translate URLs that return an HTTP error code?
I think that's right.
I already implemented a fix that does what is described in this issue, but it's based on #59, so it will need to wait. Another requirement is zotero/translators#1799, because the current DOI translator can't extract a DOI from a URL. For now it's better to just do #72.
@mrtcode This probably isn't the right place to ask, but why can Zotero Connector get the actual citation from something like https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf when Translation Server can't? Also, I'm having a hard time figuring out how Zotero Connector does that at all...
@phiresky Zotero Connector uses the 'Neural Information Processing Systems' translator, which slices off the '.pdf' extension and extracts metadata from the web page behind this paper: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks . Technically, translation-server should be capable of doing the same. I think we have to fix that. Good observation.
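For illustration, the URL rewrite described here amounts to something like the following. This is a Python sketch of the idea only, not the translator's actual code (translators are written in JavaScript), and `article_url_for_pdf` is a made-up name.

```python
from urllib.parse import urlsplit, urlunsplit


def article_url_for_pdf(url: str) -> str:
    """Return the URL with a trailing '.pdf' removed from its path, if present."""
    parts = urlsplit(url)
    path = parts.path
    if path.lower().endswith(".pdf"):
        path = path[: -len(".pdf")]
    # Query string and fragment are kept unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

Applied to the PDF link above, this yields the article page URL that the translator then reads metadata from.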
Thanks. Here are some more examples that work fine via Zotero but not via Translation Server:
Are those the same issue?
My motivation here by the way is that I'm writing papers in markdown and I wrote a tool to transparently convert URLs to citations without having to use a reference manager: https://github.com/phiresky/pandoc-url2cite
This has been brought up again on the email list: https://groups.google.com/forum/#!msg/zotero-dev/9AmwvQqBCBY/H57ukdE9AgAJ
Related to #38, but a few translators are able to function based on the URL, even when it's a PDF page. We should try to support those cases, before either trying PDF recognition (from #38) or failing (if PDF recognition isn't enabled). This includes DOIs in the URL as well as certain sites where we recognize PDF URLs (since people sometimes click "Save to Zotero" when viewing a PDF without going back to the article page). I can try to find an example of such a translator if necessary.
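As a rough sketch of the "DOI in the URL" case, something like this could pull a DOI candidate out of a URL path before any PDF recognition is attempted. It is a heuristic written in Python for illustration; `doi_from_url` and the pattern are assumptions, not what the DOI translator does, and a URL with extra path segments after the DOI would need more care.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlsplit

# Loose heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def doi_from_url(url: str) -> Optional[str]:
    """Return a DOI-looking string found in the URL path, or None."""
    # Decode percent-escapes so "10.1234%2Fabcd" is seen as "10.1234/abcd".
    path = unquote(urlsplit(url).path)
    match = DOI_PATTERN.search(path)
    if match is None:
        return None
    doi = match.group(0)
    # PDF links often append a file extension to the DOI.
    if doi.lower().endswith(".pdf"):
        doi = doi[: -len(".pdf")]
    return doi.rstrip("/")
```

For a made-up URL such as `https://example.org/content/10.1234/abcd.5678.pdf`, this returns `10.1234/abcd.5678`, which could then be handed to a DOI lookup.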
This might be a little tricky, because we may need to provide a fake empty document to run detect on, but we won't want to fall back to generic webpage saving.
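One way to picture the "no generic fallback" requirement: after running detect on the empty document, drop any generic webpage translator from the results and treat an empty remainder as "no match". The Python sketch below assumes detect returns translator IDs in priority order; `pick_url_translator` and the `"generic-webpage"` ID are placeholders, not real translator IDs.

```python
from typing import AbstractSet, Optional, Sequence

# Placeholder ID for the generic webpage translator (not a real translator ID).
GENERIC_WEBPAGE_IDS = frozenset({"generic-webpage"})


def pick_url_translator(
    detected_ids: Sequence[str],
    generic_ids: AbstractSet[str] = GENERIC_WEBPAGE_IDS,
) -> Optional[str]:
    """Return the first detected translator that is not a generic fallback."""
    for translator_id in detected_ids:
        if translator_id not in generic_ids:
            return translator_id
    # Only generic translators matched (or none at all): report no match,
    # so the caller can try PDF recognition or fail instead.
    return None
```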