Try translating PDF URLs based on URL

zotero / translation-server

A Node.js-based server to run Zotero translators

Try translating PDF URLs based on URL #70

Open dstillman opened 5 years ago

dstillman commented 5 years ago

Related to #38, but a few translators are able to function based on the URL, even when it's a PDF page. We should try to support those cases, before either trying PDF recognition (from #38) or failing (if PDF recognition isn't enabled). This includes DOIs in the URL as well as certain sites where we recognize PDF URLs (since people sometimes click "Save to Zotero" when viewing a PDF without going back to the article page). I can try to find an example of such a translator if necessary.

This might be a little tricky, because we may need to provide a fake empty document to run detect on, but we won't want to fall back to generic webpage saving.
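As a rough illustration of the "DOIs in the URL" case, here is a minimal sketch of pulling a DOI out of a URL string. The function name `extractDOIFromURL` is hypothetical and the regular expression is a common heuristic, not the pattern used by any Zotero translator. Stripping a trailing `.pdf` is also an assumption that will be wrong for the rare DOI that really ends that way.

```javascript
// Hypothetical helper (not part of translation-server): find a DOI in a URL.
// The pattern is a widely used heuristic for modern DOIs, not an exhaustive matcher.
function extractDOIFromURL(url) {
  let decoded;
  try {
    // DOIs in URLs are often percent-encoded (e.g. "%2F" for "/")
    decoded = decodeURIComponent(url);
  } catch (e) {
    // Malformed percent-encoding: fall back to the raw string
    decoded = url;
  }
  const match = decoded.match(/10\.\d{4,9}\/[^\s?#"<>]+/);
  if (!match) return null;
  // Assumption: a trailing ".pdf" or trailing punctuation is not part of the DOI
  return match[0].replace(/\.pdf$/i, "").replace(/[.,;)]+$/, "");
}

console.log(extractDOIFromURL("https://example.org/doi/pdf/10.1234/abcd.5678.pdf"));
```

A `null` result would mean the URL alone is not enough, and the server would move on to whatever it tries next.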

mrtcode commented 5 years ago

Can we wait for #59 or do we want this for the current t-s version?

So I think not only PDF but all URLs should be tried.

E.g. this doesn't work because it's returning a JSON content type.

1) If it's HTML or XML content type, it already goes through translation architecture, otherwise:
2) Create an empty document
3) Do a separate translation
4) If successful, return the translated metadata
5) If it's a PDF, upload and process it
6) If not a PDF, return invalid content type error

And we don't want to translate URLs that return an HTTP error code?
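The numbered steps can be sketched as a small planning function that decides, from the HTTP status and `Content-Type` header, which stages to attempt and in what order. Everything here is hypothetical: the step names, the `planSteps` function, and the exact set of content types treated as HTML or XML are assumptions for illustration, not identifiers from translation-server. Stopping on HTTP error statuses is also an assumption, taken from the question that closes the comment.

```javascript
// Hypothetical step names; they label the stages from the list above
// and are not identifiers from translation-server.
const STEP = {
  DOCUMENT: "translate-fetched-document",
  URL_ONLY: "translate-url-with-empty-document",
  PDF: "upload-and-recognize-pdf",
  BAD_TYPE: "error-invalid-content-type",
  HTTP_ERROR: "error-http-status",
};

// Return the ordered steps to attempt. A later step is meant to run
// only if the previous one produced no metadata.
function planSteps(status, contentTypeHeader) {
  // Assumption: error responses are never translated
  if (status >= 400) return [STEP.HTTP_ERROR];

  // Drop parameters such as "; charset=utf-8"
  const type = (contentTypeHeader || "").split(";")[0].trim().toLowerCase();

  // Step 1: HTML and XML already go through the normal translation path
  const isMarkup =
    type === "text/html" ||
    type === "text/xml" ||
    type === "application/xml" ||
    type.endsWith("+xml");
  if (isMarkup) return [STEP.DOCUMENT];

  // Steps 2-4: try URL-based translation against an empty document first,
  // then steps 5-6: fall back to PDF processing or an error
  const fallback = type === "application/pdf" ? STEP.PDF : STEP.BAD_TYPE;
  return [STEP.URL_ONLY, fallback];
}

console.log(planSteps(200, "application/pdf"));
```

Keeping the decision separate from the network and translation calls makes the ordering easy to test on its own.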

dstillman commented 5 years ago

This can wait for #59 if that's easier.

> And we don't want to translate URLs that return an HTTP error code?

I think that's right.

mrtcode commented 5 years ago

I already implemented a fix that does what is described in this issue, but it's based on #59, so it will need to wait. Another requirement is zotero/translators#1799, because the current DOI translator can't extract a DOI from a URL. For now it's better to just do #72.

phiresky commented 5 years ago

@mrtcode This probably isn't the right place to ask, but why can Zotero Connector get the actual citation from something like https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf when Translation Server can't? Also, I'm having a hard time figuring out how Zotero Connector does that at all...

mrtcode commented 5 years ago

@phiresky Zotero Connector uses the 'Neural Information Processing Systems' translator, which slices off the '.pdf' extension and extracts metadata from the web page behind this paper: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks . Technically translation-server should be capable of doing the same. I think we have to fix that. Good observation.
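To make the "slicing off the '.pdf' extension" idea concrete, here is a minimal sketch using the standard `URL` class. The function name `articleURLFromPDFURL` is hypothetical, and this is not the translator's actual code. It only shows the URL rewrite, under the assumption that the article page lives at the same path without the extension.

```javascript
// Hypothetical helper (not the translator's code): map a PDF URL to the
// article page by removing a trailing ".pdf" from the path.
function articleURLFromPDFURL(pdfURL) {
  const url = new URL(pdfURL);
  // Only rewrite when the path really ends in ".pdf"
  if (!/\.pdf$/i.test(url.pathname)) return null;
  url.pathname = url.pathname.replace(/\.pdf$/i, "");
  return url.toString();
}

console.log(
  articleURLFromPDFURL(
    "https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf"
  )
);
```

Working on `url.pathname` rather than the whole string leaves any query string or fragment untouched.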

phiresky commented 5 years ago

Thanks. Here are some more examples that work fine via Zotero but not via Translation Server:

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44321.pdf
http://sci-hub.tw/https://ieeexplore.ieee.org/abstract/document/8666636 (I guess this is a different problem?)
http://img.cs.uec.ac.jp/pub/conf17/171024ege_0.pdf

Are those the same issue?

My motivation here, by the way, is that I'm writing papers in Markdown, and I wrote a tool to transparently convert URLs to citations without having to use a reference manager: https://github.com/phiresky/pandoc-url2cite

mvolz commented 4 years ago

This has been brought up again on the email list: https://groups.google.com/forum/#!msg/zotero-dev/9AmwvQqBCBY/H57ukdE9AgAJ