zotero / translation-server

A Node.js-based server to run Zotero translators

Remove support for `multiple`? #44

Closed dstillman closed 5 years ago

dstillman commented 5 years ago

The follow-up request to a 300 for multiple results (i.e., search results, generally) needs to hit the same instance or else it gets 409 due to the session not existing.

In Lambda, we need to call another Lambda that stores the state, due to Amazon VPC restrictions. We should probably also support a config option for non-Lambda installations that takes a Redis host to store in directly.
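To make the failure mode concrete, here is a minimal sketch of two instances that each keep sessions in memory. The `Instance` class, the example URL, and the exact body shape (`url`, `session`, `items`) are assumptions for illustration, not the server's actual code.

```python
import uuid


class Instance:
    """Toy stand-in for one server instance with in-memory sessions."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session id -> items offered to the client

    def web(self, url):
        # Pretend every URL is a search page with two results.
        items = {url + "/1": "First result", url + "/2": "Second result"}
        session = uuid.uuid4().hex
        self.sessions[session] = items
        return 300, {"url": url, "session": session, "items": items}

    def select(self, body):
        # The follow-up only works where the session was created.
        offered = self.sessions.get(body["session"])
        if offered is None:
            return 409, "Session not found"
        chosen = [key for key in body["items"] if key in offered]
        return 200, chosen


a, b = Instance("a"), Instance("b")
status, body = a.web("https://example.org/search")

# The client trims the items down to its selection and sends the body back.
body["items"] = {"https://example.org/search/1": "First result"}

same_instance = a.select(body)   # session is known here
other_instance = b.select(body)  # session was never created here
```

A shared store (another Lambda, or Redis) would replace the per-instance `sessions` dict in this picture.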

dstillman commented 5 years ago

It turns out there's not a great way to support multiple results in a multi-server environment, since the session includes the actual callback from the translator, which can't be serialized. So the way things work now, the follow-up request needs to hit the same instance. For a server installation this could be done with sticky sessions on the load balancer, but that's a fairly exacting deployment requirement, and it wouldn't work on Lambda.
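As a rough analogy in Python (the server itself is Node.js), the sketch below shows why a session holding a live callback can't simply be written to a shared store. The `start_translation` function and the session layout are hypothetical.

```python
import json


def start_translation(url):
    """Return a session whose 'callback' closes over in-process state."""
    pending = {"url": url, "cookies": {"sid": "abc"}}

    def callback(selected):
        # Needs `pending`, which lives only in this process's memory.
        return [(pending["url"], key) for key in selected]

    return {"url": url, "items": {"0": "First", "1": "Second"}, "callback": callback}


session = start_translation("https://example.org/search")

try:
    json.dumps(session)
    serializable = True
except TypeError:
    # Functions have no JSON representation.
    serializable = False

# The data part alone can be stored, but it cannot resume the translator.
stored = json.dumps({k: v for k, v in session.items() if k != "callback"})
```

Dropping the callback makes the session storable, but another instance that loads it has nothing to call, which is the core of the problem.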

Most translators just return URLs or identifiers for the keys in the multiple results, and those would often be able to work without state (i.e., if we just translated the URL or identifier), though it's possible they could still rely on cookies from the initial request. But any translator that returns internal identifiers or indexes wouldn't work. In those cases, it's possible that repeating the initial search would work, but 1) that would require making duplicate requests and 2) there's no guarantee that the keys would be the same the second time.

The bigger question here is whether we need to support multiple results in translation-server at all. If you're pasting a URL into ZoteroBib, there's not really much reason to paste the URL of a search results page instead of just clicking through to the page you want to save. I assume it's the same with Citoid (@mvolz?). (And Google Scholar, one of the more popular search cases, will generally block translation-server anyway.) The main use cases I can think of are 1) offering a selection window with checkboxes to add multiple items, like we do in the client and 2) supporting pages where there's embedded metadata (say, in JSON-LD) describing multiple resources. I'm not sure either of those are that important in the contexts where this will be used (and I'm not sure we'd even want to support saving multiple items at once in the web library due to IP rate limits).

So these are our options:

1) Forget about using Lambda and use sticky sessions for multiple support
2) Use Lambda, and have a config option to return 300 when it looks like the keys are URLs or identifiers and otherwise return 501. When the client makes a selection from a 300, translate those directly. Where possible, translators could be updated to return multiple results that work without state.
3) Use Lambda, and have a config option to disable support for multiple results
4) Use Lambda, and always return 501 for multiple results
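Option 2 hinges on guessing whether the keys can be translated on their own. A minimal sketch of such a check follows; the regular expressions and the function names `looks_stateless` and `status_for_multiple` are assumptions, and a real heuristic would likely need to recognize more identifier types.

```python
import re

# Rough patterns: an http(s) URL, or a DOI-like string such as 10.1000/xyz.
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_stateless(key):
    """Guess whether a multiple key could be translated on its own."""
    return bool(URL_RE.match(key) or DOI_RE.match(key))


def status_for_multiple(items):
    """Return 300 if every key looks like a URL or DOI, else 501."""
    if items and all(looks_stateless(key) for key in items):
        return 300
    return 501
```

Keys such as `"0"` or `"1"` (indexes into the translator's own result list) would fall through to 501 here, matching the case where only the original callback knows what they mean.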

For 2–4, we could take a query parameter to force the server to translate the page as non-multiple, for cases where we want to save at least a URL that the user can deal with later.

/cc @adam3smith, @zuphilip, who might have a better idea of the places multiple is used

adam3smith commented 5 years ago

Not sure to what degree I have relevant insight, but in the hope that it's helpful, my two cents:

  1. Citoid definitely doesn't care about multiples. The way it works in the Wikipedia visual editor, they might even be confusing, so that's not a use case.
  2. For ZotBib, I think your assessment is basically correct that it's a pretty rare scenario outside of perhaps Google Scholar. One other case I can think of is the landing pages for edited books on Springer (and maybe others), where we currently import just the chapters via multiple, but people may want the book via ZotBib.
  3. Relatedly, if we get rid of multiples for translation-server, we should try to make sure that we have, where at all possible, no translators without a single item import. A recent example is Worldcat Discovery, as discussed in this thread: https://forums.zotero.org/discussion/73791/has-anyone-figured-out-a-new-worldcat-discovery-translator-yet Example catalog entry here: https://gouchercollege.on.worldcat.org/search?databaseList=&queryString=test#/oclc/223885786
  4. Finally, and most importantly imo, what does this mean for the bookmarklet? Does this require multiples from translation-server? If so, that should weigh heavily on the decision I think. Having import work well on mobile devices is quite important.

dstillman commented 5 years ago

> Finally, and most importantly imo, what does this mean for the bookmarklet? Does this require multiples from translation-server?

That's a good question, and helped to clarify our thinking here.

The answer is long and complicated, but the short version is that while we've never used server-based multiple in the bookmarklet [correction: we use it in the Amazon translator and nowhere else], we might want to start doing so, so we're going to go with:

  1. Reuse sessions if they're available, and otherwise retranslate the original URL and then call the callback, hoping that the keys haven't changed.

(We could probably optimize it further by directly translating URLs used as multiple keys instead of retranslating the original URL, but it's not guaranteed that those are directly translatable (though I imagine they almost always are). I also realized that the particulars of how Lambda is implemented probably means you'll usually end up on an instance with a cached session, so the impact of this is lower than I thought.)
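The "reuse if cached, otherwise retranslate" approach can be sketched roughly as below. This is a toy model, not the code from the actual fix: the `Server` class, the `pages` dict standing in for the web, and the body shape are all assumptions.

```python
class Server:
    """Toy instance: sessions are cached in memory only."""

    def __init__(self, pages):
        self.pages = pages   # url -> {key: title}, stands in for the web
        self.sessions = {}   # session id -> callback
        self.fetches = 0     # how many times the original URL was translated

    def _translate(self, url):
        self.fetches += 1
        items = dict(self.pages[url])

        def callback(selected):
            # Save only the keys that still exist in this run's results.
            return [items[key] for key in selected if key in items]

        return items, callback

    def web(self, url, session):
        items, callback = self._translate(url)
        self.sessions[session] = callback
        return {"url": url, "session": session, "items": items}

    def select(self, body):
        callback = self.sessions.pop(body["session"], None)
        if callback is None:
            # No cached session here: repeat the original request.
            _, callback = self._translate(body["url"])
        return callback(list(body["items"]))


pages = {"https://example.org/search": {"k1": "Paper one", "k2": "Paper two"}}
first, second = Server(pages), Server(pages)

body = first.web("https://example.org/search", session="s1")
body["items"] = {"k2": "Paper two"}

# The follow-up lands on a different instance, which retranslates.
result = second.select(body)
```

If the keys did change between the two runs, the callback in this sketch would silently return fewer items, which is the "hoping that the keys haven't changed" caveat.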

zuphilip commented 5 years ago

Another use case for the translation-server is AFAIK the online test results, although I am not sure anyone is looking at them often. However, the plan was to also use the translation-server for continuous integration in the translators repository. The test cases for multiples just contain the information that the detection has to return "multiple" and nothing about the multiple items themselves. Therefore, this use case might be simpler to still support.

The DOI web translator is also a translator that always returns multiple, even though the list may contain only one DOI.

CC @mtrojan-ub who might also have a use case for translation-server, but not sure whether multiples are important for that.

mtrojan-ub commented 5 years ago

We do have a use case as well.

We are building up a list of journals with an entry point URL for each journal. A URL points to a page with recent articles of a single journal. Then we want to scan the page every night using zotero translation server for recent articles. (Filtering articles we already downloaded in earlier runs will be done outside the server). All new articles will then be added to our library index.

(Whenever possible, the entry point URL will be an RSS feed which we will analyse outside zotero translation server and then just send the feed's contained article URLs to the server, but we would like to use the "multiple" approach with the recent articles page for all journals that do not provide an RSS feed.)
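The client-side filtering step described here might look something like the sketch below, assuming the multiple response carries an `items` mapping keyed by article URL. The function name `follow_up_body` and the body shape are hypothetical.

```python
def follow_up_body(response, seen):
    """Build the follow-up request body with only unseen article keys."""
    fresh = {k: v for k, v in response["items"].items() if k not in seen}
    if not fresh:
        return None  # nothing new tonight, so no follow-up request
    return {**response, "items": fresh}


response = {
    "url": "https://journal.example.org/recent",
    "session": "s1",
    "items": {
        "https://journal.example.org/a1": "Article 1",
        "https://journal.example.org/a2": "Article 2",
    },
}
seen = {"https://journal.example.org/a1"}

body = follow_up_body(response, seen)
```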

@zuphilip, thanks for calling my attention to this issue!

dstillman commented 5 years ago

Multi-server select is fixed by https://github.com/zotero/translation-server/commit/1b5d429183c01481c5b76f28f4089215670c2cb2