zotero / translators

Zotero Translators
http://www.zotero.org/support/dev/translators
1.23k stars 747 forks source link

Update translators for XHR processDocuments() #1441

Open dstillman opened 6 years ago

dstillman commented 6 years ago

Now that XHR processDocuments() has landed, we should start looking out for translators that need to be updated to work with the new method. This is a tracking bug for those issues, which can reference this issue.

To review, any content that's generated client-side will be unavailable in the document object provided by processDocuments(), which will now show only the initial page structure that came over the wire. However, the fact that client-side JavaScript can generate the page means that we can almost certainly do so as well, and possibly in a more reliable, structured way than we were doing before with page scraping (and definitely much more quickly than waiting for the page to load in a hidden browser). The best way to debug these sorts of pages is to use the Network tab of the browser devtools and monitor API requests that the page is making. In most cases we should be able to duplicate those requests (which hopefully are cached, avoiding additional network requests) and use the same data the page is using. In some cases, pages will instead embed JSON in the page itself, which we can also extract.

Note that the signature of processDocuments() has changed (though the old signature will still work for existing translators). Instead of done and exception callbacks, it now returns a promise. The processor callback can now return a promise as well, which will be waited for before moving onto the next URL. Instead of using doGet()/doPost()/etc., translators should use request(), which returns a promise and properly propagates errors. While I haven't yet tested this, it should now be possible to use async/await in translators, which can make working with promises much easier. If anyone knows of a translator affected by this change, I'm happy to work up an example update to serve as a template (and to verify that everything is actually working as expected).

Also note that this only affects processDocuments(), so current-page saves still have access to generated content. Whether it makes sense to use that directly when available or to keep the multiple code consistent might vary by site (perhaps by whether the API requests are actually cached).

The new processDocuments() is available now in the 5.0 Beta, so it can be tested with Scaffold. It hasn't been rolled out to the connector yet, but it should work the same there. translation-server has always used XHR for this, so any multiple results that fail there might point to a problem (though I'm seeing some timeouts in the current test results that might point to a separate problem).

dstillman commented 6 years ago

Version 5.0.25 of the connector is now out with this change, so we should keep an eye out for multiple breakage on sites that generate content client-side.