Generic webpage translator #1092

Open dstillman opened 8 years ago

dstillman commented 8 years ago

As suggested on https://github.com/zotero/translation-server/issues/32, and further bolstered by https://github.com/zotero/zotero/issues/1059, we should create a translator that saves the basic data (title, URL, access date) on all webpages.

Some follow-up work will be needed in the client to show the gray icon for this translator ID, and probably some other things.

To allow this to be rolled out to 4.0 clients without causing trouble, we should figure out a way to return a value from detectWeb only in 5.0. Not sure if we make the Zotero version available now, but if we want to avoid that (e.g., for other consumers of translators), we could do some sort of feature detection.

(Ideally we could just use a minVersion here, but as far as I know the client won't ignore translators with later minVersions when running detection, which would seem to make a lot of sense.)

zuphilip commented 8 years ago

How about the other idea to extend the EM translator for this case? It looks to me like the function addLowQualityMetadata in the EM translator is similar to what you want to achieve. Thus, it might already be enough to extend detectWeb in EM to always return webpage as a last resort.
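
A minimal sketch of that idea, assuming the existing detection logic is factored into a helper (the helper name here is hypothetical):

function detectWeb(doc, url) {
    // ...existing embedded-metadata detection...
    var itemType = detectFromEmbeddedMetadata(doc, url); // hypothetical helper wrapping the current logic
    if (itemType) {
        return itemType;
    }
    // new last-resort case: any HTML document can at least be saved as a webpage
    return "webpage";
}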

adam3smith commented 8 years ago

+1 to @zuphilip 's question. Also, I don't understand why this must be limited to Zotero 5+ -- what am I missing?

dstillman commented 8 years ago

How about the other idea to extend the EM translator for this case?

It's what I say in the other thread: "even in the single-save-button era I still think there's value in setting different expectations for EM and <title/>".

I don't understand why this must be limited to Zotero 5+ -- what am I missing?

Without client changes, the color icon would appear on every page, even for title/URL/accessDate, and there'd be a confusingly redundant set of options in the context menu (which hard-codes web-page saving right now).

zuphilip commented 8 years ago

It's what I say in the other thread: "even in the single-save-button era I still think there's value in setting different expectations for EM and <title/>".

Well, you explained that the different colors serve some purpose. But if the EM translator can extract some data on a page (and it will also save a snapshot of that page), then I cannot think of any use for a lower-quality website translator on this page. I guess we could also somehow color the icon for EM differently when we are in a low-quality-data case, maybe just when detectWeb returns webpage.

dstillman commented 8 years ago

We can just have the generic translator not show up in the context menu when the EM translator triggers. The point is that we can't distinguish between EM and generic data within a single translator, so they have to be separate.

dstillman commented 8 years ago

It's true that the stuff in addLowQualityMetadata blurs the line here a little bit — I didn't realize the EM translator used keywords and description and even tried to do byline author extraction. It's a little odd to do those things when there happen to be other metadata tags but not do them for generic webpage saving, when those things aren't really related to the presence of the more complex metadata. On the other hand, it's possible that site authors are more likely to populate even the very basic meta tags like keywords and description with better data when they also have more complex metadata, whereas in the absence of more complex tags those basic tags might be very low quality (spammy, ignorant SEO stuff).

So, some options:

1) Add a generic translator but keep it limited to what we do now (title, URL, access date), and keep the gray/color distinction. (A minimal sketch of such a translator follows this list.)

2) Copy some of that logic — stuff we extract from the page in EM but that doesn't alone trigger EM detection (description, keywords) — to the generic translator and keep the gray/color distinction. Some generic pages would start having (potentially very low quality) tags and abstracts.

3) Trigger EM on all pages and show the gray icon on all pages that return 'webpage'. Despite the gray icon, some saved pages might include very high quality metadata.

4) Trigger EM on all pages and show the blue icon on all pages that return 'webpage' (so no more gray icon anywhere, except maybe on non-HTML documents). Despite the blue icon, some saved pages might include nothing other than title/URL/accessDate.
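
For reference, here's a minimal sketch of what the option (1) translator could look like (the access date is normally filled in automatically on save, so only title and URL are set here):

function detectWeb(doc, url) {
    return "webpage";
}

function doWeb(doc, url) {
    var item = new Zotero.Item("webpage");
    item.title = doc.title;
    item.url = url;
    // accessDate is set automatically when the item is saved
    item.complete();
}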

dstillman commented 8 years ago

@simonster points out that the webpage item type doesn't have a lot of metadata available anyway, and the quality is often bad even when EM detects (e.g., here on GitHub). Even with EM, we're pretty much talking author and date at best. So this seems like a decent argument in favor of (3).

Here's I think what (3) would involve:

1) Renaming "Embedded Metadata" to "Webpage"

2) Changing init in EM to return 'webpage' as fallback in all cases

3) Changing the client to show a gray icon for website, at least from EM. Not sure if we would show gray for non-EM translators that return website. I was inclined to say yes, since it highlights that we have specific support for a site/platform, but in an ideal world every site would just embed metadata, and then we'd be left with the same inherently limited website data

4) Showing a "without snapshot" option in the context menu for this translator, or perhaps rethinking how we handle the "[with/without] snapshot" context menu options in general

One potential future complication: when we support JSON-LD, and specifically multiple JSON-LD blocks on the page, the translator would return multiple, which would remove the webpage option. This probably isn't a huge deal, but it would mean that there wouldn't be a good way of saving a straight webpage in those cases. But this seems sufficiently outweighed by all the other benefits here (e.g., to get webpage saving for free in the bookmarklet and translation server).

avram commented 8 years ago

Since EM would still be able to detect non-webpage content, I'm not sure renaming it to Webpage makes sense.

dstillman commented 8 years ago

Hmm. That's fair, though it's the translator name, so it's saying that it's saving using the Webpage translator (i.e., extracting generic data from the webpage), not that it's saving as a webpage (which the icon indicates). But maybe overly confusing. "Embedded Metadata" is a bit technical to show on all pages, though. Best option might be to just not show anything in parentheses for this translator, since it'll be the default saving mode (and the default icon as well).

zuphilip commented 8 years ago

I think your option 3) is a good choice!

@simonster points out that the webpage item type doesn't have a lot of metadata available anyway, and the quality is often bad even when EM detects (e.g., here on GitHub).

Yes, I agree with that. Therefore, it IMO makes sense to show the gray icon for this low-quality data, which is maybe just useful enough to save some URLs for later reading (bookmark functionality), but usually one has to cite more reliable sources than plain webpages.

One potential future complication: when we support JSON-LD, and specifically multiple JSON-LD blocks on the page, the translator would return multiple, which would remove the webpage option.

Well, we can see this more clearly once we have some idea of a JSON-LD translator. In general I think it is a good idea to be able to use Zotero as a bookmarking tool as well, so any handy one-click option to capture the website (as one item) is appreciated.

Best option might be to just not show anything in parentheses for this translator, since it'll be the default saving mode (and the default icon as well).

I.e. simply Save to Zotero (with snapshot) and Save to Zotero (without snapshot). That is a good idea. Alternatively, we could think about names like Save to Zotero using "Website Data", Save to Zotero using "Web Data", Save to Zotero using "Generic", or Save to Zotero using "Default".

dstillman commented 7 years ago

See also https://github.com/zotero/translators/issues/686, which suggests that DOI should go in this too. https://github.com/zotero/zotero/issues/1110 is an interesting test case.

adomasven commented 6 years ago

So I'll be working on this, as per @dstillman's comment

  1. Renaming "Embedded Metadata" to "Webpage"

  2. Changing init in EM to return 'webpage' as fallback in all cases

  3. Changing the client to show a gray icon for website, at least from EM. Not sure if we would show gray for non-EM translators that return website. I was inclined to say yes, since it highlights that we have specific support for a site/platform, but in an ideal world every site would just embed metadata, and then we'd be left with the same inherently limited website data

  4. Showing a "without snapshot" option in the context menu for this translator, or perhaps rethinking how we handle the "[with/without] snapshot" context menu options in general

noting the following:

  1. Let's keep the name, but in the connector display "Save to Zotero" without translator name. The name is more descriptive for translator creators and users never have to see "Embedded Metadata" if it's the default translator.

  2. Add some additional handling code within the connector to allow saving with and without snapshot for the EM translator.

There have been suggestions to incorporate COinS and DOI into EM, but I would like to leave that up to someone else, as there are additional considerations, like what happens with the translators (if any) that use both COinS and EM for initial metadata.

adomasven commented 6 years ago

Ok, so a problem with the above approach is that if EM always returns at least webpage, then it will always overshadow the DOI translator. We could change the priority of EM back to 400 (see discussion), but it was moved above DOI for a reason. Which means that incorporating DOI into EM is inevitable.

I understand that we always return multiple for DOI translation. Is it only to verify the data or does DOI translation sometimes genuinely have multiple items? Any suggestion on how/whether this could be reasonably handled?

adomasven commented 6 years ago

Yep, so at least for some of the DOI test cases the select dialog contains multiple entries, with only one of them corresponding to the actual article being saved. Potential options:

  1. Display a select dialog for saves that include DOI and ask the user to select the relevant entry, if any, but that is crude and potentially confusing. Like a twisted captcha for translation. We might want to disable DOI translation for translation-server.
  2. Keep the DOI translator separate with a lower priority. For pages with DOIs present users would have to manually select translation with DOI from the context menu.
adam3smith commented 6 years ago

I think 2. is the way to go. The cases where you do want to use DOIs as multiples are often for fairly sophisticated use (e.g. importing all references from an article you're looking at in html) -- but as that example shows, it's also a really useful feature.

zuphilip commented 6 years ago

Agree with @adam3smith, and the example http://libguides.csuchico.edu/citingbusiness shows that we are already preferring EM over DOI in "sparse" cases. (Technically, I guess it would also be possible to call the DOI translator from the EM translator when this case happens, but that might be more fragile code...)

adam3smith commented 6 years ago

I thought that was the idea of combining? For single-DOI cases, call DOI in EM with some heuristic for making sure we're looking at the same item, then merge data. Same for COinS, which can also have multiples.

dstillman commented 6 years ago

I'm a bit confused about the argument for (2). DOI being the only available translator is fairly common, so we wouldn't want to start preferring a generic webpage in that case. Even if we kept it separate for multi-DOI cases but integrated it into EM for single-DOI cases, a search results page with multiple DOIs and no other real metadata would start offering a generic webpage as the main option, which is worse than the current behavior. I think the only real solution is to integrate DOI (and COinS, and JSON-LD eventually) into EM and decide what to do based on what's available.

So this is a bit radical, but working through some scenarios and optimal behavior, it seems we need to allow a single translator to provide multiple save options. This is how the EM translator could pass webpage options, including snapshot/no-snapshot, with or without color and before or after its other save options as appropriate. (We could still alter the display order of the snapshot options based on the client pref, but we wouldn't need to do most other special-casing for the EM translator.) There are also various scenarios where the EM translator could intelligently decide which options to offer, whereas relying on multiple translators based on priority is much more limited, would result in redundant, confusing, inferior secondary options (e.g., a "DOI" menu option that only used CrossRef when the save button was already combining data from the page and from CrossRef), and would require special-casing for the placement of various options (e.g., putting the generic webpage options last).

We could allow returning an object (instead of a string) to specify a different label, including an empty one, which, among other things, would avoid the need to special-case the EM translator to remove the label and let us instead intelligently label based on how it was actually doing the save (since "DOI" or "Embedded Metadata" or "COinS" would sometimes be nice to show).

Finally, this could obviate the need for various translator hidden preferences and make those options much more accessible (e.g., Nature Publishing Group (with supplemental data)).

So for EM, detectWeb() could return an array like this:

[
  'journalArticle',
  {
    label: 'DOI',
    icon: 'multiple'
  },
  {
    label: 'Web Page with Snapshot',
    icon: 'webpageGray'
  },
  {
    label: 'Web Page without Snapshot',
    icon: 'webpageGray'
  }
]

which would result in a button with Save to Zotero (Embedded Metadata) and a journal article icon and a menu with Save to Zotero (Embedded Metadata)/journalArticle, Save to Zotero (DOI)/multiple, and two gray webpage options.

doWeb() would be called with the chosen index, including for the snapshot options.

With that in mind, some example scenarios:

Page has a non-generic translator, embedded metadata for non-webpage, no DOIs

Item type icon via non-generic translator, EM item type in menu, EM gray webpage options in menu

Page has a non-generic translator, embedded metadata for webpage, no DOIs

Item type icon via non-generic translator, EM color webpage options in menu

Page has single non-webpage embedded metadata and multiple DOIs

Item type icon, DOI selection in menu, gray webpage options in menu — all from the EM translator. As a single translator, doWeb() could resolve the first DOI and combine metadata from EM and CrossRef.

Page has no embedded metadata but multiple DOIs

Folder icon via EM translator, gray webpage options in menu. In doWeb(), resolve first DOI, if DOI seems to match page, just treat as regular DOI list. Otherwise, first entry in select dialog is current page using generic info (title, URL, access date) and DOI for the rest of results. As a single translator, it allows saving of the generic page info (potential improvement over status quo) and avoids showing a gray webpage icon even though there might be a DOI for the main item on the page (which would be a regression from status quo).

Page has single-item embedded metadata that returns webpage and one DOI

Folder icon via EM translator, color webpage options in menu. In doWeb(), resolve DOI, if DOI seems to match embedded metadata, combine (which probably means using only CrossRef). Otherwise display select list with first entry from embedded metadata and resolved DOI as second entry. (For the first case, a little weird to save straight from a folder, but why show two entries when we know one is worse and why show one entry if we're sure it matches the current page?) As a single translator, it avoids saving a webpage item when there's better metadata available as DOI, which is an improvement from current behavior where EM translator is prioritized over DOI.

Page has single-item embedded metadata that returns something other than webpage and one DOI

Same as previous, but optimistically show an item type icon from the embedded metadata. Combining metadata (when resolved DOI matches embedded metadata) might just mean adding an abstract from the embedded metadata to supplement CrossRef data.

Page has single-item embedded metadata that returns webpage and no DOIs

Color webpage icon, color webpage options (snapshot/no-snapshot) in menu, no gray options

dstillman commented 6 years ago

Another thing we could do: ISBN detection that only ever showed as a folder in the menu and was never offered as a primary method, for the reasons @adam3smith explains in that thread.

adam3smith commented 6 years ago

I'm convinced by that rundown. The only one that's a bit wonky (no metadata, multiple DOIs) is a bit weird currently, too, and the proposed solution is a slight improvement. COinS should likely work exactly the same way.

adomasven commented 6 years ago

In doWeb(), resolve DOI, if DOI seems to match embedded metadata, combine (which probably means using only CrossRef). Otherwise display select list with first entry from embedded metadata and resolved DOI as second entry.

Do you have any suggestions for how the "seems to match" check would be performed in JS, considering we only have very low quality metadata before the DOI lookup? Some sort of fuzzy matching is needed, but this would mean involving a third-party library, and showing false positives first would be a rather bad experience.


In general, one of the reasons we wanted a generic translator (and why I specifically decided to work on this now) was to remove special-casing in the Zotero, connector, and translation-server codebases for pages that lack translators, and to leverage the existing code to provide generic saving in all instances. However, the plan outlined above actually runs counter to at least the simplification goal; it will take a non-trivial amount of time and effort to implement and roll out within the translators and translate software, and it will make translation-server client handling more complicated too. Having the above working would be great, but I wouldn't want to commit myself to a change this big.

Having said that, I propose a less elegant and efficient, but much simpler solution:

If there are DOIs present, EM will not overshadow the DOI translator; otherwise it will take over. If both rich EM data and a DOI are present, then both can coexist. This way we can avoid any changes or special translation handling within Zotero and translation-server and have a translator for every page. It sacrifices code clarity in the intermediate term, but it's a workable solution for the short term until someone has the time and spirit to commit to the bigger change.
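
A rough sketch of that interim rule inside EM's detectWeb (the helper name and the DOI check are assumptions about how it might be implemented):

function detectWeb(doc, url) {
    var itemType = detectFromMetaTags(doc, url); // hypothetical helper for the existing EM detection
    if (itemType) {
        return itemType; // rich embedded metadata behaves as before
    }
    // only fall back to a generic webpage when no DOI is visible on the page,
    // so the DOI translator is not overshadowed
    var doiPattern = /10\.\d{4,}\/[^\s]+/;
    if (!doiPattern.test(doc.body.textContent)) {
        return "webpage";
    }
    return false;
}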

dstillman commented 6 years ago

I think yours is a good interim plan. Mine will let us remove almost all special-casing and also provide better results, but it will definitely take some work to get there, and might make sense as part of a larger reworking of the translator architecture (e.g., to use promises everywhere). I'll probably work on that at some point.

webpage(Gray?)

webpageGray doesn't exist now, so we'd have to add that, but as long as we're still special-casing, we can just use a gray icon whenever the EM translator returns webpage or undefined, and then we wouldn't risk problems in translation-server or elsewhere before we add proper support for webpageGray. I think we can still use the color icon for non-EM webpage results — even though, as noted above, metadata for webpage is really limited, it's probably worthwhile to show that we're doing something site-specific and that the EM options are still available in the menu.

So the user-facing changes here will be that 1) you'll see the blue webpage icon much less often and 2) the gray icon and webpage menu options will start showing more data. And translation-server will be able to save all webpages.

mrtcode commented 5 years ago

I was initially looking at how we could utilize the linked alternative metadata sources in the EM translator (#77), and even checked all URLs from translator tests. Almost none of them link to any MODS or MARCXML metadata, which means that's a very rare thing. But of course we should add this when we start reworking the EM translator.

What is orders of magnitude more important, in my opinion, is a generic-translator fallback to DOI. There are many web pages where generic translators produce very poor metadata (especially the EM translator, but COinS and Open Journal Systems are guilty too). And that seems to be the case especially for lesser-known, less maintained, often non-English web pages.

Translating all those URLs results in poor metadata, though they could return nice Crossref metadata instead:

The current DOI translator isn't performing at its best either. Firstly, as you know, for most websites we could immediately return the current article's metadata instead of showing a selection dialog. And secondly, if there are many DOIs on a page, they are resolved and presented in random order, even if the article's DOI was the first on the page. It would also be useful to utilize the Crossref REST translator, because it can select the required fields and fetch multiple DOIs in one query, instead of bombarding Crossref with a request for each DOI.

So the steps would look like this:

  1. Check whether the metadata produced by a generic translator is incomplete, i.e. missing authors, missing basic fields, etc. (see the sketch after this list)
  2. Extract all DOIs from the page
  3. Retrieve metadata for all DOIs
  4. Try to automatically match which DOI belongs to the current article; otherwise just show a selection dialog as we do already with the DOI translator
  5. Maybe try to combine generic-translator metadata with DOI metadata, but in my experience generic-translator metadata is either very good or very bad.
But the main question is how we would combine those generic translators.

dstillman commented 5 years ago

Re: #77, it's rare now, but the goal would be for that to become a standard, trivial option we can recommend to sites that want to expose metadata, replacing unAPI (which is basically defunct).

Retrieve metadata for all DOIs

I think we'd want to optimistically retrieve metadata for just the first DOI, in the hope that it matched, before retrieving all the others. (And we'd do this only on doWeb, so we'd have to decide what to show before that. E.g., if there's no other translator or metadata and we detect one or more DOIs in the page content, we would need to show either the folder icon even though we might save an item automatically or show journal article even though we might open Select Items. I think the former is better.)

Fixing the order for DOIs in Select Items certainly sounds good. (I had no idea those weren't in page order.)

With a combined translator and auto DOI matching, we'll also need a translation flag to force use of Select Items for DOIs even when one matches the page, so that we can still offer a DOI option in the context menu.

For the question of combining metadata, I think we basically want to get metadata from available identifiers and then add in anything we can from the page, with the assumption that any field that actually comes from the identifier is more likely to be correct. @adam3smith and @zuphilip may have counterexamples, but I think the main reason EM had priority over DOI was because of the multiple issue, and if we're resolving the first DOI to try to match the page, that stops being an issue and we can just use the DOI metadata as the foundation.

I think the most common cases will be supplementing abstracts, creators, and tags, and obviously attachments. If there are other fields from the page that aren't in the identifier metadata…not sure.
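
A sketch of that combination step, assuming one item from the identifier (e.g. Crossref via DOI) and one from the page; the field choices are illustrative:

function mergeItems(identifierItem, pageItem) {
    // identifier metadata is the foundation; page metadata only fills gaps
    var merged = Object.assign({}, identifierItem);
    if (!merged.abstractNote) merged.abstractNote = pageItem.abstractNote;
    if (!merged.creators || !merged.creators.length) merged.creators = pageItem.creators || [];
    if (!merged.tags || !merged.tags.length) merged.tags = pageItem.tags || [];
    // attachments (snapshot, PDF) generally come from the page
    merged.attachments = pageItem.attachments || [];
    return merged;
}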

See also #686, which preceded this.

mrtcode commented 5 years ago

Actually, it seems that the EM translator is used as a base for 105 other translators, which are used for the most important journals. That makes me think we should go in the opposite direction with the EM translator. If we are using it as a dependency for other translators, it should be as simple and predictable as possible. Just like a Node.js library: you want to focus it on a specific task, not run logic for your whole application.

So, I would suggest adding another translator that would do all the smart logic. Let's call it for now the ultimate translator. It would be called when site-specific translators fail. That translator wouldn't do any metadata extraction on its own; instead it would call all the other generic translators (EM, COinS, DOI, unAPI, etc.) and intelligently combine metadata while also deciding whether the result is single or multiple items.

Well, there are a few things that the ultimate translator could extract on its own:

It's probably not worth merging COinS into EM, because the former can return multiple results while the latter returns only a single item.

The DOI translator can probably still exist as a standalone translator, to keep the option of using it separately.

And generally, site-specific translators should be developed in such a way that if they can't confidently extract metadata, or some important fields are missing, the translator fails and gives way to the ultimate translator.

So here are the updated steps for how the ultimate translator could work (a rough sketch follows the list):

  1. Get metadata for the first DOI and try to match the article
  2. If not then try to get metadata for all other DOIs and try to match the article
  3. Get metadata with EM
  4. Get metadata with COinS
  5. Get metadata with unAPI
  6. Do low quality metadata extraction
  7. Combine metadata from all sources
  8. Decide if the result is single or multiple items
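
A rough sketch of that flow (the helper names are placeholders, the callback plumbing for calling child translators is omitted, and the matching/combining functions are assumed to exist):

function doWeb(doc, url) {
    getDOIItems(doc, url, function (doiItems) {            // steps 1-2: DOI lookup + article matching
        getEMItem(doc, url, function (emItem) {            // step 3: Embedded Metadata
            getCOinSItems(doc, function (coinsItems) {     // step 4: COinS
                getUnAPIItems(doc, function (unapiItems) { // step 5: unAPI
                    var lowQuality = extractLowQualityMetadata(doc); // step 6
                    // step 7: combine metadata from all sources; step 8: the helper
                    // decides whether to complete one item or hand a list to selectItems
                    var items = combineMetadata(doiItems, emItem, coinsItems, unapiItems, lowQuality);
                    items.forEach(function (item) { item.complete(); });
                });
            });
        });
    });
}
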
mrtcode commented 5 years ago

Also I am quite concerned about translators like Open Journal Systems that just wrap EM translator and sell this as a good thing. E.g., https://journals.sfu.ca/flr/index.php/journal/article/view/137 returns very poor metadata and doesn't give way to other translators.

adam3smith commented 5 years ago

but I think the main reason EM had priority over DOI was because of the multiple issue

Mostly, though EM has other advantages, most importantly the ability to attach PDFs (slightly less important now bc of unpaywall integration, but still a real advantage). But by wrapping the various generic translators, we still get the best of both worlds, so this seems moot.

So, I would suggest adding another translator that would do all the smart logic. Let's call it for now the ultimate translator.

Apart from the name ;) I like this idea. Adding all of the logic into a translator that then gets called by other translators is indeed risky.

Also I am quite concerned about translators like Open Journal Systems that just wrap EM translator and sell this as a good thing

The translator performs a number of other improvements, including trying to get at the abstract, alternate modes for getting the PDF, and import from the PDF page. OJS standard installations include pretty good metadata in the header (it's not clear to me why that journal doesn't), so I think overall it's worth it.

dstillman commented 5 years ago

So, I would suggest adding another translator that would do all the smart logic.

How would we deal with that OJS example, then? If lots of translators use EM but also do things like get the PDF (which doesn't happen here but could), they can't just fail on low-quality metadata (and we wouldn't want to have to code that into each translator anyway).

One option might be some threshold of data quality where we run the combined translator even after a site-specific translator is successful, separate from the usual fallback logic, and try to reconcile the data.

mrtcode commented 5 years ago

OJS isn't a site-specific translator. And there aren't many translators like this that use EM and aren't site-specific. Site-specific translators that use EM seem to be OK, maybe because they are more tightly adjusted to a specific site, and if something is wrong with the page they fail anyway because, for example, they can't extract the required HTML tags. And we can easily check site-specific translators' tests, which is not possible for non-site-specific translators.

But setting some cutoff threshold for the EM translator is an option too. I.e. if even authors are missing, what can we expect from that metadata? Or maybe it's enough to just make those few translators stricter.

To run the combined translator (OK, it's a better name than 'ultimate') even if the previous translators succeeded, we would need to modify the code that handles translators. We can consider this, but in practice it shouldn't be necessary.

dstillman commented 5 years ago

But my point is that making even a small number of translators (or the EM translator in general) more strict to force fallback isn't the right fix, because we still want anything else that the site-specific (by which I just mean non-generic, not that it's tailored to a specific site) translators can provide, which could include a PDF, tags, etc. (In this case it doesn't for some reason, but it could.) (Granted, if this one fell back to DOI, that would also get the PDF, because it's OA, but this problem could happen with a gated journal too.)

So the only real way to fix this would be by inspecting the data in the translator framework and running the combined translator if necessary (and the data came from a higher-priority translator). (OK, technically we could probably modify all such translators to do an explicit fallback to the combined translator, but that's not realistic, and this should be a general fix.)

adam3smith commented 5 years ago

I honestly think this is fairly rare and we can handle this individually in the respective translators: We should just be mindful of this issue for multi-domain translators using EM and either a) make detect stricter or b) where we have elements that we really want to capture, include logic in the translator itself that falls back on the combined translator if needed (maybe that's what Dan was referring to above)

dstillman commented 5 years ago

From the forums, here's an example where DOI gets better metadata overall but EM gets the PDF, Abstract, and a better date: http://www.sdewes.org/jsdewes/pid6.0223

zuphilip commented 5 years ago

Actually, it seems that EM translator is used as a base for 105 other translators. Which are used for the most important journals.

It might be useful to look closer at these dependent translators: https://github.com/zotero/translators/search?q=951c027d-74ac-47d4-a107-9c3069ab7b48&unscoped_q=951c027d-74ac-47d4-a107-9c3069ab7b48 . I just clicked on a few, and two things become clear (I also suggest looking closely at all these dependent translators, not just a sample):

Thus, EM is currently used as a generic way of extracting bibliographic data from websites (possibly tweaked a little by a specific translator).

If we are using it as a dependency for other translators, it should be as simple and predictable as possible. [...] So, I would suggest adding another translator that would do all the smart logic. Let's call it for now the ultimate translator. It would be called when site-specific translators fail. That translator wouldn't do any metadata extraction on its own; instead it would call all the other generic translators (EM, COinS, DOI, unAPI, etc.) and intelligently combine metadata while also deciding whether the result is single or multiple items.

No, actually I would expect in your scenario that most/some of the dependent translators would then need to be based on this new ultimate/merging translator. This would be then the same as the current situation, but with another intermediate step.

Get metadata for the first DOI and try to match the article

Okay, that sounds fine and can possibly fix some currently problematic cases.

E.g., https://journals.sfu.ca/flr/index.php/journal/article/view/137 returns very poor metadata and doesn't give way to other translators.

This OJS instance is not giving much more machine-readable/guessable information, even though OJS generally makes this very easy: there are mandatory plugins for OpenURL, Dublin Core, and MODS, and I guess they just have not enabled the Dublin Core Indexing Plugin.


One main drawback of EM, IMO, is currently the lack of JSON-LD and other variants of schema.org. I tried to work on these, but my time currently does not permit continuing here...

As for the order given above:

mrtcode commented 5 years ago

No, actually I would expect in your scenario that most/some of the dependent translators would then need to be based on this new ultimate/merging translator. This would be then the same as the current situation, but with another intermediate step.

But why would we want to base other translators on this new combined translator? I think the combined translator should only be used when a site-specific translator fails, and never used as a dependency (except maybe in that case with multi-domain OJS). I imagine it would be different from what we regularly call translators; it would be more like logic that decides what to do with the web page if there is no site-specific translator (or it failed).

But my point is that making even a small number of translators (or the EM translator in general) more strict to force fallback isn't the right fix, because we still want anything else that the site-specific (by which I just mean non-generic, not that it's tailored to a specific site) translators can provide, which could include a PDF, tags, etc. (In this case it doesn't for some reason, but it could.) (Granted, if this one fell back to DOI, that would also get the PDF, because it's OA, but this problem could happen with a gated journal too.)

I don't think we need to fix site-specific translators' metadata problems by using the combined translator. If a site-specific translator is implemented, it should be better by default, because translator authors should know what they are doing and find the best way to extract metadata, even if they need to additionally get metadata by DOI. And they can use all the same methods that are used in the combined translator.

Also, the output of site-specific translators can be controlled with tests, and the problems should be fixed within the same translator. Therefore I agree with @adam3smith that making a few translators stricter could be a solution.

From the forums, here's an example where DOI gets better metadata overall but EM gets the PDF, Abstract, and a better date: http://www.sdewes.org/jsdewes/pid6.0223

The combined translator would successfully extract the correct metadata from that URL. But let's imagine that someone decides to make a translator for that URL. If so, then that translator would have to combine metadata from EM and DOI to reach the same quality. And basing it on the combined translator wouldn't be a good idea, because it's going to do too much magic. Therefore, if the translator author sees that EM returned an item with a missing ISSN or an imprecise date, those fields should be extracted either from the page or by DOI.


Also, we are discussing adding MODS, MARCXML, and JSON-LD to the EM translator, but what if the page has multiple items? EM is single-item only.

dstillman commented 5 years ago

Also, we are discussing adding MODS, MARCXML, and JSON-LD to the EM translator, but what if the page has multiple items? EM is single-item only.

I would think that retrieval based on <link> would go in a separate Linked Metadata translator called from the combined translator, similar to unAPI, not in EM. But in-page JSON-LD might go in EM, in which case it would need to possibly handle multiple items. Do you mean that it'd be a problem in terms of EM being called from other translators that expect a single item?
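
As a sketch of what detection in such a Linked Metadata translator might look like (the rel/type values here are assumptions about how sites would expose MODS/MARCXML records):

function getMetadataLinks(doc) {
    // look for <link rel="alternate" type="application/mods+xml" href="..."> style records in the head
    var wanted = ['application/mods+xml', 'application/marcxml+xml'];
    var links = doc.querySelectorAll('link[rel="alternate"]');
    return Array.prototype.filter.call(links, function (link) {
        return wanted.indexOf(link.getAttribute('type')) !== -1;
    });
}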

mrtcode commented 5 years ago

EM is currently designed to only return a single item, and all dependent translators also expect a single item. And yeah, I'm thinking about how that influences other translators.

Also, if we start advising people to use MODS/MARCXML, we should expect translators that wrap that Linked Metadata translator and improve some fields, just like EM is used now in other translators.

dstillman commented 5 years ago

all dependent translators also expect a single item

That's not true — it's just a callback on itemDone, which can run more than once (the same way that, say, the Google Scholar translator can call an import translator like RIS and save more than one item). I'm not actually sure what happens now if a child translator calls selectItems(), but there's a good chance it just triggers the usual selection window.

mrtcode commented 5 years ago

all dependent translators also expect a single item

That's not true — it's just a callback on itemDone, which can run more than once (the same way that, say, the Google Scholar translator can call an import translator like RIS and save more than one item). I'm not actually sure what happens now if a child translator calls selectItems(), but there's a good chance it just triggers the usual selection window.

I was thinking about cases like this, where the translator actually trusts that it gets a single item, because otherwise it would add the same abstract to all items, which wouldn't make sense. Of course, if a website for which the site-specific translator was implemented has only one item, why should it ever return multiple items?

Anyway, if we are adding JSON-LD, which can return multiple items, then logically we should add COinS too, which can also return multiple items. But again, I am trying to understand what the consequences will be of making EM a multi-item translator.

Also, adding JSON-LD and COinS to EM means there must be logic in the EM translator that combines metadata when multiple methods exist. And what if RDF returns a single item while JSON-LD or COinS returns multiple?

mrtcode commented 5 years ago

I'm thinking that the Embedded Metadata translator name is maybe a little bit confusing and sets our thinking on the wrong path: partially because we always imagined it as a last-resort generic translator that has all the metadata-extraction methods inside, and partially because it's used in many other translators and we want it to automatically extract as much metadata as possible so that translator developers only have to fix up the last few fields.

But let's imagine what would happen if the Embedded Metadata translator were renamed to something narrower like Meta Tags Translator. It would be just like a regular non-site-specific translator, e.g. COinS.

So I think all translators should be separated:

A few more reasons why not to merge any other translators into the EM translator:

So my suggestion is to keep all translators separate, use them in site-specific parent translators separately, and then introduce a combined last resort translator that intelligently uses all the previously listed separate translators.

The combined translator wouldn't be used in any other translator, except maybe in multi-domain translators, because they can't control their output quality with tests but are in danger of blocking the combined translator. In that case the combined translator could be invoked with the already-extracted metadata, which would be utilized too.

dstillman commented 5 years ago

I think that makes a lot of sense.

Only somewhat related, but one general concern I have is that, traditionally, we've been pretty complacent about the data available on a given site — we've mostly just accepted that what's there is the best we can do, even if some fields are missing. It would be nice to figure out ways to make sure we're getting as much data as possible, even if it means using other services. I don't think it's realistic to solve that purely by convention and tests (e.g., by using the DOI translator as a dependency more liberally, though we can do that too), and I still think we may want to consider certain thresholds or rules that trigger automatic supplementation of the data when possible.

mrtcode commented 5 years ago

Well, that sounds similar to what we are trying to do with zotero/zotero#1582.

If we trust that translators are already doing their best to extract metadata from the page, there is no need to perform any additional generic translation for the page. So the only thing left is to utilize identifiers to retrieve metadata from additional sources, which is what we are doing in zotero/zotero#1582:

  1. Resolve an identifier with our resolver API (currently only DOI) if there isn't one already
  2. Get metadata by identifier (other ids besides DOI have limited querying capabilities)
  3. Get metadata from publisher website if we are not translating it already (resolve the publisher URL over doi.org)
  4. Combine metadata

And actually the combined translator will have some similarities with the metadata update logic in the client, i.e. it gets metadata by an identifier (DOI) and combines metadata. I'm a little bit concerned about duplicated operations in some situations. For example, if a user manually triggers a metadata update in the Zotero client and the combined translator takes over, the metadata will be fetched from the DOI RA and combined twice: once by the combined translator and once by the metadata update logic in the client. It would be nice to somehow converge both logics.

We were previously discussing automatically triggering the metadata update logic when saving items via the Zotero client lookup dialog or the connector, but I think the conclusion was to proceed with manually triggered metadata updating and see how it performs.

We had concerns about leaking our usage stats and querying some identifier APIs too often.

I'm also concerned about Zotero connector/bookmarklet and cross-origin requests. What are our limitations here?

mrtcode commented 5 years ago

I'm waiting for any suggestions on how we could improve generic metadata extraction, but if no one objects I'll start implementing the roadmap below. And of course everyone is welcome to work on any part too.

  1. Update Embedded Metadata translator:
    • Make sure it's only extracting from meta tags and isn't doing anything beyond its scope, like addLowQualityMetadata
    • If some site-specific translators depend on the addLowQualityMetadata result, fix them
  2. Update DOI translator:
    • Extract DOI from the web page URL
    • Return results in the original order
  3. The combined translator:
    • Set its priority higher than any other generic translator, i.e. EM, COinS, unAPI, DOI, etc.
    • Do detection and use the generic translators to extract metadata:
      • DOI translator:
        • Optimistically get metadata for the first DOI (from the URL or body) and try to match the article; otherwise get metadata for all other DOIs and try again
        • Utilize Zotero.Utilities.levenshtein plus some additional magic to match DOI metadata against the web page title and maybe some other fields (see the sketch after this list)
      • Embedded Metadata
      • Linked metadata
      • JSON-LD
      • Microdata
      • COinS
      • unAPI
    • Run addLowQualityMetadata, which would be taken over from EM, plus maybe some additional logic like automatic abstract extraction, etc.
    • Combine metadata:
      • Use DOI metadata as a base
      • Use other translators' metadata to fill empty fields, or replace fields when we can detect that a specific field is better
    • Automatically decide whether the final result should be single or multiple
  4. Implement linked metadata translator:
    • Get metadata from various sources
    • Combine metadata field by field, take inspiration from RDS translator
  5. Implement JSON-LD translator
  6. For multi-domain translators add a fallback to the combined translator
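
As a sketch of the DOI-to-page matching mentioned above (the normalization and the 25% threshold are assumptions):

function titlesMatch(pageTitle, doiTitle) {
    var a = (pageTitle || '').toLowerCase().replace(/\s+/g, ' ').trim();
    var b = (doiTitle || '').toLowerCase().replace(/\s+/g, ' ').trim();
    if (!a || !b) return false;
    // allow small differences (punctuation, subtitles); the threshold is a guess
    var distance = Zotero.Utilities.levenshtein(a, b);
    return distance / Math.max(a.length, b.length) < 0.25;
}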

The improvements will be made in steps; to begin with, we basically just want to wrap the DOI and EM translators with the new combined translator.

As soon as the combined translator wraps the other translators, I will use its output to collect and compare metadata from all URLs in translator tests. This will allow us to review how metadata differs between the various translators and should give a better idea of how to combine metadata from different translators.

zuphilip commented 5 years ago

I agree that it is cleaner to have separate translators and one combining translator. However, I cannot say which parts of the EM translator (meta tags, microdata, low-quality data, ...) the currently 100+ dependent translators depend on, nor what this would mean for future changes. Maybe you can help me answer some questions around that aspect:

Can we do the same things we can do currently in dependent translators?

In a dependent translator I would then still be able to call any of the new separate translators, or possibly more than one. However, you said that I should usually not call the merged translator, yet addLowQualityMetadata would only live there. If some of that data has to be added manually in my dependent translator as well, then I possibly have to add steps similar to the addLowQualityMetadata function to my dependent translator. Is that correct? Is this then a possible code duplication?

Can we do the same things in a dependent translator with some easy code?

I could imagine that for a website I would need a specific translator for the multiples, and for most of the metadata I could then use a mixture of JSON-LD, meta tags, and microdata. Then I would possibly need to call all three translators, e.g. in a nested way:

function scrape(doc, url) {
    var translatorEM = Zotero.loadTranslator('web');
    translatorEM.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48'); // Embedded Metadata
    translatorEM.setHandler("itemDone", function(obj, itemEM) {
        var translatorJSONLD = Zotero.loadTranslator('web');
        translatorJSONLD.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48-jsonld'); // hypothetical JSON-LD translator ID
        translatorJSONLD.setDocument(doc);
        translatorJSONLD.setHandler("itemDone", function(obj, itemJSONLD) {
            var translatorMICRODATA = Zotero.loadTranslator('web');
            translatorMICRODATA.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48-microdata'); // hypothetical Microdata translator ID
            translatorMICRODATA.setDocument(doc);
            translatorMICRODATA.setHandler("itemDone", function(obj, itemMICRODATA) {
                /*
                combine itemEM, itemJSONLD, itemMICRODATA here
                and/or add some site-specific data,
                then complete only the combined item
                */
                itemMICRODATA.complete();
            });
            translatorMICRODATA.translate();
        });
        translatorJSONLD.translate();
    });
    translatorEM.getTranslatorObject(function(trans) {
        trans.itemType = "newspaperArticle";
        trans.doWeb(doc, url);
    });
}

Or is there a much easier way to do the same thing? Does all this nesting even work? I remember some problems with EM being called from other translators (sandboxing hell?), but maybe they are solved. Feasibility aside, this code is IMO quite difficult to work with. Could we possibly add some helper functions, maybe in Zotero.Utilities, for such cases?

(I hope it is okay that I play here the devil's advocate with my questions. If you think that is not helpful, then you can also let me know.)

zuphilip commented 5 years ago

No reason to combine only some (i.e. JSON-LD) translators to EM and leave others. Better keep them all separate and simple.

Some vocabularies like Dublin Core or schema.org can be written either as meta tags, microdata, or JSON-LD. The syntax differs and could be handled by separate translators, but the semantics (e.g. assigning DC.title to the title field in Zotero) are the same and should be reused.

dstillman commented 5 years ago

Yeah, I'm not sure removing addLowQualityMetadata from EM makes sense. That includes literal <meta> tags like author and keywords and even some OG tags, which seem like they should be extracted along with the other stuff. The byline extraction based on arbitrary classes (byline and vcard) seems like a potential candidate for moving to a utility function that could be called explicitly by other translators, including the combined translator.

Re: nesting, we're developing this on a branch where we have async/await support in translators (though we still need to figure out how network requests should work), and I'm going to try to make let items = await translatorJSONLD.translate() work for child translators. You should even be able to create multiple translator objects and do something like let itemArrays = await Promise.all(translators.map(t => t.translate())) to benefit from parallel network requests.
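
Under those assumptions (awaitable child translators), the nested example above could collapse to something like this (the JSON-LD/Microdata translator IDs and the merge helper are placeholders):

async function scrape(doc, url) {
    let ids = [
        '951c027d-74ac-47d4-a107-9c3069ab7b48', // Embedded Metadata
        '<hypothetical JSON-LD translator ID>',
        '<hypothetical Microdata translator ID>'
    ];
    let translators = ids.map(id => {
        let translator = Zotero.loadTranslator('web');
        translator.setTranslator(id);
        translator.setDocument(doc);
        return translator;
    });
    // run the generic translators in parallel and combine their output
    let itemArrays = await Promise.all(translators.map(t => t.translate()));
    let [itemEM, itemJSONLD, itemMICRODATA] = itemArrays.map(items => items[0]);
    let item = combineItems(itemEM, itemJSONLD, itemMICRODATA); // hypothetical merge helper
    item.itemType = "newspaperArticle";
    item.complete();
}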

dstillman commented 5 years ago

Some vocabularies like Dublin Core or schema.org can be written either as meta tags, microdata, or JSON-LD. The syntax differs and could be handled by separate translators, but the semantics (e.g. assigning DC.title to the title field in Zotero) are the same and should be reused.

Specifically, they would all just forward to RDF.js, like EM does now. We discussed this previously in the context of JSON-LD.

mrtcode commented 5 years ago

The combined translator (I actually named it "Generic.js") is functioning, and I am currently testing it with journal articles from 3.5k unique publishers.

So the goal is to make this translator intelligent enough to automatically decide whether it's returning a single item or multiple items. But that's quite challenging to do in a generic way.

In the past, Zotero automatically used DOIs from the page, but the decision was made to change that because the translator never knows whether a DOI belongs to the current article, search results, references, or the next article in the journal. But actually the same problem applies to JSON-LD, COinS, unAPI, and Microdata: you are never sure whether the metadata describes the item on the current page or something else.

The following ways are used to detect whether the current web page represents a single item:

  1. There is a single DOI in the URL
  2. There is Embedded Metadata (it's in HEAD and always means a single item)
  3. There is a DOI in the Embedded Metadata result
  4. Linked Metadata (in HEAD; if it were in BODY, that would be a different story) also always represents a single item
  5. An item from JSON-LD, DOIs, COinS, unAPI, or Microdata (not implemented yet) is matched against the title from document.title or, in some circumstances, from H1, H2, H3

To put it simply, all metadata in HEAD represents a single item (except JSON-LD and unAPI), and all metadata in BODY can represent single or multiple items - but you never know which.

So not only are the extracted DOI items matched against the page title, but so is all the other metadata where we can't assume that it undeniably represents a single item.

And then the combined translator cross matches, deduplicates and combines item metadata from different translators.

JSON-LD

Now a few thoughts regarding JSON-LD. The translator is working. It transforms JSON-LD to RDF, and it does that without any library, therefore supporting only the compacted JSON-LD format; but it has worked fine with all the websites I encountered, even though that's totally not according to the standard. The jsonld library was 20K lines, and the current JSON-LD-to-RDF code is 50 lines.
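
For context, pulling bibliographic fields out of compacted schema.org JSON-LD without a library can stay very small; here's a rough sketch (not the actual Generic.js code, and the type/property handling is deliberately simplistic):

function extractJSONLDItems(doc) {
    var results = [];
    var scripts = doc.querySelectorAll('script[type="application/ld+json"]');
    for (var i = 0; i < scripts.length; i++) {
        var data;
        try {
            data = JSON.parse(scripts[i].textContent);
        }
        catch (e) {
            continue; // invalid JSON-LD is not uncommon
        }
        var nodes = Array.isArray(data) ? data : [data];
        for (var j = 0; j < nodes.length; j++) {
            var node = nodes[j];
            if (node['@type'] == 'ScholarlyArticle' || node['@type'] == 'Article') {
                results.push({
                    title: node.headline || node.name,
                    date: node.datePublished,
                    abstractNote: node.description
                });
            }
        }
    }
    return results;
}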

I know we are considering recommending that people expose metadata in this format, but I see huge problems with it:

1) JSON-LD can contain nested metadata with sophisticated relations and many different ways of representing it, while our JSON-LD-to-RDF method is relatively dumb: it just searches all over the RDF for items. The difficulty of processing JSON-LD according to schema.org vocabulary nuances would be out of this translator's scope.

2) There can be multiple types for the same web page, or the same page can have multiple items. But again, everything is too dynamic to figure out what belongs to what.

3) It produces more "noise". Even though it's still relatively rare to encounter this format on publisher websites, it already results in many empty or partially empty items. The more mainstream it becomes, the more noise we will get. More and more data will be exposed, but we are interested only in bibliographic data, which is just a small part of it.

4) The format is sophisticated, and I already see a trend of website maintainers finding it difficult to produce quality metadata. Some of the JSON-LD is even invalid.

So I think we shouldn't recommend this format. It's a little bit "Wild West". Something that isn't so mainstream and is more targeted at bibliography would be a better choice.

dstillman commented 5 years ago

For JSON-LD, I wouldn’t let the size issue affect our decision too much. We can put a library in a separate utility file without putting it in the translator itself. We would probably want to avoid injecting it into every page, but if we can do detection without the library we might be able to inject it dynamically for saving when necessary.

mrtcode commented 5 years ago

Good to know, but the missing jsonld.js is totally unrelated to the listed JSON-LD downsides.

zuphilip commented 5 years ago

To put it simply, all metadata in HEAD represents a single item (except JSON-LD and unAPI), and all metadata in BODY can represent single or multiple items - but you never know which.

A conservative approach as you describe seems fine to me. I would restrict point 5 to cases where the H1, H2, or H3 is unique within the page. Some unnecessary multiple results shouldn't be too troublesome. In the worst case, the user has to make two extra clicks if they are only interested in the main entry.

The jsonld library was 20K lines, and the current JSON-LD-to-RDF code is 50 lines.

It is also possible to think about switching completely from RDF to JSON-LD as our main supported format, i.e. replacing RDF.js. I don't know how feasible that is or how much work it would mean. But RDF.js is always ugly to work with, and some parts are really old, e.g. originating from Tim Berners-Lee. However, we may not want to do this within this PR.

So I think we shouldn't recommend this format. It's a little bit "Wild West". Something that isn't so mainstream and is more targeted at bibliography would be a better choice.

Interesting that you say mainstream has some disadvantages. AFAIK, COinS is the only dedicated bibliographic format that can be embedded within a website. Every other bibliographic format has to be linked from a website with <link> or unAPI, but we don't see that often. We could try to promote them more? As a website maintainer, I could then choose some meta tags and schema.org markup to optimize the appearance in search engines, and this would not interfere with the actual bibliographic data.