Generic VuFind web translator

zoe-translates commented 11 months ago

Summary: We should support a generic VuFind web translator, because of the ubiquity of VuFind and the needs of its userbase. Cf. https://github.com/zotero/translators/pull/2969#issuecomment-1777846439

This is a pretty common catalog in Germany, where lots of libraries used to support (just) the Citavi software, which has become much less attractive because of both pricing and declining customer support (so I'm told), with lots of libraries looking at Zotero instead. Given the customizability of VuFind, this would likely be a bit tricky, but would have high impact.

With a generic translator, Zotero will work reasonably well out of the box for VuFind sites while making it easy to write one with customizations for a specific site.

How to do this (general plan):

[x] Prepare a generic VuFind translator based on #2969 (EDIT 👉🏼: The draft PR is #3173)
[ ] Update existing translators to use this generic VuFind translator (see the comment linked above)
[ ] In the PR backlog, identify the new translators that could benefit from this

Technical details for step 1:

[x] Housekeeping based on #2969 (rebase, update metadata, transform code style into our usual one, etc.)
[x] Turn it into a generic translator with configurable options that can be set from a parent translator. Mostly, this is about choosing the best input format (MARC, RIS, EndNote, etc.) for the parent translator's particular target. Different institutions support different subsets of export formats, and among them some formats provide better metadata than others (details and examples below).
[ ] Investigate the feasibility of using the "bulk export" facility for handling multiple items. However, this doesn't appear to be a common feature (supported by IxTheo and the National Library of Finland, but not by Wellesley College).

An example (for IxTheo) is a GET request to https://ixtheo.de/Cart/doExport?f=RIS&i[]=Solr|1640914242&i[]=Solr|1165461676&i[]=Solr|1643208144 which exports the three items as one RIS file. If this can be done in a generic way, it will be more efficient than handling each selected item separately. UI-wise, in the Zotero item selector window, we can support pre-selecting the items already selected on the web app's search page. For what the search page looks like, see https://ixtheo.de/Search/Results?lookfor=christus+victor&type=AllFields&botprotect=
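To make the shape of that request concrete, here is a minimal sketch of how such a URL could be assembled. The function name `buildBulkExportURL` is hypothetical, and the path and parameter names are simply copied from the IxTheo example; other VuFind sites may use a different layout or not offer bulk export at all.

```javascript
// Hypothetical helper: build a VuFind "bulk export" URL for several records.
// The path (/Cart/doExport) and the f / i[] parameters are taken from the
// IxTheo example in this issue; they are assumptions for any other site.
function buildBulkExportURL(origin, format, recordIDs) {
	if (!recordIDs.length) {
		throw new Error("No record IDs given");
	}
	// One i[] parameter per record, each prefixed with the "Solr|" source tag
	const items = recordIDs.map(id => "i[]=Solr|" + encodeURIComponent(id));
	return origin + "/Cart/doExport?f=" + encodeURIComponent(format) + "&" + items.join("&");
}

const exampleURL = buildBulkExportURL(
	"https://ixtheo.de",
	"RIS",
	["1640914242", "1165461676", "1643208144"]
);
console.log(exampleURL);
```

The brackets and pipe are left unescaped here only to mirror the example URL; whether a given server needs them percent-encoded is something to check per site.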

The logic of setting the input format

Depending on the institution, some formats may be better than others. For example:

Wellesley College: https://libcat.wellesley.edu/Record/ebs14973473e Correct type (book) from RIS, incorrect (journalArticle) from EndNote
IxTheo: https://ixtheo.de/Record/859089061 Correct (or more accurate) type (map) from EndNote, incorrect (or less accurate) format (book) from RIS

So the proposed method is to allow any translator that calls this one to set a preferred input format (based on domain-specific knowledge about that resource; default nothing). If this explicitly-set preferred format is not available, it's a throwable error (because someone explicitly set it).

If there's no preference set, the steps go as follows.

From the doc, get a list of supported formats by scraping.
Filter the list by what we support, while putting it in the order of "generically better" (??? should we? By this I mean, suppose we can say "most of the time, MARC is better than RIS, and RIS better than EndNote.")
We can
- either simply take the first (best) format and use that for all of the multiple items to be translated (we trust that the site only lists supported formats), (✅ this is my personal favourite)
- or first try from the head of the list of formats with the first item to translate. If that works, we use that format for all items. If somehow not (which is not very likely), try further down the format-list until we find a working one, and use that for subsequent items.
- ❌ What we will not do is to repeat this trial-and-error for each item to be translated -- the current approach in #2969, because this is wasteful and botty.
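The first option could be sketched roughly as follows. This is only an illustration of the selection logic, not the code in the draft PR: `chooseFormat` and `GENERIC_RANKING` are hypothetical names, and the ranking itself (MARC, then RIS, then EndNote) is the tentative ordering floated above.

```javascript
// Assumed "generically better" ranking, best first. This ordering is the
// tentative one from the discussion, not a settled decision.
const GENERIC_RANKING = ["MARC", "RIS", "EndNote"];

// Hypothetical helper: choose one export format for all items in a batch.
// `available` is the list of formats scraped from the page; `preferred` is
// an optional format set explicitly by a calling translator.
function chooseFormat(available, preferred = null, ranking = GENERIC_RANKING) {
	if (preferred !== null) {
		if (!available.includes(preferred)) {
			// The caller asked for this format explicitly, so fail loudly
			throw new Error(`Preferred format ${preferred} is not offered by this site`);
		}
		return preferred;
	}
	// Keep only formats we understand, in ranking order
	const candidates = ranking.filter(format => available.includes(format));
	if (!candidates.length) {
		throw new Error("No supported export format found");
	}
	// Trust the site's list and use the best candidate for every item
	return candidates[0];
}
```

The format is chosen once per page, so there is no per-item trial and error.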

We might even consider a pre-built "best format" table keyed by domain names, for example, associating RIS with Wellesley and EndNote with IxTheo. But this looks out of place in a generic translator and if we do it, we should keep the table minimal, only supporting well-known sites.
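For illustration, such a table could be as small as the sketch below. The names `BEST_FORMAT_BY_HOST` and `bestFormatForURL` are hypothetical, and the two entries are just the Wellesley and IxTheo examples from above.

```javascript
// Minimal table of known-best formats, keyed by hostname. The two entries
// come from the examples in this issue; nothing else is implied.
const BEST_FORMAT_BY_HOST = new Map([
	["libcat.wellesley.edu", "RIS"],
	["ixtheo.de", "EndNote"],
]);

// Hypothetical helper: return the known-best format for a record URL,
// or null when the site is not in the table.
function bestFormatForURL(url) {
	const host = new URL(url).hostname.toLowerCase();
	return BEST_FORMAT_BY_HOST.get(host) ?? null;
}

console.log(bestFormatForURL("https://ixtheo.de/Record/859089061"));
```

A `null` result would mean "no site-specific knowledge", leaving the choice to whatever generic ordering is in place.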

@adam3smith What do you think about this? I appreciate your comments. Thanks!

adam3smith commented 11 months ago

Generally sounds good to me. Thanks for jumping on this so quickly. Couple of quick thoughts:

Filter the list by what we support, while putting it in the order of "generically better" (??? should we? By this I mean, suppose we can say "most of the time, MARC is better than RIS, and RIS better than EndNote.")

I think RIS and EndNote (I assume that's Refer/BibIX?) will tend to be about the same for library catalogs. MARC should indeed almost always be better, so yes, a generically better list should work (we make this call with translators all the time, of course).

[Bulk Export] If this can be done in a generic way, it will be more efficient than handling each selected item separately.

Given how most people use library catalogs -- they're rarely used for systematic-review-type mass import -- and the fact that library catalogs virtually never have strong policies/measures against scraping, I don't think we need to be super concerned about efficiency on multiple import here if it comes at any price (e.g., we can't get to MARC)

We might even consider a pre-built "best format" table keyed by domain names, for example, associating RIS with Wellesley and EndNote with IxTheo. But this looks out of place in a generic translator and if we do it, we should keep the table minimal, only supporting well-known sites.

I think this is preferable to having individual mini-translators for those instances that require additional maintenance. Several library catalog translators have institution-specific variation in the code, and especially if this is well-contained, as in a mapping table, I find it much preferable to separate mini translators calling a generic one and adding little. I'd consider most university libraries "well-known sites"

zoe-translates commented 11 months ago

OK so as I understand it, we could do this:

In the generic translator, put some effort to "snoop" the most appropriate format based on domain name (institution). The goal is to make the generic translator work out of the box most of the time.
Because this cannot be guaranteed to work all the time, and because VuFind is highly customizable, we should think about more specialized translators loading this one and make it possible to override specific traits (most importantly the input format). This is what I'm doing ATM, by providing some flags that the parent translator can set.
For multiple import, the efficiency gained by bulk import is the reduction of network overhead that scales with the number of items imported. But this is not essential and I'll first keep this idea open but relegate it to the backburner.

As for formats: I get it that MARC will be closest to the underlying data and the most informative, but I'm not sure if our support for it is the best. But anyway this is perhaps a different issue. In any case, I'll try to find a balance between making it work out of the box for more sites, and simplicity of the code.

Thank you for your feedback!

adam3smith commented 11 months ago

I get it that MARC will be closest to the underlying data and the most informative, but I'm not sure if our support for it is the best.

I think our MARC translator is quite good. The biggest advantage is that the format is well-defined, so if something is wrong we can fix it. RIS barely has public specs and even those are routinely ignored (hence the 1500+ lines translator....). That said -- and someone just reminded me of this -- MARC is terrible for journal articles (all journal info gets put into a single poorly defined field 773). Most library catalogs don't have MARC for journal articles, but where they do, that might be a place where RIS is preferable (sorry, I know that complicates things further).
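One way that exception might be expressed, purely as a sketch: if a record carries field 773, try RIS before MARC. The function name `formatForRecord` and the idea of passing in a plain list of field tags are assumptions for illustration; in practice the tags would only be known after the MARC record has been fetched.

```javascript
// Hypothetical helper: pick MARC or RIS for a single record.
// `marcTags` is a list of field tags present in the record, for example
// ["245", "773"]; how the tags are obtained is outside this sketch.
function formatForRecord(marcTags, available) {
	// Field 773 holds the host-item (journal) information for articles,
	// so its presence is used here as a rough "this is an article" signal
	const order = marcTags.includes("773") ? ["RIS", "MARC"] : ["MARC", "RIS"];
	return order.find(format => available.includes(format)) ?? null;
}

console.log(formatForRecord(["245", "773"], ["MARC", "RIS"]));
```

This is a heuristic at best: where the RIS is itself generated from the MARC record, switching formats may not change the result.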

zoe-translates commented 11 months ago

Exactly. Case in point: https://www.finna.fi/Record/deutschebibliothek.37415 -- a single article in a yearbook, which our translator identifies as book.

adam3smith commented 11 months ago

yeah, but so does the RIS (which is generated from the MARC, I assume)

zoe-translates commented 11 months ago

Interesting stuff: For some (all??) libraries, it seems that MARC is actually supported even when it is not listed as an available export format. Examples:

https://ixtheo.de/Record/1640914242
https://librarysearch.aut.ac.nz/vufind/Record/1253127

And in general, MARC is more likely to produce more accurate item types (the exception is the Finna item in the earlier comment).

mathieugrimault commented 11 months ago

MARC export is enabled by default in VuFind but this is customizable.

I haven't dug into the code, but I believe VuFind exports the original MARC record, which can be MARC21 or UNIMARC.

May I suggest another way: a specific export in VuFind for Zotero? Is there some kind of Zotero format?

zoe-translates commented 10 months ago

@mathieugrimault, could you tell me what you mean by "a specific export in VuFind for Zotero"? I don't think VuFind by default exports to "Zotero format" (Zotero RDF maybe)?

Where I'm at, the translator will try MARC first, followed by other formats understood by Zotero. On top of this, there are site-specific idiosyncrasies. Some sites, for instance, may disable MARC, while others may produce a MARC record poorly processed by Zotero for some reason. For a few known cases, I'm using basically a pre-defined list of best-supported formats.

mathieugrimault commented 10 months ago

The idea was to add some code to VuFind: an export in something easy for Zotero to import. I've seen some files in Zotero RDF, but it seems quite complicated to implement and I haven't found much documentation about it. And it is probably overkill when VuFind can export in RDF.

maccabeelevine commented 9 months ago

Hi folks, I'm happy to see this work and I mentioned it at this week's VuFind developer meeting. Thanks to all working on this and the prior translators. My institution is interested in filling some gaps as well. Given the goal above

Zotero will work reasonably well out of the box for VuFind sites while making it easy to write one with customizations for a specific site.

I'd like to understand the full effect the translator would have, specifically how it would interact with or (I think) override the existing support for Zotero. VuFind out-of-the-box exposes a COinS OpenURL with parameters customized based on format for book, article, journal and unknown formats. I'm new to Zotero, but I think the effect of publishing this translator would be to tell the browser plug-in to ignore that OpenURL entirely and do a fresh export based on one of the translator's defined formats. If that's true, then I think it's important to ensure that the translator handles all params currently exposed via OpenURL for all the currently supported formats. Otherwise this would be a downgrade from some of the built-in support. Please correct me if I'm off base!

Related, I think this translator (and its predecessors) are necessary in the first place because the built-in support is not enough. But do we know specifically what's missing? Because it might also be possible to define those gaps and address them in VuFind itself, i.e. a VuFind config.ini setting that automatically links to RIS or RDF or whatever Zotero would be happiest identifying to pull in automatically without a translator -- if Zotero does that. Sorry to be late to the discussion and I don't want to throw in a wrench if the translator is already robust. Happy to discuss, and to contribute development on the VuFind side if needed.

damien-git commented 2 weeks ago

Any update on this? We (MSU) are looking for a way to improve Zotero support and always include the call numbers. COinS OpenURL does not support call numbers, so it looks like we should use a translator. But I would hate to create yet another VuFind translator when there is an effort to create a generic one. In our case the call number is in 952e (the MARC record comes from FOLIO), so I guess we would have to use customizeMARC(). The RIS export would work too (we have customized it to add more fields). It would be nice to have an option to prefer one export format over another when using the generic translator.
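As a rough idea of what the 952e lookup amounts to, here is a sketch over a simplified record shape. Everything in it is an assumption for illustration: the `getCallNumber` name, the `{ tag, subfields }` structure, and the sample values are made up, and this is not the MARC translator's actual API or the signature of customizeMARC().

```javascript
// Hypothetical, simplified record shape: each field has a tag and a
// subfields object keyed by subfield code. NOT the MARC translator's API.
function getCallNumber(fields, tag = "952", code = "e") {
	for (const field of fields) {
		if (field.tag === tag && field.subfields[code]) {
			// Return the first matching subfield, without surrounding spaces
			return field.subfields[code].trim();
		}
	}
	return null;
}

// Made-up sample data, only to show the lookup
const sampleFields = [
	{ tag: "245", subfields: { a: "An example title" } },
	{ tag: "952", subfields: { e: " QA76.9 .D3 " } },
];
console.log(getCallNumber(sampleFields));
```

Keeping the tag and subfield code as parameters would let a site-specific setting change them without touching the lookup itself.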

zotero / translators

Generic VuFind web translator #3172