zotero / translators

Zotero Translators
http://www.zotero.org/support/dev/translators

Expand embedded metadata detection #77

Open avram opened 12 years ago

avram commented 12 years ago

The <link rel="alternate" /> syntax for providing alternate representations should be used when we look for embedded metadata. A recent discussion notes a site providing dissertations that we don't import correctly. In addition to Google/Highwire metadata which we're parsing, it includes such <link rel="alternate" /> references to structured descriptions:

<link href="http://umu.diva-portal.org/smash/getreferences?referenceFormat=librismarcxml&pids=diva2:459013"
  rel="alternate" title="MARC-XML Representation" type="text/xml" />
<link href="http://umu.diva-portal.org/smash/getreferences?referenceFormat=swepubmods&pids=diva2:459013"
  rel="alternate" title="MODS Representation" type="text/xml" />

I don't think we can expect to read these as-is, since the text/xml type is too vague, but we should look for known types for formats we do read, just as we do when intercepting RIS/BibTeX downloads. That means application/mods+xml for MODS, etc.
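A rough sketch of that lookup, in Python for illustration only (this is not translator code). The `KNOWN_TYPES` table and the names `AlternateLinkFinder` and `find_alternates` are hypothetical; only application/mods+xml comes from the comment above, and the other MIME types are assumptions about what sites might commonly declare.

```python
from html.parser import HTMLParser

# Assumed mapping from declared MIME type to a format we can already import.
# Only application/mods+xml is named in the thread; the rest are illustrative.
KNOWN_TYPES = {
    "application/mods+xml": "MODS",
    "application/marcxml+xml": "MARCXML",
    "application/x-research-info-systems": "RIS",
    "application/x-bibtex": "BibTeX",
}


class AlternateLinkFinder(HTMLParser):
    """Collects <link rel="alternate"> tags whose type is a known format."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (format, href) tuples, in document order

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed here by HTMLParser.
        if tag != "link":
            return
        a = {name: (value or "") for name, value in attrs}
        rels = a.get("rel", "").lower().split()
        fmt = KNOWN_TYPES.get(a.get("type", "").strip().lower())
        if "alternate" in rels and fmt and a.get("href"):
            self.found.append((fmt, a["href"]))


def find_alternates(html):
    """Return (format, href) pairs for alternates with a recognized type."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.found
```

With this table, the two text/xml links quoted above would be skipped, which is the behavior described: vague types are ignored unless some other hint identifies the format.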

aurimasv commented 12 years ago

With recent changes to Embedded Metadata, EM performs quite well on the linked page. It properly detects the type as thesis, which is not detected with either MODS or MARCXML.

In this particular case, MARCXML performs quite poorly, but it does supply numPages and seriesNumber, while the others do not.

MODS supplies the full abstract and picks up additional authors, which are probably more appropriately classified as contributors (e.g. professor, university). Is this even desirable? It also contains the ISBN and publication title, but these are not preserved in a thesis itemType.

So the advantage here is that EM detects it as a thesis, but it would be nice to get the full abstract. If we decide to supplement EM data with MARC or MODS, it may become difficult to determine which data we would prefer. Would we include all the authors? Add all the notes? (MODS adds 7 additional notes, which are fairly redundant and don't seem to enhance the metadata.)

One feasible solution, I think, would be to parse linked MODS and MARC pages and only supplement fields that are completely missing in the EM translator. This would leave out the abstract. Perhaps we can say that MODS/MARC abstracts will always be more complete?
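To make the "supplement only missing fields" idea concrete, here is a minimal sketch in Python (illustrative only, not translator code). The function name `supplement` and the `prefer_linked` parameter are hypothetical. Treating a longer linked abstract as the more complete one is an assumption that follows the question above, not an established rule.

```python
def supplement(em_item, linked_item, prefer_linked=("abstractNote",)):
    """Fill gaps in the EM result from a linked MODS/MARC record.

    EM values win by default. Fields named in prefer_linked are replaced
    only when the linked value is longer (assumed to mean "more complete").
    """
    empty = (None, "", [])
    merged = dict(em_item)
    for field, value in linked_item.items():
        if value in empty:
            continue  # nothing useful to add
        if field not in merged or merged[field] in empty:
            merged[field] = value  # field was completely missing in EM
        elif field in prefer_linked and len(str(value)) > len(str(merged[field])):
            merged[field] = value  # e.g. the fuller MODS abstract
    return merged
```

Under this sketch the itemType stays thesis and the EM author list is kept as-is, while numPages and seriesNumber from MARCXML would be filled in. Passing `prefer_linked=()` gives the stricter variant that leaves the EM abstract alone.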

dstillman commented 8 years ago

I was just wondering to myself why we didn't do this. Seems like it would be absolutely trivial to implement, for us and publishers. We're also trying to improve embedded metadata support by adding support for JSON-LD, but this would be a lower-tech solution when sites already have BibTeX, RIS, etc. — basically, an easier, more standards-compliant, non-abandoned unAPI. The HTML 5.1 draft also allows <link> in body content with itemprop, which would allow the use of this for multiple items in a page.

Any reason we shouldn't do this? I guess the biggest downside is that, as with unAPI, we'd need to make a separate request for each link and run detection on the result in order to show a proper icon.

adam3smith commented 8 years ago

no reason from my side. @zuphilip has brought this up, too (he'd know where, but in some related EM discussion), and it seems like a very good idea to me.

zuphilip commented 8 years ago

Do you mean this discussion about the Blacklight discovery system: https://github.com/zotero/translators/issues/893#issuecomment-107151177 ?

adam3smith commented 8 years ago

yup, thanks.

zuphilip commented 6 years ago

Here are some examples from Blacklight catalogs:

dstillman commented 5 years ago

Another example, where MODS is available:

https://purl.stanford.edu/fv751yt5934

<link rel="alternate" title="MODS XML" type="application/xml" href="https://purl.stanford.edu/fv751yt5934.mods" />

For a case like this I think we'd just want to look for 'mods' in the title and href when the type is application/xml or text/xml.
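That heuristic could look roughly like the following Python sketch (illustrative only; `looks_like_mods` and the dict-of-attributes input are hypothetical). It also accepts an explicit application/mods+xml type, as suggested earlier in the thread.

```python
GENERIC_XML = {"application/xml", "text/xml"}


def looks_like_mods(link):
    """Guess whether a <link rel="alternate"> points at MODS.

    `link` is a dict of the tag's attributes (type, title, href).
    """
    mime = link.get("type", "").strip().lower()
    if mime == "application/mods+xml":
        return True  # explicit type, no guessing needed
    if mime not in GENERIC_XML:
        return False  # only fall back to hints for vague XML types
    hint = (link.get("title", "") + " " + link.get("href", "")).lower()
    return "mods" in hint
```

This would match the Stanford link above, and also the DiVA "MODS Representation" link from the first comment, while skipping the MARC-XML one.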