zotero / translation-server

A Node.js-based server to run Zotero translators

Open WorldCat ISBN search has extra authors #57

Closed dhimmel closed 5 years ago

dhimmel commented 5 years ago

I queried metadata for my PhD thesis using its ISBN:

curl --silent \
  --data 9781339919881 \
  --header 'Content-Type: text/plain' \
  http://127.0.0.1:1969/search | jq

This returned:

[
  {
    "key": "AFYL2BGB",
    "version": 0,
    "itemType": "book",
    "creators": [
      {
        "firstName": "Daniel S",
        "lastName": "Himmelstein",
        "creatorType": "author"
      },
      {
        "firstName": "San Francisco",
        "lastName": "University of California",
        "creatorType": "author"
      },
      {
        "name": "Biological and Medical Informatics",
        "creatorType": "author"
      },
      {
        "firstName": "San Francisco",
        "lastName": "University of California",
        "creatorType": "author"
      }
    ],
    "tags": [],
    "libraryCatalog": "Open WorldCat",
    "language": "English",
    "title": "The hetnet awakens: understanding complex diseases through data integration and open science.",
    "date": "2016",
    "ISBN": "9781339919881",
    "abstractNote": "Human disease is complex. However, the explosion of biomedical data is providing new opportunities to improve our understanding. My dissertation focused on how to harness the biodata revolution. Broadly, I addressed three questions: how to integrate data, how to extract insights from data, and how to make science more open. To integrate data, we pioneered the hetnet---a network with multiple node and relationship types. After several preludes, we released Hetionet v1.0, which contains 2,250,197 relationships of 24 types. Hetionet encodes the collective knowledge produced by millions of studies over the last half century. To extract insights from data, we developed a machine learning approach for hetnets. In order to predict the probability that an unknown relationship exists, our algorithm identifies influential network patterns. We used the approach to prioritize disease---gene associations and drug repurposing opportunities. By evaluating our predictions on withheld knowledge, we demonstrated the systematic success of our method. After encountering friction that interfered with data integration and rapid communication, I began looking at how to make science more open. The quest led me to explore realtime open notebook science and expose publishing delays at journals as well as the problematic licensing of publicly-funded research data.",
    "extra": "OCLC: 970819555",
    "shortTitle": "The hetnet awakens"
  }
]

Notice the three extra creator objects that have a creatorType of author:

      {
        "firstName": "San Francisco",
        "lastName": "University of California",
        "creatorType": "author"
      },
      {
        "name": "Biological and Medical Informatics",
        "creatorType": "author"
      },
      {
        "firstName": "San Francisco",
        "lastName": "University of California",
        "creatorType": "author"
      }

Is this an upstream issue or are these attributes misinterpreted by translation-server?

dstillman commented 5 years ago

You can debug this yourself using translation-server's debug output:

(3)(+0000000): HTTP GET http://www.worldcat.org/oclc/970819555?client=worldcat.org-detailed_record&page=endnotealt

(3)(+0000000): Translate: Could not retrieve any OCLC IDs

(3)(+0000263): Translate: Importing corrected RIS:

TY  - ELEC
DB  - /z-wcorg/
DP  - http://worldcat.org
ID  - 970819555
LA  - English
T1  - The hetnet awakens: understanding complex diseases through data integration and open science.
AU  - Himmelstein, Daniel S
AU  - University of California, San Francisco
AU  - Biological and Medical Informatics
AU  - University of California, San Francisco
Y1  - 2016///
SN  - 9781339919881 1339919885
AB  - Human disease is complex. However, the explosion of biomedical data is providing new opportunities to improve our understanding. My dissertation focused on how to harness the biodata revolution. Broadly, I addressed three questions: how to integrate data, how to extract insights from data, and how to make science more open. To integrate data, we pioneered the hetnet---a network with multiple node and relationship types. After several preludes, we released Hetionet v1.0, which contains 2,250,197 relationships of 24 types. Hetionet encodes the collective knowledge produced by millions of studies over the last half century. To extract insights from data, we developed a machine learning approach for hetnets. In order to predict the probability that an unknown relationship exists, our algorithm identifies influential network patterns. We used the approach to prioritize disease---gene associations and drug repurposing opportunities. By evaluating our predictions on withheld knowledge, we demonstrated the systematic success of our method. After encountering friction that interfered with data integration and rapid communication, I began looking at how to make science more open. The quest led me to explore realtime open notebook science and expose publishing delays at journals as well as the problematic licensing of publicly-funded research data.
ER  -

As you can see, Zotero is using WorldCat for this, and WorldCat is putting these values in the RIS. You can also see these authors on the WorldCat page and in the RIS you download manually from there, as well as in the APA citation.

So you'd have to report that to WorldCat. (If you do, you shouldn't mention Zotero or translation-server, since this is a problem on their website.)
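To see how the AU lines in that RIS turn into the creators in the JSON response, here is a minimal Python sketch. It assumes the conversion simply splits each AU value on its first comma into last name and first name, and falls back to a single-field name when there is no comma. The function name `ris_author_to_creator` is hypothetical and this is not the actual code of Zotero's RIS translator; it only reproduces the output observed in this issue.

```python
def ris_author_to_creator(value, creator_type="author"):
    """Turn one RIS AU value into a Zotero-style creator dict (illustrative only)."""
    value = value.strip()
    if "," in value:
        # Assumption: everything before the first comma is treated as the last name.
        last, first = value.split(",", 1)
        return {
            "firstName": first.strip(),
            "lastName": last.strip(),
            "creatorType": creator_type,
        }
    # No comma: keep the whole string as a single-field name.
    return {"name": value, "creatorType": creator_type}


# AU values copied from the RIS in the debug output above.
au_values = [
    "Himmelstein, Daniel S",
    "University of California, San Francisco",
    "Biological and Medical Informatics",
    "University of California, San Francisco",
]

creators = [ris_author_to_creator(v) for v in au_values]
for creator in creators:
    print(creator)
```

Under that assumption, "University of California, San Francisco" is read as a person with last name "University of California" and first name "San Francisco", which is what the /search response contains.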

Note that we can't do translator troubleshooting in this repo, so if you encounter other issues you'll need to either debug them yourself using the translation-server output or check whether they occur in Zotero and, if so, report them in the Zotero Forums or zotero/translators. Only problems that you've confirmed are specific to translation-server should be posted here.

dhimmel commented 5 years ago

You can debug this yourself using translation-server's debug output

Ah good to know. For the record, this is the output from whatever terminal ran npm start.

So you'd have to report that to WorldCat.

Reported by email:

[screenshot from 2018-11-19 09-12-19]

Note that we can't do translator troubleshooting in this repo

Didn't realize this repo didn't contain the actual translators. Thanks for pointing me to zotero/translators. If I remember correctly, GitHub is introducing an issue relocation option, so if this feature is available to the zotero organization, feel free to move this issue to zotero/translators.

As an aside, if I have issues on how metadata gets exported to CSL, where should those go?

dstillman commented 5 years ago

As an aside, if I have issues on how metadata gets exported to CSL, where should those go?

It depends exactly what you mean, but generally speaking, you should always try to check the problem in Zotero itself and then, if it shows up there, post to the Zotero Forums. If there's a better place, we can point you to it from there, but that's where the most people follow and respond to questions and where you'll have the best chance of getting an answer from the right person.

dhimmel commented 5 years ago

you should always try to check the problem in Zotero itself and then, if it shows up there, post to the Zotero Forums

Got it.

WorldCat issue

Bryan Baldus, a Consulting Database Specialist at OCLC, responded to my email about the author metadata. I'm quoting (with permission) the relevant parts of our email exchange below. Thanks Bryan for the prompt and detailed replies. TLDR: it is not really possible at this time for OCLC to differentiate which entities are authors based on how this information is encoded.


From me

I am writing to report an issue with the metadata for ISBN 9781339919881 in WorldCat (permalink at https://www.worldcat.org/oclc/970819555). This is the record for my PhD thesis. It has come to my attention that the author metadata includes extra authors corresponding to my PhD institution and program. Specifically, here is the metadata exported in RIS format from WorldCat:

AU - Himmelstein, Daniel S.
AU - University of California, San Francisco.
AU - Biological and Medical Informatics.
AU - University of California, San Francisco.

Notice authors 2 through 4, which I don't think should be considered authors. I am guessing these were inserted by an automated process that may have an issue?

Is it possible to fix the authors for my thesis (and possibly other records affected by this issue)?


From Bryan

Looking at the bibliographic record you have cited, I see 2 "added entries" (in addition to your "main entry"):

University of California, San Francisco.$bBiological and Medical Informatics.
University of California, San Francisco.$tDissertations.

These are coded as "710" fields, with no further indication of the role UCSF, or UCSF Biological and Medical Informatics, had in your thesis. So even though some software may indicate that these are "authors", as seen in the WorldCat.org display or RIS format, that is more of a quirk (or overly assumptive translation of the field definition, calling all persons and corporate bodies associated with a record "authors" rather than just listing them as simply related in some way to the work when no specific relationship is available) of the software, rather than an issue with how the records are coded.

Presumably, when UCSF cataloged your thesis, they included a 710 for "University of California, San Francisco.$bBiological and Medical Informatics" to be able to bring together all dissertations by graduates from that degree program. Similarly, they likely use the name-title 710, "University of California, San Francisco.$tDissertations" to bring together all dissertations by UCSF graduates. Both appear to be reasonable headings for theses and dissertations, so they can't be removed from the record.

So, unfortunately, there's not much that can be done to improve the way these display at the moment.


From me

I primarily care about the RIS export for my thesis, but also for all other works in WorldCat. Do you think the RIS translator should omit 710-fields from the author list, or would this cause other problems?

Another note: the RIS seems to contain 3 "added entries" whereas the source metadata only contains two. Is that an issue with the RIS translator?


From Bryan

For your "Another note": yes, that appears to be an issue with the RIS translator. It seems to be breaking the field into separate "AU" entries for each part of the corporate body (UCSF and BMI), not accounting for the 2nd field being a name-title (and thus it should be treated more as a title than as an author), and then not deduplicating the resulting "AU" strings, so UCSF appears twice rather than once (if it were treated as an author of "Dissertations"; with BMI, it should be treated as a whole, "UCSF. BMI", not as 2 separate entities).
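The behavior Bryan describes can be sketched in Python. This is only an illustration under stated assumptions: the two 710 fields are represented as hypothetical dicts keyed by subfield code, and neither function is OCLC's actual RIS export code. `naive_authors` mimics the observed output (one AU per $a and $b, name-title ignored, no deduplication), while `stricter_authors` shows one possible alternative that keeps "UCSF. BMI" whole and skips name-title fields.

```python
# Hypothetical representation of the two 710 fields quoted earlier in the exchange.
fields_710 = [
    {"a": "University of California, San Francisco.",
     "b": "Biological and Medical Informatics."},
    {"a": "University of California, San Francisco.",
     "t": "Dissertations."},
]


def naive_authors(fields):
    """Emit one author per $a and $b subfield, ignoring $t and not deduplicating."""
    authors = []
    for field in fields:
        for code in ("a", "b"):
            if code in field:
                authors.append(field[code].rstrip("."))
    return authors


def stricter_authors(fields):
    """Skip name-title fields ($t present), join $a and $b, and deduplicate."""
    authors = []
    for field in fields:
        if "t" in field:
            continue  # name-title entry: closer to a title than to an author
        name = " ".join(field[c] for c in ("a", "b") if c in field).rstrip(".")
        if name not in authors:
            authors.append(name)
    return authors


print(naive_authors(fields_710))
print(stricter_authors(fields_710))
```

The naive version yields the same three extra names, in the same order, as the AU lines in the WorldCat RIS. Skipping name-title fields entirely is a design choice made here for brevity; treating UCSF as the author of "Dissertations" would be another option.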

As for the first question, while in this case the 710s are not acting as authors, that is not always the case for all works. 7XX fields (700 for personal names, 710 for corporate bodies, 711 for conferences) are used for any additional entities related to a work other than the main entity responsible for the work. So for any work with 2 authors, one would be given main entry (coded as a 100 (or 110 or 111, following the pattern seen above for 7XX)), and the other would be given as an added entry, because according to current encoding standards, each record can only have 1 main entry. So, in that case, "AU" would be the appropriate relationship for both authors even though one is listed as a 100 and the other as a 700.

So, the main problem is that 700 stores data with a variety of different relationships to the work being described. In more recent records, catalogers have begun adding subfields specifying the relationship between the person/etc. and the work (e.g., "Smith, John, editor"; "Jones, Sally, author"; "MacIntosh, James. aut" (a coded way of saying ", author")). That may one day assist with the translation process into RIS format, or other display formats, at least when the field has such relationship codes, terms, or designators. Until then, though, there's no easy way for a computer to know whether a 700 should be called an "author", an "editor", or any number of other possible things.

So, if the RIS translator left out 7XX fields from the author list, it would be dropping relevant co-authors in some cases, but in other cases would be dropping potentially less relevant secondary entities related to the work. In addition, since editors (such as those who put together a compilation of articles) currently can't get main entry, they will all be listed in 700s. So, if 7XX were omitted from the RIS translation, the citation would exclude any editors (vs. calling them "AU" in the current translation). The same would be true for translators, performers, directors, producers, etc.
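As a rough illustration of how relationship designators could help a translator when they are present, here is a minimal Python sketch. It assumes each 7XX field is available as a dict with the name in "a" and, optionally, a relator term in "e" or a relator code in "4". The mapping table, the choice of RIS tags, and the function name `ris_tag_for` are assumptions made for this example, not part of any existing translator.

```python
# Assumed mapping from relator terms/codes to RIS tags (illustrative, not exhaustive).
RELATOR_TO_RIS = {
    "author": "AU",
    "aut": "AU",
    "editor": "A2",
    "edt": "A2",
}


def ris_tag_for(field):
    """Pick an RIS tag for one 7XX field, or None if its role is not mapped."""
    relator = (field.get("e") or field.get("4") or "").strip(" .,").lower()
    if not relator:
        # No relationship recorded: fall back to "AU", as the current export does.
        return "AU"
    return RELATOR_TO_RIS.get(relator)


# Names taken from Bryan's examples, plus one field without any designator.
fields = [
    {"a": "Smith, John", "e": "editor"},
    {"a": "Jones, Sally", "e": "author"},
    {"a": "MacIntosh, James", "4": "aut"},
    {"a": "University of California, San Francisco"},
]

for field in fields:
    print(ris_tag_for(field), "-", field["a"])
```

Fields without a designator still come out as "AU", which is exactly the ambiguity described above: the sketch only improves records where catalogers have recorded the relationship.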