zotero / translators

Zotero Translators
http://www.zotero.org/support/dev/translators

Embedded Metadata/RDF not getting Dublin Core DC/DCTERMS tags from WHO website #1823

Closed mvolz closed 2 years ago

mvolz commented 5 years ago

https://github.com/zotero/translation-server/issues/82

I'm using translation-server and I'm getting no response from the translator:

curl -d 'http://apps.who.int/iris/handle/10665/70863' -H 'Content-Type: text/plain' http://127.0.0.1:1969/web

returns []

but there are a lot of metadata tags in the site:


<title>Consensus document on the epidemiology of severe acute respiratory syndrome (&lrm;SARS)&lrm;</title>
<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<meta name="DC.creator" content="World Health Organization" xml:lang="en" />
<meta name="DCTERMS.dateAccepted" content="2012-06-17T01:51:09Z" scheme="DCTERMS.W3CDTF" />
<meta name="DCTERMS.available" content="2012-06-17T01:51:09Z" scheme="DCTERMS.W3CDTF" />
<meta name="DCTERMS.created" content="2003" xml:lang="en" scheme="DCTERMS.W3CDTF" />
<meta name="DCTERMS.issued" content="2003" xml:lang="en" scheme="DCTERMS.W3CDTF" />
<meta name="DC.identifier" content="WHO/CDS/CSR/GAR/2003.11" xml:lang="en" />
<meta name="DC.identifier" content="http://www.who.int/iris/handle/10665/70863" scheme="DCTERMS.URI" />
<meta name="DC.description" content="WHO/CDS/CSR/GAR/2003.11" xml:lang="en" />
<meta name="DC.description" content="46 p." xml:lang="en" />
<meta name="DC.language" content="en" xml:lang="en" scheme="DCTERMS.RFC1766" />
<meta name="DC.publisher" content="Geneva : World Health Organization" xml:lang="en" />
<meta name="DC.subject" content="Disease outbreaks" xml:lang="en" scheme="DCTERMS.MESH" />
<meta name="DC.subject" content="Severe acute respiratory syndrome" xml:lang="en" scheme="DCTERMS.MESH" />
<meta name="DC.subject" content="Epidemiologic surveillance" xml:lang="en" scheme="DCTERMS.MESH" />
<meta name="DC.subject" content="Communicable Diseases and their Control" xml:lang="en" />
<meta name="DC.title" content="Consensus document on the epidemiology of severe acute respiratory syndrome (‎SARS)‎" xml:lang="en" />
<meta content="2003" name="citation_publication_date">
<meta content="Consensus document on the epidemiology of severe acute respiratory syndrome (SARS)" name="citation_title">
<meta content="WHO/CDS/CSR/GAR/2003.11" name="citation_technical_report_number">
<meta content="Geneva : World Health Organization" name="citation_publisher">
<meta content="en" name="citation_language">
<meta content="World Health Organization" name="citation_author">
<meta content="https://apps.who.int/iris/bitstream/10665/70863/1/WHO_CDS_CSR_GAR_2003.11_eng.pdf" name="citation_pdf_url">
<meta content="2003" name="citation_date">
<meta content="https://apps.who.int/iris/handle/10665/70863" name="citation_abstract_html_url">
<meta content="2012-06-17T01:51:09Z" name="citation_online_date">
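For reference, the DC and DCTERMS tags listed above can be pulled out with a few lines of standard-library Python. This is only a sketch to show that the metadata is machine-readable; it is not how the Embedded Metadata translator works internally. `DublinCoreCollector` and the `SAMPLE` snippet are hypothetical names, with `SAMPLE` built from a handful of the tags above.

```python
from html.parser import HTMLParser


class DublinCoreCollector(HTMLParser):
    """Collect <meta name="DC.*"> and <meta name="DCTERMS.*"> values."""

    def __init__(self):
        super().__init__()
        # Maps a tag name to all of its values, since tags such as
        # DC.subject appear more than once on the page.
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        if name.upper().startswith(("DC.", "DCTERMS.")) and content is not None:
            self.fields.setdefault(name, []).append(content)


# A few of the tags from the WHO page (assumed sample, not the full head).
SAMPLE = """
<meta name="DC.creator" content="World Health Organization" xml:lang="en" />
<meta name="DCTERMS.issued" content="2003" xml:lang="en" scheme="DCTERMS.W3CDTF" />
<meta name="DC.subject" content="Disease outbreaks" xml:lang="en" scheme="DCTERMS.MESH" />
<meta name="DC.subject" content="Epidemiologic surveillance" xml:lang="en" scheme="DCTERMS.MESH" />
<meta content="2003" name="citation_date">
"""

collector = DublinCoreCollector()
collector.feed(SAMPLE)
for name, values in collector.fields.items():
    print(name, values)
```

The `citation_*` tags are skipped on purpose here; only the Dublin Core names are collected.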

This doesn't seem to work in the client either:

(3)(+0048252): HTTP GET https://apps.who.int/iris/handle/10665/70863

(3)(+0000622): Translators: Looking for translators for https://apps.who.int/iris/handle/10665/70863

(4)(+0000000): Translate: Binding sandbox to https://apps.who.int/iris/handle/10665/70863

(4)(+0000001): Translate: Parsing code for unAPI (e7e01cac-1e37-4da6-b078-a0e8343b0e98, 2018-05-12 15:58:17)

(4)(+0000001): Translate: Parsing code for COinS (05d07af9-105a-4572-99f6-a8e231c0daef, 2015-06-04 03:25:10)

(4)(+0000002): Translate: Parsing code for Embedded Metadata (951c027d-74ac-47d4-a107-9c3069ab7b48, 2018-11-01 19:46:46)

(3)(+0000002): Translate: Embedded Metadata: found 31 meta tags.

(3)(+0000000): Translate: Creating translate instance of type import in sandbox

(4)(+0000000): Translate: Binding sandbox to https://apps.who.int/iris/handle/10665/70863

(4)(+0000000): Translate: Parsing code for RDF (5e3ad958-ac79-463d-812b-a86a9235c28f, 2018-10-07 16:32:26)

(3)(+0000002): Translate: Initializing RDF data store

(4)(+0000002): Translate: Parsing code for DOI (c159dcfe-8a53-4301-a499-30f6549c340d, 2016-11-05 10:57:01)

(3)(+0000006): Translate: All translator detect calls and RPC calls complete:

(3)(+0000000):  COinS: 310

(3)(+0000000):  Embedded Metadata: 320

(5)(+0000000): Translate: Running handler 0 for translators

(5)(+0000000): Translate: Running handler 1 for translators

(4)(+0000000): Translate: Parsing code for COinS (05d07af9-105a-4572-99f6-a8e231c0daef, 2015-06-04 03:25:10)

(3)(+0000001): Translate: Beginning translation with COinS

(3)(+0000003): [
    "0": {
        "itemType": "webpage"
        "creators": []
        "notes": []
        "tags": []
        "seeAlso": []
        "attachments": []
        "repository": false
        "url": "http://www.who.int/iris/handle/10665/186680"
        "accessDate": ""
        "contextObject": "ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&rft_id=WHO%2FMERS%2FClinical%2F15.1&rft_id=http%3A%2F%2Fwww.who.int%2Firis%2Fhandle%2F10665%2F186680&rfr_id=info%3Asid%2Fdspace.org%3Arepository&rft.relation=10665%2F178529"
        "complete": function() {...}
    }
    "1": {
        "itemType": "webpage"
        "creators": []
        "notes": []
        "tags": []
        "seeAlso": []
        "attachments": []
        "repository": false
        "url": "http://www.who.int/iris/handle/10665/249473"
        "accessDate": ""
        "contextObject": "ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&rft_id=WPR%2FRC54.R7&rft_id=http%3A%2F%2Firis.wpro.who.int%2Fhandle%2F10665.1%2F9649&rft_id=http%3A%2F%2Fwww.who.int%2Firis%2Fhandle%2F10665%2F249473&rfr_id=info%3Asid%2Fdspace.org%3Arepository&rft.relation=10665.1%2F11505"
        "complete": function() {...}
    }
    "2": {
        "itemType": "webpage"
        "creators": []
        "notes": []
        "tags": []
        "seeAlso": []
        "attachments": []
        "repository": false
        "url": "http://www.who.int/iris/handle/10665/249482"
        "accessDate": ""
        "contextObject": "ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc&rft_id=WPR%2FRC55.R5&rft_id=http%3A%2F%2Firis.wpro.who.int%2Fhandle%2F10665.1%2F9698&rft_id=http%3A%2F%2Fwww.who.int%2Firis%2Fhandle%2F10665%2F249482&rfr_id=info%3Asid%2Fdspace.org%3Arepository&rft.relation=10665.1%2F11464"
        "complete": function() {...}
    }
]

(3)(+0000000): Translate: Looking up contextObject

(3)(+0000000): Translate: Creating translate instance of type search in sandbox

(4)(+0000000): Translate: Binding sandbox to https://apps.who.int/iris/handle/10665/70863

(4)(+0000001): Translate: Parsing code for EIDR (79c3d292-0afc-42a1-bd86-7e706fc35aa5, 2017-06-03 11:41:00)

(4)(+0000000): Translate: Parsing code for Crossref (11645bd1-0420-45c1-badb-53fb41eeb753, 2019-01-07 08:14:17)

(4)(+0000001): Translate: Parsing code for Crossref REST (0a61e167-de9a-4f93-a68a-628b48855909, 2018-06-16 12:00:00)

(4)(+0000000): Translate: Parsing code for Library of Congress ISBN (c070e5a2-4bfd-44bb-9b3c-4be20c50d0d9, 2018-04-13 13:41:00)

(4)(+0000000): Translate: Parsing code for Gemeinsamer Bibliotheksverbund ISBN (de0eef58-cb39-4410-ada0-6b39f43383f9, 2018-04-13 13:41:00)

(4)(+0000001): Translate: Parsing code for DataCite (9f1fb86b-92c8-4db7-b8ee-0b481d456428, 2018-07-09 07:15:00)

(4)(+0000000): Translate: Parsing code for Open WorldCat (c73a4a8c-3ef1-4ec8-8229-7531ee384cc4, 2017-03-19 23:26:57)

(4)(+0000001): Translate: Parsing code for PubMed (3d0231ce-fd4b-478c-b1d3-840389e5b68c, 2019-01-07 07:52:09)

(4)(+0000001): Translate: Parsing code for arXiv.org (ecddda2e-4fc6-4aea-9f17-ef3b56d7377a, 2018-04-17 20:00:00)

(4)(+0000000): Translate: Parsing code for Lulu (9a0ecbda-c0e9-4a19-84a9-fc8e7c845afa, 2016-11-04 21:18:44)

(4)(+0000001): Translate: Parsing code for mEDRA (d9b57cd5-5a9c-4946-8616-3bdf8edfcbb5, 2014-05-26 03:50:55)

(4)(+0000000): Translate: Parsing code for Airiti (5f0ca39b-898a-4b1e-b98d-8cd0d6ce9801, 2018-04-17 21:16:52)

(3)(+0000001): Translate: All translator detect calls and RPC calls complete:

(3)(+0000000):  No suitable translators found

(5)(+0000000): Translate: Running handler 0 for translators

(3)(+0000000): Translate: Looking up contextObject

(3)(+0000000): Translate: Creating translate instance of type search in sandbox

(5)(+0000000): Translate: Running handler 1 for translators

(5)(+0000000): Translate: Running handler 2 for translators

(4)(+0000000): Translate: Binding sandbox to https://apps.who.int/iris/handle/10665/70863

(4)(+0000001): Translate: Parsing code for EIDR (79c3d292-0afc-42a1-bd86-7e706fc35aa5, 2017-06-03 11:41:00)

(4)(+0000001): Translate: Parsing code for Crossref (11645bd1-0420-45c1-badb-53fb41eeb753, 2019-01-07 08:14:17)

(4)(+0000000): Translate: Parsing code for Crossref REST (0a61e167-de9a-4f93-a68a-628b48855909, 2018-06-16 12:00:00)

(4)(+0000001): Translate: Parsing code for Library of Congress ISBN (c070e5a2-4bfd-44bb-9b3c-4be20c50d0d9, 2018-04-13 13:41:00)

(4)(+0000000): Translate: Parsing code for Gemeinsamer Bibliotheksverbund ISBN (de0eef58-cb39-4410-ada0-6b39f43383f9, 2018-04-13 13:41:00)

(4)(+0000000): Translate: Parsing code for DataCite (9f1fb86b-92c8-4db7-b8ee-0b481d456428, 2018-07-09 07:15:00)

(4)(+0000001): Translate: Parsing code for Open WorldCat (c73a4a8c-3ef1-4ec8-8229-7531ee384cc4, 2017-03-19 23:26:57)

(4)(+0000000): Translate: Parsing code for PubMed (3d0231ce-fd4b-478c-b1d3-840389e5b68c, 2019-01-07 07:52:09)

(4)(+0000001): Translate: Parsing code for arXiv.org (ecddda2e-4fc6-4aea-9f17-ef3b56d7377a, 2018-04-17 20:00:00)

(4)(+0000001): Translate: Parsing code for Lulu (9a0ecbda-c0e9-4a19-84a9-fc8e7c845afa, 2016-11-04 21:18:44)

(4)(+0000000): Translate: Parsing code for mEDRA (d9b57cd5-5a9c-4946-8616-3bdf8edfcbb5, 2014-05-26 03:50:55)

(4)(+0000001): Translate: Parsing code for Airiti (5f0ca39b-898a-4b1e-b98d-8cd0d6ce9801, 2018-04-17 21:16:52)

(3)(+0000001): Translate: All translator detect calls and RPC calls complete:

(3)(+0000000):  No suitable translators found

(5)(+0000000): Translate: Running handler 0 for translators

(3)(+0000000): Translate: Looking up contextObject

(3)(+0000000): Translate: Creating translate instance of type search in sandbox

(5)(+0000000): Translate: Running handler 1 for translators

(5)(+0000000): Translate: Running handler 2 for translators

(4)(+0000000): Translate: Binding sandbox to https://apps.who.int/iris/handle/10665/70863

(4)(+0000000): Translate: Parsing code for EIDR (79c3d292-0afc-42a1-bd86-7e706fc35aa5, 2017-06-03 11:41:00)

(4)(+0000000): Translate: Parsing code for Crossref (11645bd1-0420-45c1-badb-53fb41eeb753, 2019-01-07 08:14:17)

(4)(+0000001): Translate: Parsing code for Crossref REST (0a61e167-de9a-4f93-a68a-628b48855909, 2018-06-16 12:00:00)

(4)(+0000000): Translate: Parsing code for Library of Congress ISBN (c070e5a2-4bfd-44bb-9b3c-4be20c50d0d9, 2018-04-13 13:41:00)

(4)(+0000000): Translate: Parsing code for Gemeinsamer Bibliotheksverbund ISBN (de0eef58-cb39-4410-ada0-6b39f43383f9, 2018-04-13 13:41:00)

(4)(+0000000): Translate: Parsing code for DataCite (9f1fb86b-92c8-4db7-b8ee-0b481d456428, 2018-07-09 07:15:00)

(4)(+0000000): Translate: Parsing code for Open WorldCat (c73a4a8c-3ef1-4ec8-8229-7531ee384cc4, 2017-03-19 23:26:57)

(4)(+0000000): Translate: Parsing code for PubMed (3d0231ce-fd4b-478c-b1d3-840389e5b68c, 2019-01-07 07:52:09)

(4)(+0000000): Translate: Parsing code for arXiv.org (ecddda2e-4fc6-4aea-9f17-ef3b56d7377a, 2018-04-17 20:00:00)

(4)(+0000000): Translate: Parsing code for Lulu (9a0ecbda-c0e9-4a19-84a9-fc8e7c845afa, 2016-11-04 21:18:44)

(4)(+0000000): Translate: Parsing code for mEDRA (d9b57cd5-5a9c-4946-8616-3bdf8edfcbb5, 2014-05-26 03:50:55)

(4)(+0000000): Translate: Parsing code for Airiti (5f0ca39b-898a-4b1e-b98d-8cd0d6ce9801, 2018-04-17 21:16:52)

(3)(+0000001): Translate: All translator detect calls and RPC calls complete:

(3)(+0000000):  No suitable translators found

(5)(+0000000): Translate: Running handler 0 for translators

(5)(+0000000): Translate: Running handler 1 for translators

(5)(+0000000): Translate: Running handler 2 for translators

(3)(+0000000): Translate: Translation successful

(5)(+0000000): Translate: Running handler 0 for done

adam3smith commented 5 years ago

If you look at the log, Zotero isn't using the Embedded Metadata translator for the import; it uses COinS instead. More often than not, that's the better choice, although it very much isn't here. There is ongoing work to integrate the various generic translators, but until that's finished, this will occasionally happen.
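The selection step visible in the log (`COinS: 310`, `Embedded Metadata: 320`) can be illustrated with a small sketch. It assumes, as the log suggests, that the detected translator with the lowest priority number runs first. `pick_translator` is a hypothetical helper, not part of Zotero.

```python
def pick_translator(detected):
    """Return the name of the detected translator with the lowest priority number.

    `detected` maps translator names to their priority values.
    """
    if not detected:
        return None
    return min(detected, key=detected.get)


# Priorities as reported in the debug log above.
detected = {"COinS": 310, "Embedded Metadata": 320}
print(pick_translator(detected))
```

Under that assumption COinS is chosen even though Embedded Metadata would have produced the richer record for this page.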

dstillman commented 5 years ago

This is a weird case, though — because of how the COinS translator works, it doesn't even fall back to webpage saving, which is definitely not ideal. It seems like we might be able to throw an error when newItems.length is 0 to trigger the fallback. @mrtcode, should we do that here (if possible) or just wait for the combined translator?
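The fallback idea can be sketched as follows. This is an illustration in Python rather than translator code: `run_with_fallback`, `NoItemsError`, and the two stand-in translator functions are hypothetical, and the actual change would have to be made in the COinS translator itself.

```python
class NoItemsError(Exception):
    """Raised when a translator finishes without producing any items."""


def run_with_fallback(translators, url):
    """Try each (name, translate) pair in order.

    An empty result is turned into an error so that the next
    translator in the list gets a chance to run.
    """
    for name, translate in translators:
        try:
            items = translate(url)
            if len(items) == 0:
                raise NoItemsError(f"{name} returned no items")
            return name, items
        except NoItemsError:
            continue  # fall through to the next translator
    return None, []


def coins_stub(url):
    # Stand-in that mimics the empty result seen for the WHO page.
    return []


def webpage_stub(url):
    # Stand-in for plain webpage saving.
    return [{"itemType": "webpage", "url": url}]


name, items = run_with_fallback(
    [("COinS", coins_stub), ("Webpage", webpage_stub)],
    "https://apps.who.int/iris/handle/10665/70863",
)
print(name, items)
```

Without the empty-result check, the first translator's `[]` would be returned as a successful result, which matches the `[]` response reported at the top of this issue.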

mrtcode commented 5 years ago

Yeah, we could throw an error in the COinS translator. We just have to make sure this wouldn't have any unintended consequences for other translators that embed it. But this isn't the only problem with the COinS translator: it also blocks the Embedded Metadata (EM) translator by returning poor metadata, which is more common than returning no items.

And generally, translators that are targeted to many websites should be very carefully developed.
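To illustrate what poor COinS metadata looks like, the first `contextObject` from the debug log above can be decoded with Python's standard library. This is only a sketch: `descriptive_fields` is a hypothetical helper, and treating the `rft.`-prefixed keys as the descriptive fields is a simplification of the OpenURL format.

```python
from urllib.parse import parse_qs

# First contextObject from the debug log, split across lines for readability.
CONTEXT_OBJECT = (
    "ctx_ver=Z39.88-2004"
    "&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc"
    "&rft_id=WHO%2FMERS%2FClinical%2F15.1"
    "&rft_id=http%3A%2F%2Fwww.who.int%2Firis%2Fhandle%2F10665%2F186680"
    "&rfr_id=info%3Asid%2Fdspace.org%3Arepository"
    "&rft.relation=10665%2F178529"
)


def descriptive_fields(context_object):
    """Return only the rft.* keys, i.e. the fields describing the referent."""
    parsed = parse_qs(context_object)
    return {key: values for key, values in parsed.items() if key.startswith("rft.")}


fields = descriptive_fields(CONTEXT_OBJECT)
print(fields)
print(parse_qs(CONTEXT_OBJECT)["rft_id"])
```

Only a relation identifier comes back: there is no title, creator, or date, and the `rft_id` values point at a different handle (10665/186680) than the page being saved (10665/70863).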

adam3smith commented 2 years ago

I think we can close this specific issue: there's a dedicated WHO translator that works on the above-linked page, and there are separate tickets (most notably #1092) for a comprehensive generic translator.