zotero / zotero-connectors

Chrome, Firefox, Edge, and Safari extensions for Zotero
https://www.zotero.org/download/connectors

MV3: Noscript tags with img breaking DOMParser in translator sandbox #477

Open adomasven opened 4 weeks ago

adomasven commented 4 weeks ago

Discovered in: https://github.com/zotero/translators/issues/3311#issuecomment-2148755216
Problem page: https://journals.ametsoc.org/view/journals/phoc/53/1/JPO-D-22-0001.1.xml

The Embedded Metadata (EM) translator is not detected because no meta tags are found. I've discovered that:

document.head.children.length // 191
new DOMParser().parseFromString(document.head.outerHTML, 'text/html').head.children.length // 20

This is because the 21st tag is:

<noscript id="page_tag"><img alt="" vspace="0" hspace="0" border="0" width="1" height="1" language="//pftag.scholarlyiq.com/siqpagetag.gif?js=0"/></noscript>

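Apparently, an img inside a noscript before the body is invalid when the markup is parsed with scripting disabled, which is how DOMParser parses. The parser treats the head element as terminated at that point and begins the body element, so every tag after the noscript ends up outside the head. It looks as though the page is deliberately keeping crawlers and similar tools from reading the meta tags in the head element, or something along those lines.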

Anyway, as a proposed solution, I think we should strip all <noscript> tags from <head> in MV3 before parsing.
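
A minimal sketch of what that stripping could look like, working on the serialized head string so the noscript contents never reach the parser. The helper name `stripNoscript` and the sample markup are hypothetical and not taken from the connector code, and a regex is only a rough stand-in for whatever the actual fix uses:

```javascript
// Hypothetical helper, not part of the zotero-connectors codebase.
// Removes <noscript>...</noscript> blocks from serialized HTML so that a
// scripting-disabled parser such as DOMParser never sees their contents.
function stripNoscript(html) {
  // [\s\S]*? matches lazily across newlines; the "i" flag also covers <NOSCRIPT>.
  return html.replace(/<noscript\b[^>]*>[\s\S]*?<\/noscript\s*>/gi, '');
}

// Made-up sample resembling the problem page: a meta tag on either side of
// a noscript block that contains an img.
const headHTML =
  '<head><meta name="citation_title" content="A">' +
  '<noscript id="page_tag"><img alt="" width="1" height="1"/></noscript>' +
  '<meta name="citation_doi" content="B"></head>';

const cleaned = stripNoscript(headHTML);
```

In the translator sandbox, the cleaned string would then be what gets passed to `new DOMParser().parseFromString(..., 'text/html')` in place of the raw `document.head.outerHTML`.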

dstillman commented 3 weeks ago

I'm still not getting EM on this page. The main translator is fixed, so it's less of an issue here, but this is likely preventing EM from working elsewhere.

(4)(+0000028): Translate: Binding sandbox to https://journals.ametsoc.org/view/journals/phoc/53/1/JPO-D-22-0001.1.xml
(4)(+0000003): Translate: Parsing code for PubFactory Journals (8d1fb775-df6d-4069-8830-1dfe8e8387dd, 2024-06-04 18:20:00)
(4)(+0000014): Translate: Parsing code for unAPI (e7e01cac-1e37-4da6-b078-a0e8343b0e98, 2019-06-10 23:11:21)
(4)(+0000002): Translate: Parsing code for COinS (05d07af9-105a-4572-99f6-a8e231c0daef, 2021-06-01 17:38:46)
(4)(+0000004): Translate: Parsing code for Embedded Metadata (951c027d-74ac-47d4-a107-9c3069ab7b48, 2024-03-27 20:15:00)
(3)(+0000000): Translate: Prefix 'og' => 'http://ogp.me/ns#'
(3)(+0000000): Translate: Prefix 'fb' => 'http://ogp.me/ns/fb#'
(3)(+0000000): Translate: Prefix 'article' => 'http://ogp.me/ns/article#'
(3)(+0000000): Translate: Embedded Metadata: found 0 meta tags.
(4)(+0000013): Translate: Parsing code for DOI (c159dcfe-8a53-4301-a499-30f6549c340d, 2024-05-17 20:25:00)
(3)(+0000000): Translate: All translator detect calls and RPC calls complete:
(3)(+0000001):  PubFactory Journals: 200
(3)(+0000000):  DOI: 400

adomasven commented 3 weeks ago

I cannot reproduce this in a new profile with the current release build.