zotero / zotero-connectors

Chrome, Firefox, Edge, and Safari extensions for Zotero
https://www.zotero.org/download/connectors

MV3: Noscript tags with img breaking DOMParser in translator sandbox #477

Open adomasven opened 5 months ago

adomasven commented 5 months ago

Discovered in: https://github.com/zotero/translators/issues/3311#issuecomment-2148755216
Problem page: https://journals.ametsoc.org/view/journals/phoc/53/1/JPO-D-22-0001.1.xml

The Embedded Metadata (EM) translator is not detected because no meta tags are found. I've discovered that:

document.head.children.length // 191
new DOMParser().parseFromString(document.head.outerHTML, 'text/html').head.children.length // 20

This is because the 21st tag is

<noscript id="page_tag"><img alt="" vspace="0" hspace="0" border="0" width="1" height="1" language="//pftag.scholarlyiq.com/siqpagetag.gif?js=0"/></noscript>

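Apparently, an img inside a noscript in head is invalid when scripting is disabled, which is how DOMParser parses. The parser treats the head element as terminated at that point and starts the body element, so every tag after it, including the meta tags, ends up in body. In the live page, where scripting is enabled, the noscript content is kept as plain text, so head stays intact. So it seems like this page is intentionally preventing crawlers and the like from accessing the meta tags in the head element, or something like that.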

Anyway, as a proposed solution, I think we should strip all <noscript> tags from <head> in MV3 before parsing.
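
A minimal sketch of that idea, assuming the head is available as serialized HTML. `stripNoscript` and `parseHeadWithoutNoscript` are hypothetical names for illustration, not functions in zotero-connectors, and the regex approach is fragile for arbitrary HTML; it is only meant to show the stripping step.

```javascript
// Hypothetical helper (not part of zotero-connectors): remove every
// <noscript>...</noscript> element from a serialized <head> string.
// A regex is fragile for general HTML; it is used here only as a sketch.
function stripNoscript(headHTML) {
  return headHTML.replace(/<noscript\b[^>]*>[\s\S]*?<\/noscript\s*>/gi, "");
}

// Browser-only usage (DOMParser is not available in plain Node.js):
// re-parse the head after stripping, so an <img> inside <noscript>
// can no longer terminate <head> early.
function parseHeadWithoutNoscript(doc) {
  const cleaned = stripNoscript(doc.head.outerHTML);
  return new DOMParser().parseFromString(cleaned, "text/html");
}
```

With the noscript elements gone, the re-parsed `head.children.length` should match the live document's count minus the removed noscript tags, which is a quick way to check that nothing else is cutting the head short.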

dstillman commented 5 months ago

I'm still not getting EM on this page. The main translator is fixed, so it's less of an issue here, but this is likely preventing EM from working elsewhere.

(4)(+0000028): Translate: Binding sandbox to https://journals.ametsoc.org/view/journals/phoc/53/1/JPO-D-22-0001.1.xml
(4)(+0000003): Translate: Parsing code for PubFactory Journals (8d1fb775-df6d-4069-8830-1dfe8e8387dd, 2024-06-04 18:20:00)
(4)(+0000014): Translate: Parsing code for unAPI (e7e01cac-1e37-4da6-b078-a0e8343b0e98, 2019-06-10 23:11:21)
(4)(+0000002): Translate: Parsing code for COinS (05d07af9-105a-4572-99f6-a8e231c0daef, 2021-06-01 17:38:46)
(4)(+0000004): Translate: Parsing code for Embedded Metadata (951c027d-74ac-47d4-a107-9c3069ab7b48, 2024-03-27 20:15:00)
(3)(+0000000): Translate: Prefix 'og' => 'http://ogp.me/ns#'
(3)(+0000000): Translate: Prefix 'fb' => 'http://ogp.me/ns/fb#'
(3)(+0000000): Translate: Prefix 'article' => 'http://ogp.me/ns/article#'
(3)(+0000000): Translate: Embedded Metadata: found 0 meta tags.
(4)(+0000013): Translate: Parsing code for DOI (c159dcfe-8a53-4301-a499-30f6549c340d, 2024-05-17 20:25:00)
(3)(+0000000): Translate: All translator detect calls and RPC calls complete:
(3)(+0000001):  PubFactory Journals: 200
(3)(+0000000):  DOI: 400

adomasven commented 5 months ago

I cannot reproduce this in a new profile with the current release build.