zotero / translation-server

A Node.js-based server to run Zotero translators

translation-server should exclude css/html/js from metadata - site is returning js and css in author field. #111

Open mvolz opened 4 years ago

mvolz commented 4 years ago

curl -d 'https://www.milliyet.com.tr/gundem/canan-dagdeviren-kimdir-2392696' -H 'Content-Type: text/plain' http://127.0.0.1:1969/web

gives the results

[{"key":"F4KIRRQ3","version":0,"itemType":"webpage","creators":[{"firstName":"player-inline {display: inline-block;padding-bottom: 56 25%;position: relative;width: 100%;z-index: 5;} player-box {height: 100%;left: 0;position: absolute;top: 0;width: 100%;}$ ready{quarkPlayer = new","lastName":"QuarkPlayer","creatorType":"author"},{"name":"bufferLength:5","creatorType":"author"},{"firstName":"autoPlay:","lastName":"false","creatorType":"author"},{"firstName":"subTitles:","lastName":"false","creatorType":"author"},{"firstName":"showAds:","lastName":"true","creatorType":"author"},{"firstName":"showNotification:","lastName":"false","creatorType":"author"},{"name":"showB","creatorType":"author"},{"firstName":"widthSelector:","lastName":"true","creatorType":"author"},{"firstName":"customMenu:","lastName":"false","creatorType":"author"},{"name":"Preload: 'None'","creatorType":"author"},{"firstName":"Playsinline:","lastName":"True","creatorType":"author"},{"firstName":"Live:","lastName":"False","creatorType":"author"},{"name":"Poster: 'Https://I2.milimaj.com/I/Milliyet/75/800x450/5e4231ac55427f1b70cf438a.jpg'","creatorType":"author"},{"name":"sources:","creatorType":"author"},{"name":"playType: 
\"newsdetail\"","creatorType":"author"},{"name":"adTags","creatorType":"author"},{"name":"cust_params=keyword%3dVid_duration_1_3%2cVid_pubdate_new%2cseeding_false%2cautoplay_false%2csilentstart_false%2cst_none%2cpremium_video%26contentid%3d6142190%26kategori%3dml_mtv_milliyet-tv_haberler%26catlist%3dc1_milliyet-tv%2cc2_haberler%2cCct_sivas%2cCct_soguk-hava%2cCct_sicak-su%2cCct_buz%2cct_sivas%2cct_soguk-hava%2cct_sicak-su%2cct_buz%26pub_name%3dmilliyet","creatorType":"author"},{"name":"vpos=preroll\"}","creatorType":"author"},{"name":"{\"id\":\"overlay\"","creatorType":"author"},{"name":"\"offset\":\"00:00:05.000\"","creatorType":"author"},{"name":"\"type\":\"nonlinear\"","creatorType":"author"},{"name":"\"url\":\"https://pubads.g.doubleclick.net/gampad/ads?sz=640x360","creatorType":"author"},{"name":"iu=/9927946/milliyet/sitegeneli/overlay","creatorType":"author"},{"name":"impl=s","creatorType":"author"},{"name":"gdfp_req=1","creatorType":"author"},{"name":"env=vp","creatorType":"author"},{"name":"output=vast","creatorType":"author"},{"name":"unviewed_position_start=1","creatorType":"author"},{"name":"url=https://www.milliyet.com.tr/gundem/canan-dagdeviren-kimdir-2392696","creatorType":"author"},{"name":"description_url=http%3a%2f%2fwww.milliyet.com.tr%2fgundem%2fcanan-dagdeviren-kimdir-2392696","creatorType":"author"},{"name":"correlator=","creatorType":"author"},{"name":"cust_params=keyword%3dVid_duration_1_3%2cVid_pubdate_new%2cseeding_false%2cautoplay_false%2csilentstart_false%2cst_none%2cpremium_video%26contentid%3d6142190%26kategori%3dml_mtv_milliyet-tv_haberler%26catlist%3dc1_milliyet-tv%2cc2_haberler%2cCct_sivas%2cCct_soguk-hava%2cCct_sicak-su%2cCct_buz%2cct_sivas%2cct_soguk-hava%2cct_sicak-su%2cct_buz%26pub_name%3dmilliyet","creatorType":"author"},{"name":"vpos=overlay","creatorType":"author"},{"name":"overlay=1\"}","creatorType":"author"},{"name":"{\"id\":\"postroll\"","creatorType":"author"},{"name":"\"offset\":\"end\"","creatorType":"author"},{"name":"\"
type\":\"linear\"","creatorType":"author"},{"name":"\"url\":\"https://pubads.g.doubleclick.net/gampad/ads?sz=640x360","creatorType":"author"},{"name":"iu=/9927946/milliyet/sitegeneli/postroll","creatorType":"author"},{"name":"impl=s","creatorType":"author"},{"name":"gdfp_req=1","creatorType":"author"},{"name":"env=vp","creatorType":"author"},{"name":"output=vast","creatorType":"author"},{"name":"unviewed_position_start=1","creatorType":"author"},{"name":"url=https://www.milliyet.com.tr/gundem/canan-dagdeviren-kimdir-2392696","creatorType":"author"},{"name":"description_url=http%3a%2f%2fwww.milliyet.com.tr%2fgundem%2fcanan-dagdeviren-kimdir-2392696","creatorType":"author"},{"name":"correlator=","creatorType":"author"},{"name":"cust_params=keyword%3dVid_duration_1_3%2cVid_pubdate_new%2cseeding_false%2cautoplay_false%2csilentstart_false%2cst_none%2cpremium_video%26contentid%3d6142190%26kategori%3dml_mtv_milliyet-tv_haberler%26catlist%3dc1_milliyet-tv%2cc2_haberler%2cCct_sivas%2cCct_soguk-hava%2cCct_sicak-su%2cCct_buz%2cct_sivas%2cct_soguk-hava%2cct_sicak-su%2cct_buz%26pub_name%3dmilliyet","creatorType":"author"},{"name":"vpos=postroll\"}]","creatorType":"author"},{"name":"plugins:","creatorType":"author"},{"name":"stats: {gemius: {identifier: 'bIFA4t.SzzEb53fr9ZSQl2ZVzQXZZ4NyqW0wgJzlvwb.e7'}","creatorType":"author"},{"name":"Clicks: {portal: \"Webtv\"","creatorType":"author"},{"name":"Action: \"Video\"","creatorType":"author"},{"firstName":"pathname: \"O ilimizde hava eksi 21 dereceyi gördü! 
Hayat buz kesti | Haberler |","lastName":"sivas","creatorType":"author"},{"firstName":"Soğuk","lastName":"Hava","creatorType":"author"},{"firstName":"Sıcak","lastName":"Su","creatorType":"author"},{"firstName":"Buz | 117 |","lastName":"Newsdetail\"","creatorType":"author"},{"name":"newsCategory : '/milliyet-tv/haberler/'","creatorType":"author"},{"name":"Base_url: 'Https://Www.milliyet.com.tr/Milliyet-Tv/O-Ilimizde-Hava-Eksi-21-Dereceyi-Gordu-Hayat-Buz-Kesti-6142190'}","creatorType":"author"},{"name":"Bluekai: {}}","creatorType":"author"},{"firstName":"htvThumbnails: {showThumbnail:","lastName":"false","creatorType":"author"},{"name":"thumbnailUrl : '//videocdn.milliyet.com.tr/2020/02/11/mtv_6142190_thmb.jpg'","creatorType":"author"},{"name":"thumbnailWidth: '128'","creatorType":"author"},{"name":"thumbnailHeight: '72'}","creatorType":"author"},{"firstName":"hotkeys: {enableVolumeScroll:","lastName":"false}","creatorType":"author"},{"firstName":"suggestedVideos: {showSuggestedVideos:","lastName":"true","creatorType":"author"},{"name":"nextVideoSummonTime:7","creatorType":"author"},{"name":"autoNextSuggestedVideos:false","creatorType":"author"},{"firstName":"suggestedVideoList: null}});});O ilimizde hava eksi 21 dereceyi gördü! Hayat buz kestiSivas'ta gece saatlerinde termometreler eksi 21 dereceyi gösterdi Havaya serpilen sıcak su yere buz taneciği olarak düştü Sivas'ta günlerdir etkisini sürdüren soğuk hava gece yarısı eksi 21 dereceye düştü Hayat adeta buz","lastName":"kesti","creatorType":"author"},{"firstName":"Caddeler Tamamen Boşaldı daha Fazla Video","lastName":"Için","creatorType":"author"}],"tags":[],"title":"Canan Dağdeviren kimdir?","websiteTitle":"Milliyet","url":"https://www.milliyet.com.tr/gundem/canan-dagdeviren-kimdir-2392696","abstractNote":"Canan Dağdeviren kimdir? Dünyanın en iyi akademisyenlerini Boğaziçi Lectures kapsamında konuşmacı olarak misafir edecek olan Boğaziçi Üniversitesi Giyilebilir kalp pilinin mucidi Dr. 
Canan Dağdeviren'i konuk ediyor. Bilimsel anlamda birçok başarıya imza atan Canan Dağdeviren aynı zamanda Forbes dergisinin 30 yaş altı Bilim insanı listesinde de yer alıyor","language":"tr","accessDate":"2020-02-11T12:25:16Z"}]

This is a security issue. We automatically block the edit, but that's not ideal.

I think I raised this years ago and was advised to file a bug in the translators repo instead, but I really think translation-server should ideally guard against this as well! The bug for the translators repo is here: https://github.com/zotero/translators/issues/2117

dstillman commented 4 years ago

This isn't really something we can fix generally, but I've committed a change that, when paired with an updated Embedded Metadata (EM) translator, will mostly fix the above example. We can't fully emulate the browser's innerText, because JSDOM doesn't support it, but we can at least exclude script and style content when doing fallback author parsing (i.e., parsing elements with byline or vcard classes, which have no actual meaning but usually indicate the presence of an author).
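To illustrate the idea of skipping script and style content, here is a minimal sketch. It is not the committed change: it walks a hypothetical plain-object tree (the `{ tag, text, children }` shape, `SKIPPED_TAGS`, and `visibleText` are all assumptions made up for this example) rather than a real JSDOM document, so that it runs without any dependencies.

```javascript
// Hypothetical node shape for illustration only:
//   text node:    { tag: null, text: "..." }
//   element node: { tag: "div", children: [ ... ] }
const SKIPPED_TAGS = new Set(["script", "style"]);

// Concatenate the text of a subtree, ignoring anything inside
// <script> or <style> elements.
function visibleText(node) {
  if (node.tag === null) {
    return node.text; // plain text node
  }
  if (SKIPPED_TAGS.has(node.tag.toLowerCase())) {
    return ""; // drop the whole subtree
  }
  return (node.children || []).map(visibleText).join("");
}

// A byline-like element that also wraps player CSS and JS,
// loosely modeled on the page in the report above.
const byline = {
  tag: "div",
  children: [
    { tag: "style", children: [{ tag: null, text: ".player-box {height: 100%;}" }] },
    { tag: "script", children: [{ tag: null, text: "quarkPlayer = new QuarkPlayer();" }] },
    { tag: "span", children: [{ tag: null, text: "Jane Doe" }] },
  ],
};

console.log(visibleText(byline)); // prints "Jane Doe"
```

Unlike a browser's innerText, this sketch does not account for elements hidden by CSS or for layout-dependent whitespace, so it only removes the most obvious junk.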

Note that this isn't really a security issue in any general sense. The issue here is just junk data in creator entries, and what exactly that data is shouldn't matter. translation-server returns text, not code. If it's being inserted into an HTML document, it should be properly escaped, as with any other untrusted input. (The one exception might be the few tags allowed by Zotero/citeproc-js, which it's possible some translators include in returned data, but that would be a strict whitelist.)
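For a consumer of translation-server output, escaping could look roughly like the sketch below. The `escapeHtml` helper, the `HTML_ESCAPES` table, and the sample `creator` object are hypothetical names written for this example; they are not part of translation-server, and the sketch does not handle the small set of tags allowed by Zotero/citeproc-js.

```javascript
// Characters that are significant in HTML text and attribute values.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Escape an untrusted value before placing it into HTML markup.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Hypothetical creator entry containing markup instead of a name.
const creator = { name: "<script>alert(1)</script>", creatorType: "author" };

const html = `<li>${escapeHtml(creator.name)}</li>`;
console.log(html); // prints "<li>&lt;script&gt;alert(1)&lt;/script&gt;</li>"
```

Escaping at the point of insertion treats every returned field the same way, so it does not matter whether the junk in a creator entry happens to be CSS, JavaScript, or ordinary text.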