webhintio / hint

💡 A hinting engine for the web
https://webhint.io/
Apache License 2.0
3.63k stars 691 forks source link

[Bug] `fetch::end::html` and `traverse::*` use different logic to detect HTML #2067

Open antross opened 5 years ago

antross commented 5 years ago

Found because this broke some tests while trying to update hint-meta-charset-utf-8 to remove cheerio.

The logic also appears to be inconsistent between connectors, at least between connector-jsdom and utils-debugging-protocol-common. This should be fixed to use common helpers for both scenarios and across connectors.

Triggering fetch::end::html in connector-jsdom:

// Event is also emitted when status code in response is not 200.
await this.server.emitAsync(`fetch::end::${getType(mediaType!)}` as 'fetch::end::*', fetchEnd);

Triggering fetch::end::html in utils-debugging-protocol-common:

        let suffix = getType(response.mediaType);
        const defaults = ['unknown', 'xml'];

        if (isTarget && defaults.includes(suffix)) {
            suffix = 'html';
        }

        const eventName = `fetch::end::${suffix}` as 'fetch::end::*';

Both rely on the helper getType, but utils-debugging-protocol-common performs post-processing that isn't done anywhere else and treats more content as HTML. In getType itself, both text/html and application/xhtml+xml are accepted as HTML:

        case 'text/html':
        case 'application/xhtml+xml':
            return 'html';

Triggering a traverse in connector-jsdom and utils-debugging-protocol-common both use a different helper, isHTMLDocument, which treats anything from the filesystem as HTML regardless of type, but only accepts text/html for remote documents

/** Convenience function to check if a resource is a HTMLDocument. */
export default (targetURL: string, responseHeaders: HttpHeaders): boolean => {

    // If it's a local file, presume it's a HTML document.

    if (isLocalFile(targetURL)) {
        return true;
    }

    // Otherwise, check.

    const contentTypeHeaderValue: string | undefined = responseHeaders['content-type'];
    let mediaType: string;

    try {
        mediaType = parseContentTypeHeader(contentTypeHeaderValue || '').type;
    } catch (e) {
        return false;
    }

    return mediaType === 'text/html';
};
antross commented 5 years ago

Per discussion with @molant we should look at what mime-types are most common in the HTTP Archive to help determine what the correct fix is here.

We also need to look at how XML is handled by each of the connectors as we'd likely still want to traverse the DOM in that case so long as it gets parsed correctly.

antross commented 5 years ago

Also the implementation of getType used by the connectors to determine what type of fetch::end::* event to send is fairly narrow compared to what browsers support. For script it currently only includes text/javascript, but the HTML standard defines the possible list as:

antross commented 5 years ago

Additionally getType currently only returns xml for text/xml, but the HTML standard defines XML mime types as text/xml and any whose subtype ends in +xml (e.g. image/svg+xml).

antross commented 5 years ago

Also worth noting we use a getContentTypeData helper to determine what mime-type to associate with a request. Importantly this looks at the content of a request using the 3rd-party file-type library and overrides whatever was provided in the Content-Type header.

Notably this can cause us to treat any content starting with <?xml as XML, even if it was sent with the text/html mime type (this differs from browser behavior).

We should assess how to best align this with browser behavior as well.

molant commented 5 years ago

Per discussion with @molant we should look at what mime-types are most common in the HTTP Archive to help determine what the correct fix is here.

The top 100 mimetypes for the first requests are:

Row resp_content_type total
1 text/html; charset=UTF-8 1307794
2 text/html; charset=utf-8 688639  
3 text/html 452979  
4 text/html; charset=iso-8859-1 75231  
5 text/html;charset=UTF-8 62603  
6 application/ocsp-response 57046  
7 text/html;charset=utf-8 48526  
8 <--No mimetype, for real...  44948  
9 application/x-x509-ca-cert 17817  
10 text/html; charset=windows-1251 14021  
11 text/html; charset=ISO-8859-1 13346  
12 text/html;charset=ISO-8859-1 5888  
13 text/html; charset="UTF-8" 5700  
14 text/html; Charset=UTF-8 5327  
15 text/html; charset=EUC-JP 3455  
16 text/html; Charset=utf-8 3308  
17 text/html; charset="utf-8" 2617  
18 application/pkix-cert 2418  
19 text/html; charset=euc-kr 1878  
20 text/html; charset=cp1251 1501  
21 text/html; charset=WINDOWS-1251 1413  
22 text/html; charset=utf8 1247  
23 text/plain; charset=UTF-8 1199  
24 text/html; charset=windows-1252 1155  
25 text/html; charset=Shift_JIS 1148  
26 text/html; Charset=ISO-8859-1 906  
27 text/html; charset=windows-1250 865  
28 text/plain 865  
29 text/html; charset=iso-8859-2 824  
30 text/html; charset=gbk 759  
31 text/html; charset=iso-8859-15 695  
32 text/html; charset=Windows-1251 661  
33 text/html; charset=ISO-8859-15 502  
34 text/html; charset=CP1251 489  
35 text/html; charset=EUC-KR 476  
36 text/html; charset=ISO-8859-2 463  
37 text/html; charset=shift_jis 449  
38 text/plain; charset=utf-8 410  
39 text/html; charset=gb2312 363  
40 text/html;charset=iso-8859-1 355  
41 text/html; charset=ISO-8859-9 353  
42 text/html; charset=iso-8859-9 352  
43 text/html; Charset=iso-8859-1 333  
44 text/html; charset=euc-jp 317  
45 text/html; charset=windows-1256 284  
46 text/html; Charset=euc-kr 272  
47 text/html;charset=GBK 266  
48 text/html; charset=utf-8; 254  
49 text/html;charset=windows-1255 248  
50 text/html; charset=none 241  
51 application/octet-stream 237  
52 text/html; charset=tis-620 228  
53 text/html; charset=koi8-r 210  
54 text/html; charset=GBK 209  
55 text/html; charset=big5 199  
56 text/html; charset=UTF-8; 196  
57 application/x-pkcs7-certificates 195  
58 text/html; charset=Windows-1252 194  
59 text/html;charset=euc-kr 181  
60 text/html; charset=windows-1254 174  
61 text/html; charset=Big5 167  
62 text/html; charset=UTF8 164  
63 text/html;charset=Windows-31J 161  
64 text/html;charset=windows-1251 154  
65 httpd/unix-directory 151  
66 text/html; charset=GB2312 130  
67 text/html; Charset=windows-1254 129  
68 text/html; charset=SJIS 123  
69 text/html; charset=TIS-620 121  
70 text/html; charset=ISO-8859-1 120  
71 text/html;charset=EUC-KR 117  
72 text/html; Charset=windows-1252 110  
73 text/html; Charset=UTF-8;charset=UTF-8 110  
74 text/html;charset=utf-8; Charset=utf-8 101  
75 ;charset=UTF-8 99  
76 text/html;charset=windows-1252 97  
77 text/html; charset=windows-1255 95  
78 text/html; charset=US-ASCII 94  
79 text/html;charset=Shift_JIS 88  
80 text/html; charset= 82  
81 text/html; charset=windows-874 82  
82 txt/plain 80  
83 application/binary 78  
84 text/html; charset=UTF-8; dir=RTL 74  
85 text/html; ISO-8859-1 74  
86 text/html; charset=cp-1251 71  
87 text/html; Charset=UTF-8; charset=utf-8 70  
88 text/html;Charset=utf-8;charset=UTF-8 70  
89 text/html;charset=utf-8; 70  
90 text/html;charset=utf8 69  
91 text/html;Charset=utf-8 69  
92 text/html; charset=latin1 65  
93 text/html;; charset=UTF-8 63  
94 text/html; charset: iso-8859-1;charset=UTF-8 57  
95 text/html; charset=utf-8 56  
96 text/html; 55  
97 text/html; charset=0 55  
98 text/html;charset=gbk 54  
99 text/html; charset=ks_c_5601-1987 53  
100 text/html ; charset=UTF-8 52

The query I've used is

SELECT resp_content_type, count(resp_content_type) as total FROM `httparchive.latest.summary_requests_desktop`
  WHERE firstReq = true
  GROUP BY resp_content_type
  ORDER BY total desc

The results for first requests mime types: [ mimetype-all-requests-20190315.zip The results for all requests mime types: mimetype-first-requests-20190315.zip

Some things from that list that are worth talk about (from the PoV of first request):