[Bug] `fetch::end::html` and `traverse::*` use different logic to detect HTML

antross commented 5 years ago

Found because this broke some tests while trying to update hint-meta-charset-utf-8 to remove cheerio.

The logic also appears to be inconsistent between connectors, at least between connector-jsdom and utils-debugging-protocol-common. This should be fixed to use common helpers for both scenarios and across connectors.

Triggering fetch::end::html in connector-jsdom:

// Event is also emitted when status code in response is not 200.
await this.server.emitAsync(`fetch::end::${getType(mediaType!)}` as 'fetch::end::*', fetchEnd);

Triggering fetch::end::html in utils-debugging-protocol-common:

        let suffix = getType(response.mediaType);
        const defaults = ['unknown', 'xml'];

        if (isTarget && defaults.includes(suffix)) {
            suffix = 'html';
        }

        const eventName = `fetch::end::${suffix}` as 'fetch::end::*';

Both rely on the helper getType, but utils-debugging-protocol-common performs post-processing that isn't done anywhere else and treats more content as HTML. In getType itself, both text/html and application/xhtml+xml are accepted as HTML:

        case 'text/html':
        case 'application/xhtml+xml':
            return 'html';

Triggering a traverse in connector-jsdom and utils-debugging-protocol-common both use a different helper, isHTMLDocument, which treats anything from the filesystem as HTML regardless of type, but only accepts text/html for remote documents

/** Convenience function to check if a resource is a HTMLDocument. */
export default (targetURL: string, responseHeaders: HttpHeaders): boolean => {

    // If it's a local file, presume it's a HTML document.

    if (isLocalFile(targetURL)) {
        return true;
    }

    // Otherwise, check.

    const contentTypeHeaderValue: string | undefined = responseHeaders['content-type'];
    let mediaType: string;

    try {
        mediaType = parseContentTypeHeader(contentTypeHeaderValue || '').type;
    } catch (e) {
        return false;
    }

    return mediaType === 'text/html';
};

antross commented 5 years ago

Per discussion with @molant we should look at what mime-types are most common in the HTTP Archive to help determine what the correct fix is here.

We also need to look at how XML is handled by each of the connectors as we'd likely still want to traverse the DOM in that case so long as it gets parsed correctly.

antross commented 5 years ago

Also the implementation of getType used by the connectors to determine what type of fetch::end::* event to send is fairly narrow compared to what browsers support. For script it currently only includes text/javascript, but the HTML standard defines the possible list as:

application/ecmascript
application/javascript
application/x-ecmascript
application/x-javascript
text/ecmascript
text/javascript
text/javascript1.0
text/javascript1.1
text/javascript1.2
text/javascript1.3
text/javascript1.4
text/javascript1.5
text/jscript
text/livescript
text/x-ecmascript
text/x-javascript

antross commented 5 years ago

Additionally getType currently only returns xml for text/xml, but the HTML standard defines XML mime types as text/xml and any whose subtype ends in +xml (e.g. image/svg+xml).

antross commented 5 years ago

Also worth noting we use a getContentTypeData helper to determine what mime-type to associate with a request. Importantly this looks at the content of a request using the 3rd-party file-type library and overrides whatever was provided in the Content-Type header.

Notably this can cause us to treat any content starting with <?xml as XML, even if it was sent with the text/html mime type (this differs from browser behavior).

We should assess how to best align this with browser behavior as well.

molant commented 5 years ago

Per discussion with @molant we should look at what mime-types are most common in the HTTP Archive to help determine what the correct fix is here.

The top 100 mimetypes for the first requests are:

Row	resp_content_type	total
1	text/html; charset=UTF-8	1307794
2	text/html; charset=utf-8	688639
3	text/html	452979
4	text/html; charset=iso-8859-1	75231
5	text/html;charset=UTF-8	62603
6	application/ocsp-response	57046
7	text/html;charset=utf-8	48526
8	<--No mimetype, for real...	44948
9	application/x-x509-ca-cert	17817
10	text/html; charset=windows-1251	14021
11	text/html; charset=ISO-8859-1	13346
12	text/html;charset=ISO-8859-1	5888
13	text/html; charset="UTF-8"	5700
14	text/html; Charset=UTF-8	5327
15	text/html; charset=EUC-JP	3455
16	text/html; Charset=utf-8	3308
17	text/html; charset="utf-8"	2617
18	application/pkix-cert	2418
19	text/html; charset=euc-kr	1878
20	text/html; charset=cp1251	1501
21	text/html; charset=WINDOWS-1251	1413
22	text/html; charset=utf8	1247
23	text/plain; charset=UTF-8	1199
24	text/html; charset=windows-1252	1155
25	text/html; charset=Shift_JIS	1148
26	text/html; Charset=ISO-8859-1	906
27	text/html; charset=windows-1250	865
28	text/plain	865
29	text/html; charset=iso-8859-2	824
30	text/html; charset=gbk	759
31	text/html; charset=iso-8859-15	695
32	text/html; charset=Windows-1251	661
33	text/html; charset=ISO-8859-15	502
34	text/html; charset=CP1251	489
35	text/html; charset=EUC-KR	476
36	text/html; charset=ISO-8859-2	463
37	text/html; charset=shift_jis	449
38	text/plain; charset=utf-8	410
39	text/html; charset=gb2312	363
40	text/html;charset=iso-8859-1	355
41	text/html; charset=ISO-8859-9	353
42	text/html; charset=iso-8859-9	352
43	text/html; Charset=iso-8859-1	333
44	text/html; charset=euc-jp	317
45	text/html; charset=windows-1256	284
46	text/html; Charset=euc-kr	272
47	text/html;charset=GBK	266
48	text/html; charset=utf-8;	254
49	text/html;charset=windows-1255	248
50	text/html; charset=none	241
51	application/octet-stream	237
52	text/html; charset=tis-620	228
53	text/html; charset=koi8-r	210
54	text/html; charset=GBK	209
55	text/html; charset=big5	199
56	text/html; charset=UTF-8;	196
57	application/x-pkcs7-certificates	195
58	text/html; charset=Windows-1252	194
59	text/html;charset=euc-kr	181
60	text/html; charset=windows-1254	174
61	text/html; charset=Big5	167
62	text/html; charset=UTF8	164
63	text/html;charset=Windows-31J	161
64	text/html;charset=windows-1251	154
65	httpd/unix-directory	151
66	text/html; charset=GB2312	130
67	text/html; Charset=windows-1254	129
68	text/html; charset=SJIS	123
69	text/html; charset=TIS-620	121
70	text/html; charset=ISO-8859-1	120
71	text/html;charset=EUC-KR	117
72	text/html; Charset=windows-1252	110
73	text/html; Charset=UTF-8;charset=UTF-8	110
74	text/html;charset=utf-8; Charset=utf-8	101
75	;charset=UTF-8	99
76	text/html;charset=windows-1252	97
77	text/html; charset=windows-1255	95
78	text/html; charset=US-ASCII	94
79	text/html;charset=Shift_JIS	88
80	text/html; charset=	82
81	text/html; charset=windows-874	82
82	txt/plain	80
83	application/binary	78
84	text/html; charset=UTF-8; dir=RTL	74
85	text/html; ISO-8859-1	74
86	text/html; charset=cp-1251	71
87	text/html; Charset=UTF-8; charset=utf-8	70
88	text/html;Charset=utf-8;charset=UTF-8	70
89	text/html;charset=utf-8;	70
90	text/html;charset=utf8	69
91	text/html;Charset=utf-8	69
92	text/html; charset=latin1	65
93	text/html;; charset=UTF-8	63
94	text/html; charset: iso-8859-1;charset=UTF-8	57
95	text/html; charset=utf-8	56
96	text/html;	55
97	text/html; charset=0	55
98	text/html;charset=gbk	54
99	text/html; charset=ks_c_5601-1987	53
100	text/html ; charset=UTF-8	52

The query I've used is

SELECT resp_content_type, count(resp_content_type) as total FROM `httparchive.latest.summary_requests_desktop`
  WHERE firstReq = true
  GROUP BY resp_content_type
  ORDER BY total desc

The results for first requests mime types: [ mimetype-all-requests-20190315.zip The results for all requests mime types: mimetype-first-requests-20190315.zip

Some things from that list that are worth talk about (from the PoV of first request):

text/html and avriants of charset is the most common on, we should be covered I believe although we should make sure we parse things correctly because there's a lot of variants...
Mime type's position 8 is no mime type at all 😱
there are quite a few that require encryption (application/x-x509-ca-cert and application/pkix-cert)
The first time application/xhtml+xml; appears is in position 174 with 18 instances (if we remove the filter of first request then it drops to position 722), text/xml is in position 236 with 9 instances (and in this list there are almost 3 million first requests)

webhintio / hint

[Bug] `fetch::end::html` and `traverse::*` use different logic to detect HTML #2067