Open antross opened 5 years ago
Per discussion with @molant we should look at what mime-types are most common in the HTTP Archive to help determine what the correct fix is here.
We also need to look at how XML is handled by each of the connectors as we'd likely still want to traverse the DOM in that case so long as it gets parsed correctly.
Also the implementation of `getType` used by the connectors to determine what type of `fetch::end::*` event to send is fairly narrow compared to what browsers support. For `script` it currently only includes `text/javascript`, but the HTML standard defines the possible list as:
application/ecmascript
application/javascript
application/x-ecmascript
application/x-javascript
text/ecmascript
text/javascript
text/javascript1.0
text/javascript1.1
text/javascript1.2
text/javascript1.3
text/javascript1.4
text/javascript1.5
text/jscript
text/livescript
text/x-ecmascript
text/x-javascript
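To make the gap concrete, here is a minimal sketch of a check covering the full list above. The helper name `isJavaScriptMimeType` is hypothetical and is not the existing `getType` implementation; it only assumes the `Content-Type` value arrives as a string.

```typescript
// The sixteen JavaScript MIME type essences listed above.
const JAVASCRIPT_MIME_TYPES: ReadonlySet<string> = new Set([
    'application/ecmascript',
    'application/javascript',
    'application/x-ecmascript',
    'application/x-javascript',
    'text/ecmascript',
    'text/javascript',
    'text/javascript1.0',
    'text/javascript1.1',
    'text/javascript1.2',
    'text/javascript1.3',
    'text/javascript1.4',
    'text/javascript1.5',
    'text/jscript',
    'text/livescript',
    'text/x-ecmascript',
    'text/x-javascript'
]);

/**
 * Hypothetical helper (not webhint's actual `getType`): returns true when
 * the essence of `contentType` (type/subtype, lowercased, parameters
 * stripped) is one of the JavaScript MIME types.
 */
const isJavaScriptMimeType = (contentType: string): boolean => {
    const essence = contentType.split(';')[0].trim().toLowerCase();

    return JAVASCRIPT_MIME_TYPES.has(essence);
};
```

Stripping parameters and lowercasing first means a value like `Application/JavaScript; charset=utf-8` would match as well.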
Additionally `getType` currently only returns `xml` for `text/xml`, but the HTML standard defines XML mime types as `text/xml` and any whose subtype ends in `+xml` (e.g. `image/svg+xml`).
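A rough sketch of what the broader XML check could look like. `isXMLMimeType` is a hypothetical name, not an existing helper. It also accepts `application/xml`, which the MIME Sniffing Standard includes in its definition of an XML MIME type alongside `text/xml` and `+xml` subtypes.

```typescript
/**
 * Hypothetical helper: true when `contentType` is an XML MIME type, i.e. its
 * essence is `text/xml` or `application/xml`, or its subtype ends in `+xml`.
 */
const isXMLMimeType = (contentType: string): boolean => {
    // Drop parameters such as `; charset=utf-8` and normalize case.
    const essence = contentType.split(';')[0].trim().toLowerCase();
    const subtype = essence.split('/')[1] || '';

    return essence === 'text/xml' ||
        essence === 'application/xml' ||
        subtype.endsWith('+xml');
};
```

Note that `application/xhtml+xml` also satisfies this check, so a caller that wants to keep treating it as HTML would need to test for HTML first.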
Also worth noting we use a `getContentTypeData` helper to determine what mime-type to associate with a request. Importantly this looks at the content of a request using the 3rd-party `file-type` library and overrides whatever was provided in the `Content-Type` header.
Notably this can cause us to treat any content starting with `<?xml` as XML, even if it was sent with the `text/html` mime type (this differs from browser behavior).
We should assess how to best align this with browser behavior as well.
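One possible direction, sketched below, is to let a well-formed declared type win and only fall back to sniffing when the header is missing or unusable. This is a simplified approximation and not what browsers do in every case (their sniffing rules are more nuanced). `resolveMediaType` is a hypothetical function, and `sniffed` stands in for whatever a content-based detection step would report.

```typescript
/**
 * Hypothetical resolver: prefers the media type declared in the
 * `Content-Type` header over a sniffed one.
 *
 * @param headerValue Raw `Content-Type` header value, if any.
 * @param sniffed Media type guessed from the content, if any (assumption:
 *                produced elsewhere, e.g. by a content-sniffing library).
 */
const resolveMediaType = (headerValue: string | null, sniffed: string | null): string | null => {
    const declared = headerValue ? headerValue.split(';')[0].trim().toLowerCase() : '';

    // Accept anything shaped like `type/subtype`; no registry validation.
    if (/^[^\s\/]+\/[^\s\/]+$/.test(declared)) {
        return declared;
    }

    // Header missing or malformed (e.g. `;charset=UTF-8`): use the sniffed type.
    return sniffed;
};
```

With this ordering, content starting with `<?xml` but served as `text/html` would stay `text/html`.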
> Per discussion with @molant we should look at what mime-types are most common in the HTTP Archive to help determine what the correct fix is here.
The top 100 mimetypes for the first requests are:
Row | resp_content_type | total | |
---|---|---|---|
1 | text/html; charset=UTF-8 | 1307794 | |
2 | text/html; charset=utf-8 | 688639 | |
3 | text/html | 452979 | |
4 | text/html; charset=iso-8859-1 | 75231 | |
5 | text/html;charset=UTF-8 | 62603 | |
6 | application/ocsp-response | 57046 | |
7 | text/html;charset=utf-8 | 48526 | |
8 | <--No mimetype, for real... | 44948 | |
9 | application/x-x509-ca-cert | 17817 | |
10 | text/html; charset=windows-1251 | 14021 | |
11 | text/html; charset=ISO-8859-1 | 13346 | |
12 | text/html;charset=ISO-8859-1 | 5888 | |
13 | text/html; charset="UTF-8" | 5700 | |
14 | text/html; Charset=UTF-8 | 5327 | |
15 | text/html; charset=EUC-JP | 3455 | |
16 | text/html; Charset=utf-8 | 3308 | |
17 | text/html; charset="utf-8" | 2617 | |
18 | application/pkix-cert | 2418 | |
19 | text/html; charset=euc-kr | 1878 | |
20 | text/html; charset=cp1251 | 1501 | |
21 | text/html; charset=WINDOWS-1251 | 1413 | |
22 | text/html; charset=utf8 | 1247 | |
23 | text/plain; charset=UTF-8 | 1199 | |
24 | text/html; charset=windows-1252 | 1155 | |
25 | text/html; charset=Shift_JIS | 1148 | |
26 | text/html; Charset=ISO-8859-1 | 906 | |
27 | text/html; charset=windows-1250 | 865 | |
28 | text/plain | 865 | |
29 | text/html; charset=iso-8859-2 | 824 | |
30 | text/html; charset=gbk | 759 | |
31 | text/html; charset=iso-8859-15 | 695 | |
32 | text/html; charset=Windows-1251 | 661 | |
33 | text/html; charset=ISO-8859-15 | 502 | |
34 | text/html; charset=CP1251 | 489 | |
35 | text/html; charset=EUC-KR | 476 | |
36 | text/html; charset=ISO-8859-2 | 463 | |
37 | text/html; charset=shift_jis | 449 | |
38 | text/plain; charset=utf-8 | 410 | |
39 | text/html; charset=gb2312 | 363 | |
40 | text/html;charset=iso-8859-1 | 355 | |
41 | text/html; charset=ISO-8859-9 | 353 | |
42 | text/html; charset=iso-8859-9 | 352 | |
43 | text/html; Charset=iso-8859-1 | 333 | |
44 | text/html; charset=euc-jp | 317 | |
45 | text/html; charset=windows-1256 | 284 | |
46 | text/html; Charset=euc-kr | 272 | |
47 | text/html;charset=GBK | 266 | |
48 | text/html; charset=utf-8; | 254 | |
49 | text/html;charset=windows-1255 | 248 | |
50 | text/html; charset=none | 241 | |
51 | application/octet-stream | 237 | |
52 | text/html; charset=tis-620 | 228 | |
53 | text/html; charset=koi8-r | 210 | |
54 | text/html; charset=GBK | 209 | |
55 | text/html; charset=big5 | 199 | |
56 | text/html; charset=UTF-8; | 196 | |
57 | application/x-pkcs7-certificates | 195 | |
58 | text/html; charset=Windows-1252 | 194 | |
59 | text/html;charset=euc-kr | 181 | |
60 | text/html; charset=windows-1254 | 174 | |
61 | text/html; charset=Big5 | 167 | |
62 | text/html; charset=UTF8 | 164 | |
63 | text/html;charset=Windows-31J | 161 | |
64 | text/html;charset=windows-1251 | 154 | |
65 | httpd/unix-directory | 151 | |
66 | text/html; charset=GB2312 | 130 | |
67 | text/html; Charset=windows-1254 | 129 | |
68 | text/html; charset=SJIS | 123 | |
69 | text/html; charset=TIS-620 | 121 | |
70 | text/html; charset=ISO-8859-1 | 120 | |
71 | text/html;charset=EUC-KR | 117 | |
72 | text/html; Charset=windows-1252 | 110 | |
73 | text/html; Charset=UTF-8;charset=UTF-8 | 110 | |
74 | text/html;charset=utf-8; Charset=utf-8 | 101 | |
75 | ;charset=UTF-8 | 99 | |
76 | text/html;charset=windows-1252 | 97 | |
77 | text/html; charset=windows-1255 | 95 | |
78 | text/html; charset=US-ASCII | 94 | |
79 | text/html;charset=Shift_JIS | 88 | |
80 | text/html; charset= | 82 | |
81 | text/html; charset=windows-874 | 82 | |
82 | txt/plain | 80 | |
83 | application/binary | 78 | |
84 | text/html; charset=UTF-8; dir=RTL | 74 | |
85 | text/html; ISO-8859-1 | 74 | |
86 | text/html; charset=cp-1251 | 71 | |
87 | text/html; Charset=UTF-8; charset=utf-8 | 70 | |
88 | text/html;Charset=utf-8;charset=UTF-8 | 70 | |
89 | text/html;charset=utf-8; | 70 | |
90 | text/html;charset=utf8 | 69 | |
91 | text/html;Charset=utf-8 | 69 | |
92 | text/html; charset=latin1 | 65 | |
93 | text/html;; charset=UTF-8 | 63 | |
94 | text/html; charset: iso-8859-1;charset=UTF-8 | 57 | |
95 | text/html; charset=utf-8 | 56 | |
96 | text/html; | 55 | |
97 | text/html; charset=0 | 55 | |
98 | text/html;charset=gbk | 54 | |
99 | text/html; charset=ks_c_5601-1987 | 53 | |
100 | text/html ; charset=UTF-8 | 52 |
The query I used is:
SELECT resp_content_type, count(resp_content_type) as total FROM `httparchive.latest.summary_requests_desktop`
WHERE firstReq = true
GROUP BY resp_content_type
ORDER BY total desc
The results for first requests mime types: mimetype-all-requests-20190315.zip
The results for all requests mime types: mimetype-first-requests-20190315.zip
Some things from that list that are worth talking about (from the PoV of first request):
- `text/html` with variants of charset is the most common one. We should be covered I believe, although we should make sure we parse things correctly because there are a lot of variants...
- Certificate types also appear (`application/x-x509-ca-cert` and `application/pkix-cert`).
- `application/xhtml+xml;` appears in position 174 with 18 instances (if we remove the filter of first request then it drops to position 722). `text/xml` is in position 236 with 9 instances (and in this list there are almost 3 million first requests).
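To illustrate the parsing concern, here is a small sketch of a tolerant `Content-Type` parser that copes with several of the variants in the table (mixed-case `Charset`, quoted values, duplicated parameters, stray semicolons, missing media type). `parseContentType` and `ParsedContentType` are hypothetical names, not existing helpers, and the "first non-empty charset wins" rule is an assumption rather than documented browser behavior.

```typescript
interface ParsedContentType {
    /** Lowercased `type/subtype`, or null when missing or malformed. */
    mediaType: string | null;
    /** Lowercased charset label, or null when missing or empty. */
    charset: string | null;
}

/** Hypothetical tolerant parser for raw `Content-Type` header values. */
const parseContentType = (header: string | null | undefined): ParsedContentType => {
    const result: ParsedContentType = { charset: null, mediaType: null };

    if (!header) {
        return result;
    }

    const [first, ...params] = header.split(';');
    const mediaType = first.trim().toLowerCase();

    if (/^[^\s\/]+\/[^\s\/]+$/.test(mediaType)) {
        result.mediaType = mediaType;
    }

    for (const param of params) {
        const separator = param.indexOf('=');

        // Skip empty segments (`;;`) and malformed ones (`charset: x`).
        if (separator === -1) {
            continue;
        }

        const name = param.slice(0, separator).trim().toLowerCase();
        const value = param.slice(separator + 1).trim()
            .replace(/^"(.*)"$/, '$1') // strip surrounding quotes
            .toLowerCase();

        // Assumption: the first non-empty charset parameter wins.
        if (name === 'charset' && value && result.charset === null) {
            result.charset = value;
        }
    }

    return result;
};
```

The sketch does not validate charset labels, so values such as `none`, `0`, or `utf8` from the table would be returned as-is and would still need to be mapped to a real encoding.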
Found because this broke some tests while trying to update `hint-meta-charset-utf-8` to remove `cheerio`.

The logic also appears to be inconsistent between connectors, at least between `connector-jsdom` and `utils-debugging-protocol-common`. This should be fixed to use common helpers for both scenarios and across connectors.

Triggering `fetch::end::html` in `connector-jsdom`:

Triggering `fetch::end::html` in `utils-debugging-protocol-common`:

Both rely on the helper `getType`, but `utils-debugging-protocol-common` performs post-processing that isn't done anywhere else and treats more content as HTML. In `getType` itself, both `text/html` and `application/xhtml+xml` are accepted as HTML:

Triggering a traverse in `connector-jsdom` and `utils-debugging-protocol-common` uses a different helper, `isHTMLDocument`, which treats anything from the filesystem as HTML regardless of type, but only accepts `text/html` for remote documents.
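
As a starting point for discussion, a shared helper could look roughly like the sketch below. All names here (`HTML_MEDIA_TYPES`, `isHTMLMediaType`, `shouldTreatAsHTML`) are hypothetical and do not exist in the codebase. It keeps the filesystem behavior described above and, as an assumption about the desired fix, accepts the same two HTML types for remote documents that `getType` accepts.

```typescript
// Media types treated as HTML, mirroring what `getType` is described as accepting.
const HTML_MEDIA_TYPES: ReadonlySet<string> = new Set([
    'text/html',
    'application/xhtml+xml'
]);

/** Hypothetical: true when the header's essence is an HTML media type. */
const isHTMLMediaType = (contentType: string | null): boolean => {
    const essence = (contentType || '').split(';')[0].trim().toLowerCase();

    return HTML_MEDIA_TYPES.has(essence);
};

/**
 * Hypothetical shared check that both the `fetch::end::html` and the
 * traverse code paths could call, so the connectors agree.
 */
const shouldTreatAsHTML = (resourceUrl: string, contentType: string | null): boolean => {
    // Local files are treated as HTML regardless of type, as described above.
    const isLocalFile = resourceUrl.toLowerCase().startsWith('file:');

    return isLocalFile || isHTMLMediaType(contentType);
};
```

Whether local files should keep bypassing the type check is a separate question; the sketch only shows a single place where that decision could live.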