whatwg / encoding

Encoding Standard
https://encoding.spec.whatwg.org/

Add UTF-7 to replacement encoding list? / Encoding sniffing #68

Open jungshik opened 8 years ago

jungshik commented 8 years ago

Maybe this was discussed before, but I couldn't find a bug on this. What do you think of treating UTF-7 the same way as ISO-2022-{KR,CN}, HZ-GB, etc?

When decoding, the whole input is replaced by U+FFFD. When encoding, use UTF-8.
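
For concreteness, a minimal Python sketch (illustrative only, not Blink or Gecko code) of what that proposal would mean, following the Encoding Standard's replacement decoder (a non-empty stream decodes to a single U+FFFD) with UTF-8 used for encoding; the label set below is abbreviated, and "utf-7" appears only because this issue proposes it:

```python
# Minimal sketch (illustrative only, not Blink/Gecko code) of the Encoding
# Standard's "replacement" encoding semantics, with "utf-7" added per this
# proposal. The label set is abbreviated.

REPLACEMENT_LABELS = {
    "csiso2022kr", "hz-gb-2312", "iso-2022-cn", "iso-2022-cn-ext", "iso-2022-kr",
    "utf-7",  # the addition proposed in this issue
}

def decode(label: str, data: bytes) -> str:
    """Replacement decoder: any non-empty stream decodes to a single U+FFFD."""
    if label.strip().lower() in REPLACEMENT_LABELS:
        return "\ufffd" if data else ""
    raise NotImplementedError("other decoders are out of scope for this sketch")

def encode(label: str, text: str) -> bytes:
    """The replacement encoding has no encoder; UTF-8 is used for encoding."""
    if label.strip().lower() in REPLACEMENT_LABELS:
        return text.encode("utf-8")
    raise NotImplementedError("other encoders are out of scope for this sketch")

print(decode("UTF-7", b"+ADw-script+AD4-"))  # '\ufffd'
print(encode("UTF-7", "héllo"))              # b'h\xc3\xa9llo'
```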

Background: Blink began to use Compact Encoding Detector ( google/compact_enc_det ) when no encoding label is found (http, meta). When 7-bit encoding detection is on, it detects ISO-2022-{KR,CN}, HZ-GB AND UTF-7 in addition to ISO-2022-JP. 7-bit encoding detection is ON for ISO-2022-JP, but we want to suppress other 7-bit encodings. I think the best way to 'suppress' (unsupport) them is to turn the whole input to U+FFFD.

annevk commented 8 years ago

If that is compatible, sounds fine. I know @hsivonen wants to greatly extend the list of replacement labels with all kinds of legacy encodings that might cause problems. Main issue is doing the research to see whether it is safe.

hsivonen commented 8 years ago

Blink began to use Compact Encoding Detector ( google/compact_enc_det ) when no encoding label is found (http, meta).

Whoa! Do you mean Blink now uses more unspecified heuristics than before even for non-Japanese locales? Why more heuristics? Why without a spec? ಠ_ಠ

When 7-bit encoding detection is on, it detects ISO-2022-{KR,CN}, HZ-GB AND UTF-7 in addition to ISO-2022-JP. 7-bit encoding detection is ON for ISO-2022-JP, but we want to suppress other 7-bit encodings. I think the best way to 'suppress' (unsupport) them is to turn the whole input to U+FFFD.

It seems to me that it's bad for Blink to adopt, as a black box, a library that doesn't do what Blink needs and then tweak the output of the black box, as opposed to writing code that does what's needed in the first place.

As for the issue of mapping the label "utf-7" to the replacement encoding generally, I think it's a matter of determining if real Web content relies on the label "utf-7" being unrecognized so as to fall back on an ASCII-compatible encoding.

If Blink is willing to experiment with shipping a mapping from "utf-7" to replacement and assessing breakage, that would be cool.

hsivonen commented 8 years ago

Also, does existing Web content actually require ISO-2022-JP to be heuristically detected (as opposed to heuristically detecting between Shift_JIS and EUC-JP)? Wouldn't it be a security improvement not to detect ISO-2022-JP and not to let it be chosen from the menu (i.e. support it only when labeled)?

hsivonen commented 8 years ago

Also, letting a detector decide that input should be decoded as the replacement encoding is most likely a bad idea. Consider an unlabeled site whose markup template and site-provided content are all ASCII and that accepts comments. A user could cause denial of service by submitting a comment that looks like a UTF-7 or HZ byte sequence.
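
To make the scenario concrete, a short Python illustration (not Blink code): the attacker's comment is plain ASCII bytes on the wire, yet it is also a well-formed UTF-7 sequence of the kind a 7-bit detector keys on.

```python
# A user-submitted comment that is pure ASCII bytes on the wire...
comment = "Nice article! +ADw-script+AD4-alert(1)+ADw-/script+AD4-"
payload = comment.encode("ascii")

# ...but is also a well-formed UTF-7 sequence that decodes to markup:
print(payload.decode("utf-7"))  # Nice article! <script>alert(1)</script>

# If a detector were allowed to conclude "UTF-7" for the whole unlabeled page
# and that label mapped to the replacement encoding, an otherwise all-ASCII
# page containing this comment would render as U+FFFD: the denial of service
# described above.
```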

jungshik commented 8 years ago

Chromium got rid of the encoding menu completely. So the encoding cannot be chosen via the UI any more.

Blink did not add a new heuristic, but just replaced the ICU encoding detector with the Compact Encoding Detector. It only kicks in when there is no encoding label (meta / HTTP Content-Type charset).

annevk commented 8 years ago

@jungshik if the web requires such a detector, we should work towards standardizing it. Would Google be willing to work with us on that?

jungshik commented 8 years ago

FYI, CED is open-sourced and available in github. See https://github.com/google/compact_enc_det

BTW, Blink was changed recently to treat auto-detected (unlabelled) 7-bit encodings (other than ISO-2022-JP) as ASCII. We came across @hsivonen's scenario ( https://github.com/whatwg/encoding/issues/68#issuecomment-239099089 )
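
For illustration, a hedged sketch (Python; the actual Blink logic is C++ and the names here are hypothetical) of the post-detection adjustment described above, where auto-detected 7-bit guesses other than ISO-2022-JP are not honored:

```python
from typing import Optional

# Hypothetical post-processing of a sniffer result, modeled on the Blink
# behavior described above; the names and fallback choice are illustrative,
# not Blink APIs.

SUPPRESSED_7BIT_GUESSES = {"UTF-7", "HZ-GB-2312", "ISO-2022-KR", "ISO-2022-CN"}

def effective_encoding(detected: Optional[str],
                       fallback: str = "windows-1252") -> str:
    """Map a raw detector guess to the encoding actually used for decoding."""
    if detected is None:
        return fallback               # detection failed: TLD/locale default
    if detected in SUPPRESSED_7BIT_GUESSES:
        return "windows-1252"         # treat as ASCII-compatible instead
    return detected                   # ISO-2022-JP and 8-bit guesses stand

assert effective_encoding("UTF-7") == "windows-1252"
assert effective_encoding("ISO-2022-JP") == "ISO-2022-JP"
assert effective_encoding(None, fallback="Shift_JIS") == "Shift_JIS"
```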

jungshik commented 8 years ago

Does existing Web content actually require ISO-2022-JP to be heuristically detected? Wouldn't it be a security improvement not to detect ISO-2022-JP?

Both are good questions. I've just filed https://bugs.chromium.org/p/chromium/issues/detail?id=647582

annevk commented 8 years ago

And by "as ASCII" I guess you mean windows-1252?

hsivonen commented 8 years ago

@jungshik, I have some more questions:

(Also, "is open-sourced and available in github" doesn't really answer @annevk's question about willingness to work towards standardizing.)

JinsukKim commented 7 years ago

Realized these questions better be answered by the person who ported the new detector to Blink (which is me):

Did you explore guessing the encoding from the top-level domain name instead of guessing it from content?

The TLD is used to set the default encoding, which is chosen as the encoding of the document if auto-detection fails to work or doesn't get triggered for whatever reason. So it is involved in the overall decoding process, but not exactly in the detection step itself.
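
A minimal Python sketch of that fallback role, assuming a TLD-to-default-encoding table along the lines of the Chromium/Android resource mentioned later in the thread; the entries and function below are illustrative, not the actual Chromium data:

```python
# Illustrative TLD -> fallback-encoding table; the real mapping lives in a
# Chromium/Android resource and is larger. Per the description above, it is
# consulted only when detection fails or is not triggered.

TLD_DEFAULT_ENCODING = {
    "jp": "Shift_JIS",
    "ru": "windows-1251",
    "tw": "Big5",
    "hu": "ISO-8859-2",
    # TLDs with no obvious legacy default (e.g. .ge, .id) are omitted
}

def fallback_encoding(host: str) -> str:
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_DEFAULT_ENCODING.get(tld, "windows-1252")

print(fallback_encoding("example.jp"))   # Shift_JIS
print(fallback_encoding("example.xyz"))  # windows-1252 (generic default)
```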

How many bytes do you feed to the detector?

To be precise, that's outside the scope of the detector; it's rather a matter of how the Blink decoder is designed. It feeds the first chunk of the document received from the network, which in almost all cases is big enough for the detector to come up with the right guess.

What happens if the network stalls before that many bytes have been received? Do you wait or do you stop detecting based on a timer? (Timers have their own problems, and Firefox got rid of the timer in Firefox 4.) Once the detector has seen as many bytes as you usually feed it, do you continue feeding data to the detector to allow it to revise its guess? If yes, what do you do if the detector revises its guess?

As answered above, Blink is designed to feed the detector with the first chunk of the data only. After that, no more data is fed to it, so it never revises its guess either. There is a way to artificially force the first chunk to be small so that it contains no significant part of the document content but only tags and scripts that can be viewed as ASCII, which will trick the detector into thinking the document is plain ASCII. But that would rarely happen in real life. One possible bug I'm aware of is that a document may be written in plain ASCII in most of its text but have a couple of symbols in UTF-8, say, at the end of it. The detector guesses it is ASCII because it never sees the symbols if they are not in the first chunk. https://bugs.chromium.org/p/chromium/issues/detail?id=233002
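
A small Python sketch of that failure mode, with a toy `detect()` standing in for CED (not the real API), showing how the guess depends on what happens to be in the first chunk:

```python
# Toy stand-in for the detector; like Blink's integration, it only ever sees
# the first chunk handed to it.

def detect(first_chunk: bytes) -> str:
    try:
        first_chunk.decode("ascii")
        return "ASCII"               # Blink then decodes as windows-1252
    except UnicodeDecodeError:
        try:
            first_chunk.decode("utf-8")
            return "UTF-8"
        except UnicodeDecodeError:
            return "windows-1252"

# A mostly-ASCII document whose only non-ASCII bytes come near the end:
document = b"<p>" + b"x" * 70000 + "café".encode("utf-8") + b"</p>"

CHUNK = 65536  # whatever size the network/IO layer happens to deliver first
print(detect(document[:CHUNK]))  # "ASCII" -- the UTF-8 bytes arrive later
print(detect(document))          # "UTF-8" if the whole document were visible
```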

Other than ISO-2022-JP (per your previous comment) and, I'm extrapolating, EUC-JP, can the detector (as used in Blink) guess an encoding that wasn't previously the default for any Chrome localization? That is, does the detector make pages "work" when they previously didn't for any localization defaults?

Not sure if I got the question right - AFAICT, Blink respects the detected encoding and uses it for decoding documents regardless of the localization default.

Can the detector (as used in Blink) guess UTF-8? That is, can Blink's use of the detector inadvertently encourage the non-labeling of newly-created UTF-8 content?

Yes, it can also detect UTF-8. I wouldn't view it as 'inadvertently encouraging' non-labelling, though. Most modern web sites go with UTF-8 and label their documents well. I don't have stats on how much it actually contributes to them neglecting to label properly. The detector is more for legacy web sites.

hsivonen commented 7 years ago

Realized these questions better be answered by the person who ported the new detector to Blink

Thank you for the answer.

The TLD is used to set the default encoding, which is chosen as the encoding of the document if auto-detection fails to work or doesn't get triggered for whatever reason.

Have you estimated the success rate of looking at the TLD only versus looking at the first network buffer and falling back to looking at the TLD? How significant a win is it to look inside the first network buffer when you have the capability to look at the TLD?

Do you guess windows-1252 for the new TLDs, like Gecko does?

It feeds the first chunk of the document received from the network, which in almost all cases is big enough for the detector to come up with the right guess.

I'm quite unhappy about Blink reintroducing a dependency on network buffer boundaries in text/html handling without discussion at the WHATWG prior to making the change.

For historical context, prior to Firefox 4, text/html handling in Firefox was dependent on where in the input stream the buffer boundaries fell. This was bad, and getting rid of this, thanks to the insight from WebKit, was one of the accomplishments of the HTML parsing standardization effort at the WHATWG.

There is a way to artificially force the first chunk to be small so that it contains no significant part of the document content but only tags and scripts that can be viewed as ASCII, which will trick the detector into thinking the document is plain ASCII. But that would rarely happen in real life.

Unfortunately, the Web is vast enough for rare cases to be real.

One possible bug I'm aware of is that a document may be written in plain ASCII in most of its text but have a couple of symbols in UTF-8, say, at the end of it. The detector guesses it is ASCII because it never sees the symbols if they are not in the first chunk.

When the guess is "ASCII", does it result in a windows-1252 decoder being instantiated? Or the decoder being chosen based on the TLD? Or something else?

https://bugs.chromium.org/p/chromium/issues/detail?id=233002

Even if the HTML parser has only seen ASCII by the time it revises its encoding detection guess, the encoding may already have been inherited to CSS, JS and iframes, so revising the guess without causing a full reload seems like a bad idea.

Other than ISO-2022-JP (per your previous comment) and, I'm extrapolating, EUC-JP, can the detector (as used in Blink) guess an encoding that wasn't previously the default for any Chrome localization? That is, does the detector make pages "work" when they previously didn't for any localization defaults?

Not sure if I got the question right - AFAICT, Blink respects the detected encoding and uses it for decoding documents regardless of the localization default.

I meant that there is a set of encodings that were the default in some localization and a set of encodings that were not the default in any localization. When the guess is from the first set, the behavior that ensues matches the behavior of some pre-existing default configuration. When the guess is from the second set, a new Web-exposed behavior has been introduced. Apart from Japanese encodings, does Blink ever guess an encoding from the second set (e.g. ISO-8859-13)?

Yes, it can also detect UTF-8. I wouldn't view it as 'inadvertently encouraging' non-labelling, though.

The vast majority of Web authors have not read your comment here and don't know how Blink works. Since Web authors do what seems to work, they don't know that UTF-8 sniffing is unreliable until it fails them. As noted in your comment about the case where the first network buffer is all ASCII, UTF-8 sniffing in Blink is unreliable. Some Web authors are going to feel free not to label UTF-8 as long as they haven't experienced the failure yet (i.e. they only test in Blink and e.g. use a language that makes it probable for <title> to have some non-ASCII). This is bad.

(Detecting UTF-8 for file:-URL HTML by looking at the entire file is fine, IMO, and I want to implement that in Gecko in due course.)

JinsukKim commented 7 years ago

Please note that I'm not working on Blink. I ported CED as a drop-in replacement for the ICU encoding detector so that any Blink-based Web browser can benefit from it. I'm not entirely familiar with all the terms used in your questions.

Do you guess windows-1252 for the new TLDs, like Gecko does?

What are 'new' TLDs? What I can say is Blink makes use of a typical set of TLDs and mappings to default encodings, which can be found in the Chromium/Android resources. Mappings for countries whose legacy default encoding is not obvious (.ge for Georgian, .id for Indonesian, for instance) are not included.

When the guess is "ASCII", does it result in a windows-1252 decoder being instantiated? Or the decoder being chosen based on the TLD? Or something else?

Yes, it goes with windows-1252.

I meant that there is a set of encodings that were the default in some localization and a set of encodings that were not the default in any localization. When the guess is from the first set, the behavior that ensues matches the behavior of some pre-existing default configuration. When the guess is from the second set, a new Web-exposed behavior has been introduced. Apart from Japanese encodings, does Blink ever guess an encoding from the second set (e.g. ISO-8859-13)?

I don't know what belongs to the second set. Maybe you can see for yourself here what the detector can return: https://cs.chromium.org/chromium/src/third_party/ced/src/compact_enc_det/compact_enc_det_unittest.cc?l=4733

The vast majority of Web authors have not read your comment here and don't know how Blink works. Since Web authors do what seems to work, they don't know that UTF-8 sniffing is unreliable until it fails them. As noted in your comment about the case where the first network buffer is all ASCII, UTF-8 sniffing in Blink is unreliable. Some Web authors are going to feel free not to label UTF-8 as long as they haven't experienced the failure yet (i.e. they only test in Blink and e.g. use a language that makes it probable for <title> to have some non-ASCII). This is bad.

I can see your point. What would be your suggestion?

JinsukKim commented 7 years ago

I missed this:

Have you estimated the success rate of looking at the TLD only versus looking at the first network buffer and falling back to looking at the TLD? How significant a win is it to look inside the first network buffer when you have the capability to look at the TLD?

Unfortunately I don't have a good estimation on that.

hsivonen commented 7 years ago

What are 'new' TLDs?

I mean TLDs that are neither the ancient ones (.com, .org, .net, .edu, .gov, .mil) nor country TLDs. That is, TLDs like .mobi, .horse, .xyz, .goog, etc.

What I can say is Blink makes use of a typical set of TLDs and mappings to default encodings, which can be found in the Chromium/Android resources.

Where can I find these mappings? (I searched the Chromium codebase for Punycode country TLDs but didn't find this kind of mapping for those.)

When the guess is "ASCII", does it result in a windows-1252 decoder being instantiated? Or the decoder being chosen based on the TLD? Or something else?

Yes, it goes with windows-1252.

What kind of situation is the TLD used in, then?

I don't know what belongs to the second set.

The second set is:

- EUC-JP
- ISO-2022-JP
- IBM866
- ISO-8859-3
- ISO-8859-4
- ISO-8859-5
- ISO-8859-6
- ISO-8859-8
- ISO-8859-8-I
- ISO-8859-10
- ISO-8859-13
- ISO-8859-14
- ISO-8859-15
- ISO-8859-16
- KOI8-R
- KOI8-U
- macintosh
- x-mac-cyrillic
- x-user-defined
- gb18030

(In Gecko, the Japanese and Cyrillic encodings are potential detector outcomes, but my current belief (educated guess, not based on proper data) is that the Cyrillic detectors do more harm than good in Gecko, and I want to remove them. Also, ISO-2022-JP has a structure that's known-dangerous on a general level and shown-dangerous for other encodings with similar structure, and ISO-2022-JP is just waiting for someone to demonstrate an attack. In Gecko, the Japanese detector is enabled for the Japanese locale, the Russian detector is enabled for the Russian locale and the Ukrainian detector is enabled for the Ukrainian locale. Other localizations, including Cyrillic ones, ship with the detector off.)

Maybe you can see for yourself here what the detector can return: https://cs.chromium.org/chromium/src/third_party/ced/src/compact_enc_det/compact_enc_det_unittest.cc?l=4733

So the answer is that it can return non-Japanese and even non-Cyrillic guesses from the second set. :-(

I can see your point. What would be your suggestion?

Never guessing UTF-8.

Unfortunately I don't have a good estimation on that.

That seems like the fundamental question when it comes to assessing whether the detector is necessary complexity, assuming the absence of the manual override menu, or whether the detector adds unnecessary complexity to the Web Platform.

yuhong commented 7 years ago

BTW, ISO-2022-JP doesn't need to be "heuristically" detected. ISO 2022 generally uses special escape sequences that denote the encoding in use.

annevk commented 7 years ago

We've made some progress on this issue in https://bugs.chromium.org/p/chromium/issues/detail?id=691985. We haven't quite reached agreement yet, though, on what level of sniffing and which rules are necessary for everyone to implement.

hsivonen commented 5 years ago

It feeds the first chunk of the document received from the network

FWIW, this is *so* dependent on IO implementation details that the sniffing limit is different for file: URLs on Ubuntu/ext4 vs. Windows/NTFS. (64 KB vs. something less than 64 KB, respectively, in my testing.)

hsivonen commented 5 years ago

FYI, CED is open-sourced and available in github. See https://github.com/google/compact_enc_det

I did a brief evaluation of https://github.com/google/compact_enc_det to form an opinion on whether we should adopt it for Gecko. I feel very uneasy about adopting it, mainly because a) it seems bad for the Web Platform to depend on a bunch of mystery C++ as opposed to having a spec with multiple independent interoperable implementations, b) it indeed is mystery C++ in the sense that it doesn't have sufficient comments explaining its design or the rationale behind its design, and c) it has a bunch of non-browser-motivated code that'd be dead code among the mystery C++ code.

In case we want to develop a spec for sniffing an encoding from content, here are some notes.

It's clear that compact_enc_det has not been developed for the purpose that it is used for in Chromium. It looks like it's been developed for the Google search engine and possibly also for Gmail. As seen from this issue, this results in compact_enc_det doing things that don't really make sense in a Web browser, and then the output having to be corrected toward something that makes more sense in the browser context.

compact_enc_det does not make use of data tables and algorithms that an implementation of the Encoding Standard would necessarily include. If I was designing a detector, I'd eliminate a candidate encoding whenever decoding the input according to the Encoding Standard with that encoding would result in an error or in C1 controls (also probably half-width katakana, as that's a great way to decide that the encoding is not Shift_JIS).

compact_enc_det analyzes byte bigrams both for single-byte encodings and for legacy CJK encodings. In an implementation that assumes the presence of the Encoding Standard data and algorithms, it should be easy to distinguish between single-byte encodings and legacy CJK encodings by the process of elimination according to the above paragraph: if a legacy CJK encoding remains on the candidate list after seeing a couple of hundred non-ASCII bytes, chances are that the input is in a legacy CJK encoding, because it's so likely that single-byte-encoding content results in errors when decoded as a legacy CJK encoding. (Big5 and Shift_JIS are clearly structurally different from EUC-JP, EUC-KR and GBK. The decision among the EUC-based encodings, if the input is valid according to all of them, could be made as follows: if there's substantial kana according to the EUC-JP interpretation, it's EUC-JP. If the two-byte sequences stay mainly in the original KS X 1001 Hangul area, it's EUC-KR. Otherwise, it's GBK.)

compact_enc_det does not assign probabilities to bigrams directly. Instead, it takes the score for the first byte of an "aligned bigram" with the high bit XORed with the high bit of the second byte, and the score for the second byte of the "aligned bigram". If the combination of the highest four bits of the first byte and the highest four bits of the second byte yields a marker for a more precise lookup, then a score for the low 5 bits of the first byte combined with the low 5 bits of the second byte is taken. It's unclear why that score isn't used unconditionally. "Aligned" appears to mean aligned to the start of the non-ASCII run, which works out for legacy CJK, and the next bigram consists of the third and fourth byte of the run instead of the second and third byte of the run, which makes sense for legacy CJK but at least superficially looks like an odd choice for analyzing non-Latin single-byte encodings.

I am guessing that "compact" in the name comes from not having a table of ((256 * 256) / 4) * 3 entries (all byte pairs where at least one byte is non-ASCII) per encoding and instead having a 256-entry table indexed by the combination of four and four high bits to decide if a table of 1024 entries indexed by five and five low bits should be used.

There seems to be a capability for distinguishing between Western, Central European, and Baltic encodings using trigrams, but it's unclear if that code is actually turned on.
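
To make the elimination idea above concrete, a rough Python sketch, using Python's codecs as stand-ins for Encoding Standard decoders; the candidate list, C1 check and EUC disambiguation follow the outline in the comment, but all details are illustrative:

```python
# Rough sketch of elimination-based detection; Python codecs stand in for
# Encoding Standard decoders, and the thresholds/candidates are made up.

C1_CONTROLS = set(range(0x80, 0xA0))

def survives(candidate: str, data: bytes) -> bool:
    """Keep a candidate only if decoding succeeds and yields no C1 controls."""
    try:
        text = data.decode(candidate)
    except UnicodeDecodeError:
        return False
    return not any(ord(ch) in C1_CONTROLS for ch in text)

def guess(data: bytes) -> str:
    candidates = ["utf-8", "euc_jp", "euc_kr", "gbk", "shift_jis", "windows-1252"]
    remaining = [c for c in candidates if survives(c, data)]
    # Disambiguate the EUC family roughly as outlined above:
    if "euc_jp" in remaining:
        kana = sum(0x3040 <= ord(ch) <= 0x30FF for ch in data.decode("euc_jp"))
        if kana:
            return "euc_jp"
    if "euc_kr" in remaining:
        hangul = sum(0xAC00 <= ord(ch) <= 0xD7A3 for ch in data.decode("euc_kr"))
        if hangul:
            return "euc_kr"
    if "gbk" in remaining:
        return "gbk"
    return remaining[0] if remaining else "windows-1252"

print(guess("こんにちは、世界".encode("euc_jp")))  # euc_jp
print(guess("안녕하세요".encode("euc_kr")))        # euc_kr
```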