whatwg / encoding

Encoding Standard
https://encoding.spec.whatwg.org/

Big5 encoding mishandles some trailing bytes, with possible XSS #171

Open mihnita opened 5 years ago

mihnita commented 5 years ago

There are some byte sequences that are valid lead/trail pairs according to the description at https://encoding.spec.whatwg.org/#big5-decoder, but have no corresponding Unicode code point in the index-big5.txt mapping table.

In this case the first byte is converted to U+FFFD, but the second one is left "as is". In some cases that second byte can be a backslash (\, 5C), which can be used to "escape" the ending quote of a string in JavaScript, potentially enabling XSS exploits.

Example (attached): the 83 5C sequence, processed according to the algorithm at https://encoding.spec.whatwg.org/#big5-decoder.

The end result is a U+FFFD (from the error) followed by 5C (the trailing byte, "as is").

You can see this in the attached file. When opened in either Chrome or Firefox, the text is rendered as the Unicode REPLACEMENT CHARACTER (correct) followed by a backslash (incorrect).

This is a valid lead/trail byte sequence that should be replaced either by one single U+FFFD character or by two U+FFFD characters, whatever the policy is (I think the second case).

But the trailing byte should definitely not be left "as is".
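To make the current behavior concrete, here is a minimal sketch of the relevant Big5 decoder steps (my code, not spec text; the index lookup is stubbed out as an empty map, which matches index-big5.txt for pointer 342, the pointer 83 5C computes to; the special double-code-point pointers and end-of-stream handling are omitted):

```java
import java.util.HashMap;
import java.util.Map;

public class Big5DecodeSketch {
    // Stand-in for index-big5.txt: pointer 342 (from 0x83 0x5C) is
    // deliberately absent, as in the real index.
    static final Map<Integer, Integer> INDEX = new HashMap<>();

    // Simplified replacement-mode decode per the spec's Big5 decoder.
    static String decode(int[] bytes) {
        StringBuilder out = new StringBuilder();
        int lead = 0;
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i];
            if (lead != 0) {
                Integer cp = null;
                if ((b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE)) {
                    int offset = (b < 0x7F) ? 0x40 : 0x62;
                    int pointer = (lead - 0x81) * 157 + (b - offset);
                    cp = INDEX.get(pointer);
                }
                lead = 0;
                if (cp != null) {
                    out.appendCodePoint(cp);
                } else {
                    out.append('\uFFFD'); // error for the lead only...
                    if (b <= 0x7F) i--;   // ...then "prepend byte": the ASCII
                                          // trail is reprocessed on its own
                }
            } else if (b <= 0x7F) {
                out.append((char) b);
            } else if (b >= 0x81 && b <= 0xFE) {
                lead = b;
            } else {
                out.append('\uFFFD');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x83 0x5C: a valid pair with no mapping decodes to
        // U+FFFD followed by a bare backslash.
        System.out.println(decode(new int[]{0x83, 0x5C}));
    }
}
```

The `i--` branch is the step under discussion: because 0x5C is ASCII, it survives the error and re-enters the output stream verbatim.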

The possible exploit can use the trailing byte (which is a backslash) to escape the end of a string, for example. Checking the console in Firefox you will see the 'SyntaxError: "" string literal contains an unescaped line break' message. In Chrome the message is 'Uncaught SyntaxError: Invalid or unexpected token'.

I did not check, but this might also happen in other DBCS (Double Byte Character Sets) that have the second byte in the ASCII range (for instance in Shift JIS?).

big5.zip

annevk commented 5 years ago

Heya, welcome!

This is by design to avoid allowing attackers to use this to prevent ASCII delimiters from being seen.

I took some time to find the rationale to make sure we're all on the same page going forward:

mihnita commented 5 years ago

Thank you for the quick answer.

==== TLDR ====

The cited bugs are about 82 22, which is an invalid Shift JIS byte sequence. 83 5C is a valid Big5 sequence with no Unicode mapping.

Because 83 5C is a valid sequence but has no mapping, the whole sequence should be replaced with U+FFFD (one or two of them, TBD), not just the lead byte.

==== The long version ====

I have tried to carefully read the links provided, but I think this is kind of the opposite... The bugs are about 82 22, which is indeed an invalid byte sequence for Shift_JIS.

That handling is correct, because it is a 'valid lead' followed by an 'invalid trailing' byte:

If byte is in the range 0x40 to 0x7E, inclusive, or 0x80 to 0xFC, inclusive, set pointer to (lead − lead offset) × 188 + byte − offset.

0x22 is in neither the [0x40, 0x7E] nor the [0x80, 0xFC] range.

This is different: 83 5C is a valid Big5 sequence (valid lead + valid trailing), but it has no Unicode mapping, at least not in the table included with the spec (index-big5.txt).
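The contrast between the two range checks can be written down directly from the spec (a sketch; the method names are mine):

```java
public class TrailRanges {
    // Shift_JIS trail range, per the shift_jis decoder in the spec:
    // 0x40-0x7E or 0x80-0xFC.
    static boolean isShiftJisTrail(int b) {
        return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
    }

    // Big5 trail range, per the big5 decoder in the spec:
    // 0x40-0x7E or 0xA1-0xFE.
    static boolean isBig5Trail(int b) {
        return (b >= 0x40 && b <= 0x7E) || (b >= 0xA1 && b <= 0xFE);
    }

    public static void main(String[] args) {
        // 0x22 fails the Shift_JIS check, so 82 22 is structurally invalid;
        // 0x5C passes the Big5 check, so 83 5C is structurally valid.
        System.out.println(isShiftJisTrail(0x22)); // false
        System.out.println(isBig5Trail(0x5C));     // true
    }
}
```

So the 2011 bugs were about a sequence the decoder can reject outright, while 83 5C passes the structural check and only fails at the index lookup.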

This can lead to an exploit similar to the one quoted in the bug (from 2011):

a shift_jis lead byte 0x82 was used to “mask” a 0x22 trail byte in a JSON resource of which an attacker could control some field. The producer did not see the problem even though this is an illegal byte combination.

In this case 83 5C can be used to mask a 0x22 (double quote) coming AFTER the 5C.

One might not see the problem, because 83 5C is a valid Big5 byte sequence. So 83 5C 22 maps (sometimes :-) to U+F00E" (a PUA character followed by a double quote), but with some other mappings it is converted to U+FFFD\", which escapes the quote.

To make things more interesting, that sequence is actually mapped to a PUA character (U+F00E) in the Microsoft implementation (https://en.wikipedia.org/wiki/Code_page_950). Not really relevant to the bug...


Considering that there are many Big5 extensions (and tables), this is quite likely to happen (https://en.wikipedia.org/wiki/Big5). And most don't even have an IANA registration.

In fact, I discovered this exactly from client-side JSON parsing (using this algorithm) of data produced server side (using the Big5-HKSCS table by default).

None of these extensions are registered with IANA, so there is no standard way to communicate that information to another client.

annevk commented 5 years ago

Thanks for that analysis, I guess that is indeed a novel angle that I'm not sure was fully considered. That would affect most legacy encodings to some extent, including Shift_JIS I suspect (we tend to unwind for ASCII bytes as a general principle).

Since most implementations are now aligned I wonder to what extent they want to change this again.

At the very least we should add a warning somewhere, or maybe add a paragraph to the security considerations suggesting that if you're using a legacy encoding, you have to be sure it's identical to those defined here, and that otherwise you need to account for the difference in behavior being exploited.

mihnita commented 5 years ago

Since most implementations are now aligned I wonder to what extent they want to change this again. ... At the very least we should add a warning somewhere

If there is agreement that this is a possible security problem waiting to happen, then a fix is a lot better than a warning. Plus, it really "feels" incorrect: a valid lead-trailing byte sequence gets only half-converted to U+FFFD...

It is not an intrinsic problem with legacy encodings, it is a problem with the algorithm as described in this spec.

Let's say we add a warning. What is an implementer supposed to do to mitigate the security problem? They can either implement some fix, making them non-WHATWG-compliant, or they can ignore it. Neither option sounds like a good one.

"you have to be sure that it's identical to those defined": there is no way to do that. You get some content from somewhere, you don't even control it, and it is tagged as Big 5. There is no way to tag it as some variant of Big 5, because none of them is IANA registered. The only thing one can do is detect the error and convert the invalid sequences to FFFD, which is what this spec is supposed to detail.


Would it help if I get some "chime in" from someone in the Chrome team saying they would implement this if the spec changes?

Thank you very much, Mihai

annevk commented 5 years ago

The warning would be for content producers, who also must use UTF-8 per the standard already, so they are already in violation of sorts.

If Chrome were willing to drive this work (including the change proposals and test changes required) that would certainly help, but we'd need one more implementer (Apple or Mozilla) to also be on board.

jungshik commented 5 years ago

I'd rather not tweak legacy encoding implementations any further in Chromium's copy of ICU unless it's absolutely necessary.

hsivonen commented 5 years ago

Thanks for that analysis, I guess that is indeed a novel angle that I'm not sure was fully considered.

While a novel angle for the Encoding Standard, my understanding is that a similar structural concern kept Shift_JIS unsupported as a system encoding in Fedora, so that the transition was directly from EUC-JP to UTF-8. (I don't know how DOS and pre-NT Windows dealt internally with 0x5C in two-byte characters in file paths under the Japanese locale. I also don't know if Fedora supported Big5 as a system encoding previously.)

In fact, I discovered this exactly from a client side JSON parsing (using this algorithm), with data produced server side (using the Big 5-HKSCS table by default).

Interesting both because JSON is not allowed to be Big5-encoded (irrelevant as far as providing security even for people who don't conform with specs goes) and because the spec is trying to be able to decode HKSCS.

While a backslash may have different implications than other ASCII-range bytes as the second byte of an unmappable sequence, I'm reluctant to make a special case for 0x5C in the general ASCII unmasking policy with the level of evidence offered so far.

Before debating special-casing 0x5C as the trail byte when an index lookup fails, I'd be interested in learning what Big5-HKSCS generator can generate byte pairs that the index in the Encoding Standard does not have mappings for. We have mappings for the 0x5C trail byte for every lead byte from 0x87 onwards. We have no mappings for any byte pair whose lead is in the range 0x81 to 0x86, inclusive. What software produced the 0x83, 0x5C byte sequence, and what Big5 extension does it belong to? (CC @foolip)
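For reference, the pointer range covered by those leads falls out of the Big5 pointer formula in the spec (the arithmetic below is mine, a sketch):

```java
public class Big5PointerRange {
    // pointer = (lead - 0x81) * 157 + (trail - offset), where offset is
    // 0x40 for trails 0x40-0x7E and 0x62 for trails 0xA1-0xFE.
    static int pointer(int lead, int trail) {
        int offset = (trail < 0x7F) ? 0x40 : 0x62;
        return (lead - 0x81) * 157 + (trail - offset);
    }

    public static void main(String[] args) {
        // Leads 0x81-0x86 cover pointers 0 through 941; per the comment
        // above, index-big5.txt has no entries for those leads, so every
        // such pair is structurally valid but unmapped.
        System.out.println(pointer(0x81, 0x40)); // 0
        System.out.println(pointer(0x86, 0xFE)); // 941
        System.out.println(pointer(0x83, 0x5C)); // 342, the reported pair
    }
}
```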

you have to be sure that it's identical to those defined

That's relatively useless advice especially in the case of Big5, which, as defined in the spec, is a WHATWG synthesis that is unlikely to be an exact match for any legacy implementation.

hsivonen commented 5 years ago

The old Big5 study emails suggest we have evidence of lead bytes in the 0x81 to 0x86 (inclusive) range in pages whose URLs no longer seem to work. How did those items of evidence not result in mappings for lead bytes in that range?

mihnita commented 5 years ago

The warning would be for content producers, who also must use UTF-8 per the standard already, so they are already in violation of sorts.

The trouble here is that XSS attacks are not coming from "friendly" producers. If we say "don't do this, it is a possible vulnerability", we are just inviting bad actors to abuse it.

I'm reluctant to make a special case for 0x5C in the the general ASCII unmasking policy

I am not advocating special treatment for 5C. I think that a valid lead byte + trailing byte sequence, recognized as such, should be treated as one "unit": either converted to a Unicode character (if there is a mapping), or converted to U+FFFD (if there is not). But not convert half (to U+FFFD) and keep the other half.

I'd be interested in learning what Big5-HKSCS generator can generate byte pairs that the index in the Encoding Standard does not have mappings for

Anything using Windows code page 950, so probably most (all?) Windows APIs. I can run some tests, if you want.

And also ICU:

  @Test
  public void testBig5() {
    byte[] bytes = { (byte) 0x83, (byte) 0x5C };
    String charsetName = "Big5";

    // The JDK's built-in Big5
    java.nio.charset.Charset cs = java.nio.charset.Charset.forName(charsetName);
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));

    // The JDK's cp950
    cs = java.nio.charset.Charset.forName("cp950");
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));

    // ICU4J's Big5
    cs = com.ibm.icu.charset.CharsetICU.forNameICU(charsetName);
    System.out.println(cs + " : " + cs.aliases());
    System.out.println(hex(new String(bytes, cs)));
  }

The code above produces:

Big5 : [csBig5]
 FFFD 005C
x-IBM950 : [cp950, ibm950, 950, ibm-950]
 F00E
Big5 : [windows-950, csBig5, x-big5, Big5, x-windows-950, windows-950-2000, ms950]
 F00E

(the hex method does a simple char by char hex dump, I did not include it here)

We notice that Java considers Big5 and cp950 to be different charsets, but ICU4J considers them aliases.


I totally understand the reluctance to change a spec, and to change implementations. We would all like to produce code / specs that are perfect, bug free, and never need updating. But this is the reality of things: we find problems, we should fix them.

I would really hate to see some exploit based on something like this a few months down the line... Double byte character sets (DBCS) have been a source of bugs and exploits for a long time.

mihnita commented 5 years ago

If it helps you can also diff the mapping tables from the Unicode site: http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT

hsivonen commented 5 years ago

The code above produces:

Interesting. So it maps to the PUA. Maybe we should map byte pairs with leads in the 0x81 to 0x86 (inclusive) range to the PUA. It seems worthwhile to check what Microsoft does for codepage 950.

If it helps you can also diff the mapping tables from the Unicode site: http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT

These tables have mappings only starting from lead byte 0xA1 upwards.

hsivonen commented 5 years ago

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt says "EUDC" for those lead bytes (but also for 0x87 to 0xA0, inclusive).

The Unicode.org vendor repo does not appear to have a Microsoft submission for the Hong Kong "951" shadow codepage.

foolip commented 5 years ago

Before debating special-casing 0x5C as the trail byte when an index lookup fails, I'd be interested in learning what Big5-HKSCS generator can generate byte pairs that the index in the Encoding Standard does not have mappings for. We have mappings for the 0x5C trail byte for every lead byte from 0x87 onwards. We have no mappings for any byte pair, whose lead is in the range 0x81 to 0x86, inclusive. What software produced the 0x83, 0x5C byte sequence and what Big5 extension does it belong to? (CC @foolip)

I'm afraid that I don't know much about the software which produced the content that @annevk, I and others analyzed back in the day. I feel like the best way to answer questions about what the spec should say now would be to perform a new scrape looking for certain patterns, perhaps starting from the list of URLs in httparchive.

I'd rather not tweak legacy encoding implementations any further in Chromium's copy of ICU unless it's absolutely necessary.

@jungshik are we now in alignment with the Encodings spec for Big5? If not, do you know where the differences are?

mihnita commented 5 years ago

The WindowsBestFit table is not relevant here.

It is only used to convert from Unicode to a charset. It is used by default by WideCharToMultiByte (when the WC_NO_BEST_FIT_CHARS is not set) (https://docs.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-widechartomultibyte)

There is no such flag in MultiByteToWideChar

hsivonen commented 5 years ago

That seems relevant in terms of identifying an encoder that can produce byte sequences below the HKSCS area.

hsivonen commented 5 years ago

Some findings about WideCharToMultiByte with flags set to zero (i.e. "best fit" not forbidden):

950/Big5: The U+FFFD cells that can be seen in the Big5 visualization are filled with PUA code points. U+0080 maps to 0x80 and U+F8F8 (PUA) maps to 0xFF.

949/EUC-KR: U+0080 maps to 0x80. Additionally, the row immediately above Hanja and the row immediately below Hanja (starting from 0xA1 trail, i.e. really just the part of the row where the trail is part of the original EUC trail range) is filled with PUA code points. Since the trail is always non-ASCII, this doesn't pose a security risk.

932/Shift_JIS: There are 4 PUA code points that map to single non-ASCII byte each. We already knew about this. Since these are non-ASCII, they don't pose a security risk.

hsivonen commented 5 years ago

Random thought: eudcedit.exe offers to start from the start of the Unicode Private Use Area. In case of Shift_JIS and gbk, the Encoding Standard is well compatible with this. (Let's ignore for the moment if it's a good thing for text interchange.) The 949 PUA mappings that we don't have would also start from the start of the Unicode Private Use Area. However, in the case of Big5, HKSCS compat makes the start of the Private Use Area unavailable.

Is there a good reason why the Encoding Standard doesn't have the EUDC bits for EUC-KR even though it explicitly has them for Shift_JIS and as a side effect of gb18030 support for gbk?

hsivonen commented 5 years ago

(This is not yet a suggestion to make our EUC-KR PUA-consistent with 949. At present I'm just trying to understand why things are the way they are.)

hsivonen commented 5 years ago

It seems that not all PUA mappings in Windows legacy code pages are strictly considered part of EUDC.

hsivonen commented 5 years ago

moztw.org maintains a repository of information about Big5. Checking the UAO tables there, it's worth noting that UAO had mappings for the byte pairs whose lead is in the range 0x81 to 0x86 (inclusive), so even though the research that led to the current state of the spec concluded that HKSCS was used on the Web and UAO pretty much not (and, therefore, the spec went with HKSCS instead of UAO), a UAO encoder could produce byte pairs whose lead is in the range 0x81 to 0x86 (inclusive). Both Python and Node have quite recent UAO packages.

jungshik commented 5 years ago

I'm against any more tweaking in general, and extremely strongly against introducing any new decoding mappings to the PUA (e.g. for EUC-KR) in Unicode.

hsivonen commented 5 years ago

I'm against any more tweaking in general

The more I've examined the issue reported here, the more convinced I am that it is a real concern.

So at the very least I think we should tweak Big5 decode such that Big5 leads in the 0x81 to 0x86, inclusive, range consume the next byte, too, if it is in the Big5 trail range. I don't yet have an opinion on whether the byte pairs should result in error (U+FFFD) or in 950-consistent PUA code points. (Mixing UAO with HKSCS probably would hurt more than it would help.)

Since JIS X 0208 has undergone less extension and the extensions have happened ages ago, I think we probably don't need to change Shift_JIS analogously for trails that are in the trail range but unmapped, because there probably aren't any Shift_JIS-ish encoders around that could emit leads 0x82, 0x85, 0x86, 0x88, 0xEB, 0xEF, 0xEC or 0xFC with trail 0x5C for some Unicode input.

The 0x5C issue is moot for EUC-KR and gbk (but for opposite reasons).
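A sketch of what the error path would look like under that tweak (my wording and naming, not proposed spec text; whether the consumed pair should become U+FFFD or a 950-consistent PUA code point is left open above):

```java
public class ProposedBig5ErrorPath {
    // Proposed handling when a valid Big5 lead has been seen but the
    // lead/trail pair has no index entry: if the next byte is in the
    // Big5 trail range, consume BOTH bytes and emit a single U+FFFD,
    // instead of re-emitting an ASCII trail verbatim.
    static String handleUnmappedPair(int lead, int trail) {
        boolean inTrailRange = (trail >= 0x40 && trail <= 0x7E)
                            || (trail >= 0xA1 && trail <= 0xFE);
        if (inTrailRange) {
            return "\uFFFD";                    // trail never resurfaces
        }
        if (trail <= 0x7F) {
            return "\uFFFD" + (char) trail;     // current spec behavior:
                                                // ASCII byte reprocessed alone
        }
        return "\uFFFD";                        // non-ASCII, non-trail byte
                                                // consumed with the error
    }

    public static void main(String[] args) {
        // 0x83 0x5C collapses to one U+FFFD; no stray backslash.
        System.out.println(handleUnmappedPair(0x83, 0x5C));
    }
}
```

The key difference from the current spec is only the first branch: the structurally valid trail is consumed together with the lead.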

and extremely strongly against introducing any new decoding mapping to PUA (e.g for EUC-KR) in the Unicode.

If we don't change EUC-KR, Shift_JIS and gbk to match the corresponding Windows code pages, I think we should at the very least document how and why encodings that match Windows code pages for non-PUA, non-U+0080 code points don't match them for the PUA and U+0080.

hsivonen commented 5 years ago

Since JIS X 0208 has undergone less extension and the extensions have happened ages ago, I think we probably don't need to change Shift_JIS analogously for trails that are in the trail range but unmapped,

For completeness, it seems worthwhile to mention: after a byte in the Shift_JIS lead range, IE and pre-Chromium Edge consume the following byte if it is in the Shift_JIS trail range regardless of its mappability, so if we were to change Shift_JIS, we'd be changing it to an IE/Edge-consistent state security-wise (even if not in terms of the character used for replacement of unmappables).

hsivonen commented 3 years ago

I think we should do this at least for Big5, because it would better protect against Java and Windows (kernel32.dll) -based generators getting XSSed.

@achristensen07, do you have an opinion?

hsivonen commented 3 years ago

Note that UTC disagreed with the Encoding Standard on this point: https://www.unicode.org/L2/L2019/19192-review-docs.pdf

I.e. making the requested change here would align with the UTC position.

annevk commented 3 years ago

Their rationale cites ISO-2022-JP, which seems like a red herring.

achristensen07 commented 3 years ago

If I've understood this correctly, this seems to only change behavior when decoding "invalid" input. I would like to believe that most content comes from valid encoders, so I would hope compatibility issues from such a change would be minimal. I don't think I have a strong opinion.

hsivonen commented 3 years ago

Their rationale cites ISO-2022-JP, which seems like a red herring.

The paragraphs referring to VDCs are directly relevant to this issue.

hsivonen commented 3 years ago

We have telemetry from Firefox 86 that is best explained by the hypothesis that users in Taiwan and Hong Kong encounter unlabeled Big5 containing byte sequences that the Encoding Standard considers unmapped. (I'm working on getting the numbers that I'm looking at OKed for publication.)

Of the byte pairs in the Big5 range that aren't mapped by the Encoding Standard, only the ones with lead byte 0xA3 are unmapped in Internet Explorer. The rest map to the Private Use Area.

In that demo, the middle column labeled "Bytes" can be used to verify the decoding in IE. The rightmost column labeled "PUA NCR" contains a cross-browser numeric character reference for what IE decodes the bytes to. This can be used for probing fonts for glyph assignments in non-IE browsers. (The page is declared as zh-HK.)

hsivonen commented 3 years ago

AR PL UMing HK on Ubuntu 20.04 provides glyphs for U+EEFF and U+F7EF through U+F816, inclusive.

hsivonen commented 3 years ago

For clarity, the telemetry measures Text Encoding menu usage and not byte sequences. The connection to byte sequences is my hypothesis from menu usage.

hsivonen commented 3 years ago

On Windows 10, I see Arial provides an empty glyph for U+F301 and MingLiU_HKSCS provides ideographic glyphs for U+ED2B through U+EE9A, inclusive. (The glyphs in AR PL UMing HK were not ideographic.)

hsivonen commented 3 years ago

[Screenshots: glyphs in MingLiU_HKSCS and AR PL UMing HK]

hsivonen commented 3 years ago

(I'm working on getting the numbers that I'm looking at OKed for publication.)

Published

hsivonen commented 3 years ago

We have telemetry from Firefox 86 that is best explained by the hypothesis that users in Taiwan and Hong Kong encounter unlabeled Big5 containing byte sequences that the Encoding Standard considers unmapped.

I forgot to report back the progress here.

Firefox 89 shipped a version of chardetng that doesn't reject the Big5ness of the input if it contains structurally-valid but unmapped Big5 byte pairs. Telemetry changes don't look like they were caused by this change and instead look like they were dominated by removal of one of the encoding menu UI entry points, but the analysis of the telemetry data was structured to answer a different question and not this one.