HtmlAttr, Js, and Css escapers fail for characters outside the BMP

zerocrates commented 9 years ago

For an example, take the character "🍥" (FISH CAKE WITH SWIRL DESIGN, U+1F365)

escapeHtml, since it uses htmlspecialchars, just passes this through unchanged.

On the other hand, escapeHtmlAttr internally tries to convert it to UTF-16 big-endian, resulting in the sequence D83CDF65 and a final output of &#xD83CDF65;. That's not a valid HTML character reference for anything. The correct character reference in this case would be &#x1F365;.

What appears to be happening is that the escaper code assumes that converting to UTF-16 will always yield a direct codepoint value, as required for an HTML character reference, but that's not correct. Characters beyond the Basic Multilingual Plane are encoded in UTF-16 as a surrogate pair. Printing that pair as if it were a single hex number is how you get the crazy 8-hex-digit value instead of the appropriate 5 digits for the "fish cake" example.

It's possible to instead convert the input into UTF-32BE, which doesn't use surrogate pairs for any Unicode codepoint. The rest of the logic used by escapeHtmlAttr should then work fine.
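
To make the difference concrete, here is a minimal sketch in Python (not the library's PHP). The helper name `char_ref` is hypothetical, and it only mimics the "convert, hex-dump, wrap in `&#x...;`" step described above; it is not the escaper's actual code.

```python
def char_ref(ch: str, encoding: str) -> str:
    """Build an HTML hex character reference from the hex dump of `ch`
    in the given encoding (hypothetical helper, for illustration only)."""
    hex_digits = ch.encode(encoding).hex().upper()
    # Parsing and re-printing the hex strips leading zeros, e.g. 0001F365 -> 1F365.
    return "&#x{:X};".format(int(hex_digits, 16))


fish_cake = "\U0001F365"  # FISH CAKE WITH SWIRL DESIGN

# UTF-16BE yields a surrogate pair, so the hex dump is not a codepoint.
broken = char_ref(fish_cake, "utf-16-be")   # '&#xD83CDF65;'

# UTF-32BE stores the codepoint directly, so the same logic works.
correct = char_ref(fish_cake, "utf-32-be")  # '&#x1F365;'

print(broken, correct)
```

For BMP characters both encodings give the same reference, which is presumably why the problem only shows up with supplementary characters.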

zerocrates commented 9 years ago

I only personally noted this problem with escapeHtmlAttr, but it looks like the JS and CSS escapers use similar algorithms and would have the same problem.

zerocrates commented 9 years ago

Yes, escapeJs and escapeCss also have the same problem. Using the same example as above:

escapeJs returns '\uD83CDF65' instead of the correct '\uD83C\uDF65'
escapeCss returns '\D83CDF65 ' instead of the correct '\1F365 '
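
The two target formats differ: JavaScript's `\uXXXX` escape takes UTF-16 code units, so a supplementary character needs two escapes, while a CSS escape takes the codepoint itself. A rough Python sketch of both follows. The function names are hypothetical, and it is simplified to always emit `\uXXXX` for JavaScript, which may differ from what the real escaper does for low codepoints.

```python
def js_escape(ch: str) -> str:
    """Escape one character as JavaScript \\uXXXX, using a surrogate
    pair for codepoints above U+FFFF (illustrative sketch)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return "\\u{:04X}".format(cp)
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return "\\u{:04X}\\u{:04X}".format(high, low)


def css_escape(ch: str) -> str:
    """Escape one character as a CSS hex escape: backslash, codepoint
    in hex, and a terminating space (illustrative sketch)."""
    return "\\{:X} ".format(ord(ch))


fish_cake = "\U0001F365"
print(js_escape(fish_cake))   # \uD83C\uDF65
print(css_escape(fish_cake))  # \1F365 (followed by a space)
```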

marc-mabe commented 9 years ago

@zerocrates Doesn't UTF-32 have the same or a similar issue with combining characters (https://en.wikipedia.org/wiki/Combining_character), like all other Unicode encodings? An encoding describes how a Unicode code point is represented, not how a character is represented.

zerocrates commented 9 years ago

You're right to say that UTF-32 and UTF-16 don't treat combining characters differently, but that's not the basis of the problem here.

The problem here is with supplementary characters (those above U+FFFF), not combining characters. For supplementary characters, UTF-16 uses a surrogate pair to represent a single codepoint, while UTF-32 does not.
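
One way to see the distinction is to count code units per codepoint. This is a small Python sketch with hypothetical helper names, not anything from the escaper itself.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units UTF-16 needs for `text`."""
    return len(text.encode("utf-16-be")) // 2


def utf32_code_units(text: str) -> int:
    """Number of 32-bit code units UTF-32 needs for `text`."""
    return len(text.encode("utf-32-be")) // 4


fish_cake = "\U0001F365"    # supplementary character, above U+FFFF
combining_acute = "\u0301"  # combining character, inside the BMP

print(utf16_code_units(fish_cake), utf32_code_units(fish_cake))              # 2 1
print(utf16_code_units(combining_acute), utf32_code_units(combining_acute))  # 1 1
```

The combining accent is a single code unit in both encodings, so the surrogate-pair problem does not apply to it.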

zerocrates commented 9 years ago

Just for confirmation purposes I tried out the escapers with a simple combining-character example and they seem to all be fine. When you have input with a regular ASCII "e" followed by a combining accent, you get that same sequence back out from the escaper, the "e" untouched followed by the escaped combining character.

You could use Normalizer to apply the NFC algorithm and guarantee precomposed output, but I think that would be unexpected and it would also mean requiring the intl extension for the escapers to work, which seems like a bridge too far. Users who need or want normalization can still use Normalizer themselves on the input.
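
For readers unfamiliar with NFC, this is roughly what normalizing the input beforehand would do. The sketch uses Python's standard `unicodedata` module as a stand-in for PHP's Normalizer; the `code_points` helper is hypothetical.

```python
import unicodedata


def code_points(text: str) -> list:
    """List the codepoints of `text` in U+XXXX notation."""
    return ["U+{:04X}".format(ord(c)) for c in text]


decomposed = "e\u0301"  # ASCII "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(code_points(decomposed))  # ['U+0065', 'U+0301']
print(code_points(composed))    # ['U+00E9']
```

Either form escapes correctly codepoint by codepoint, which is why normalization can stay the caller's choice.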

Just correctly escaping codepoints seems like the proper focus for the escapers, and that's what this issue and my pull request aim at.

zerocrates commented 8 years ago

I'd appreciate some response on this issue and/or the associated PR.

This is a pretty serious issue for anybody using emoji or many less-common CJK characters. It's also not something a user of the framework can easily work around, because the misbehaving escapers are used in other view helpers (in particular, escapeHtmlAttr is used all over the place).

roelvanduijnhoven commented 8 years ago

Just learned that ZF2 out of the box does not work well with emoji! Zend\Form fails to properly show them, indeed due to escapeHtmlAttr.

The underlying htmlAttrMatcher uses ord to check each character's ASCII value. From what I read, ord is in no way able to handle multibyte characters and is thus not able to parse UTF-8.

Thus escaping UTF-8 strings is bugged. Seems like a serious issue. I have too little knowledge to contribute, however.

zendframework / zend-escaper

HtmlAttr, Js, and Css escapers fail for characters outside the BMP #2