Supplementary chars (surrogate pairs) are broken or stripped

y1z2g3 / owaspantisamy

Automatically exported from code.google.com/p/owaspantisamy


Supplementary chars (surrogate pairs) are broken or stripped #123

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Go to http://www.antisamysmoketest.com/ and enter the text a&#131072;z
2. Observe the results:
SAX: az
DOM: a&#x0;z

What is the expected output? What do you see instead?
The output should be the same as the input (just as submitting "abc" should 
return "abc"). Instead, the second character is either broken (when using DOM) 
or stripped entirely (when using SAX).

What version of the product are you using? On what operating system?
AntiSamy 1.3, platform independent.

Please provide any additional information below.
I've seen this issue discussed at 
https://lists.owasp.org/pipermail/owasp-antisamy/2011-August/000423.html

Original issue reported on code.google.com by danr...@gmail.com on 6 Jan 2012 at 5:37

GoogleCodeExporter commented 9 years ago

I retested against AntiSamy 1.4.4 and observed that the behavior differs 
between the DOM and SAX implementations.

With the DOM implementation:
- If the input contains the actual character (coded in Java as \ud840\udc3c), 
the character is truncated as described above (the output is a single 
character, \ud840). This is incorrect behavior.

With the SAX implementation:
- If the input contains the actual character (coded in Java as \ud840\udc3c), 
the output contains the numeric character reference for that character 
(U+2003C, "𠀼"). This is correct behavior.
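
For reference, here is a minimal sketch of how the pair \ud840\udc3c maps to a single code point and to a numeric character reference. The class and method names are hypothetical, and this is not AntiSamy's serializer; it only illustrates code-point-aware iteration, assuming decimal references are wanted.

```java
final class SurrogatePairDemo {
    /** Replaces each supplementary code point with a decimal numeric character reference. */
    static String escapeSupplementary(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            // codePointAt combines a valid surrogate pair into one int.
            int cp = input.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by 2 code units for a pair, 1 otherwise.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a\ud840\udc3cz";
        System.out.println(input.length());                            // 4 (UTF-16 code units)
        System.out.println(input.codePointCount(0, input.length()));   // 3 (code points)
        System.out.println(Integer.toHexString(input.codePointAt(1))); // 2003c
        System.out.println(escapeSupplementary(input));                // a&#131132;z
    }
}
```

Code that walks the string one char at a time instead sees \ud840 and \udc3c as two separate units, which is how a lone \ud840 can end up in the output.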

In both cases, if the input contains a numeric character reference (either 
decimal "&#131132;" or hex "&#x2003C;"), the output is "<". This is incorrect. I 
believe the root cause is probably org.cyberneko.html.HTMLScanner line 1454, 
which narrows the code point to a 16-bit char: 
str.append((char)value);
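
The effect of that cast can be shown in isolation. The sketch below is not the NekoHTML source or its eventual patch; the class and method names are hypothetical, and it only assumes that `value` holds the decoded code point as an `int`.

```java
final class NumericReferenceDemo {
    /** Mirrors the suspected bug: the cast keeps only the low 16 bits of the code point. */
    static String appendWithCast(int value) {
        StringBuilder str = new StringBuilder();
        str.append((char) value);
        return str.toString();
    }

    /** Code-point-aware alternative: writes a surrogate pair for values above 0xFFFF. */
    static String appendCodePoint(int value) {
        StringBuilder str = new StringBuilder();
        // Throws IllegalArgumentException if value is not a valid Unicode code point.
        str.appendCodePoint(value);
        return str.toString();
    }

    public static void main(String[] args) {
        System.out.println(appendWithCast(0x2003C));                         // <
        System.out.println((int) appendWithCast(0x20000).charAt(0));         // 0
        System.out.println(appendCodePoint(0x2003C).equals("\ud840\udc3c")); // true
    }
}
```

0x2003C truncates to 0x003C, which is "<", and 0x20000 (decimal 131072, from the original report) truncates to 0x0000, which matches the a&#x0;z seen with DOM.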

Original comment by danr...@gmail.com on 30 Jan 2012 at 9:41

GoogleCodeExporter commented 9 years ago

Any progress on this issue?

I believe stripNonValidXMLCharacters is definitely broken:

http://stackoverflow.com/questions/6893749/detecting-high-surrogates-in-a-string-using-regular-expressions
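
As a point of comparison, a filter that works on code points rather than on individual chars avoids the surrogate problem entirely. This is a sketch only, not AntiSamy's stripNonValidXMLCharacters; the class and method names are hypothetical, and it assumes the goal is to keep exactly the characters allowed by the XML 1.0 Char production.

```java
final class XmlCharFilter {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops invalid characters while keeping well-formed surrogate pairs intact. */
    static String stripInvalid(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields one int per valid pair; an unpaired surrogate
        // is yielded as itself (0xD800..0xDFFF) and is rejected by the filter.
        input.codePoints()
             .filter(XmlCharFilter::isValidXmlChar)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String kept = stripInvalid("a\ud840\udc3cz");
        System.out.println(kept.equals("a\ud840\udc3cz")); // true
        System.out.println(stripInvalid("a\ud840z"));      // az
    }
}
```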

Original comment by Shchekl...@gmail.com on 13 Jun 2012 at 2:18

GoogleCodeExporter commented 9 years ago

With AntiSamy 1.5.1, the actual character appears to work, with both DOM and 
SAX. Numeric character references still appear to be broken (with both DOM and 
SAX), but I think it's a defect in NekoHTML. I've filed a bug against NekoHTML: 
https://sourceforge.net/tracker/index.php?func=detail&aid=3609978&group_id=195122&atid=952178#

Original comment by danr...@gmail.com on 4 Apr 2013 at 6:38

GoogleCodeExporter commented 9 years ago

NekoHTML has accepted my patch, and released version 1.9.19. Once AntiSamy is 
validated against NekoHTML 1.9.19, I think you can close this bug.

Original comment by danr...@gmail.com on 30 Oct 2013 at 4:27