Open GoogleCodeExporter opened 9 years ago
I retested against AntiSamy 1.4.4, and observed that the behavior is different
between the DOM and SAX implementation.
With the DOM implementation:
- If the input contains the actual character (coded in Java as \ud840\udc3c),
the character is truncated as described above (the output is a single
character, \ud840). This is incorrect behavior.
With the SAX implementation:
- If the input contains the actual character (coded in Java as \ud840\udc3c),
the output contains the numeric character reference for U+2003C. This is correct
behavior.
In both cases, if the input contains a numeric character reference (either
decimal "&#131132;" or hex "&#x2003C;"), the output is "<". This is incorrect. I believe
the root cause is probably org.cyberneko.html.HTMLScanner line 1454:
str.append((char)value);
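
To illustrate why that cast would produce "<": narrowing an int to a Java char keeps only the low 16 bits, so code point 0x2003C (decimal 131132) becomes 0x003C. The sketch below is not the NekoHTML source or the eventual patch. The class and method names are hypothetical, and StringBuilder stands in for whatever buffer type HTMLScanner actually uses.

```java
final class CharRefDecodeDemo {
    // Mirrors the narrowing cast quoted above: only the low 16 bits survive.
    static String decodeWithCast(int value) {
        StringBuilder str = new StringBuilder();
        str.append((char) value);
        return str.toString();
    }

    // Code-point-aware alternative: appendCodePoint emits a surrogate pair
    // for values above U+FFFF instead of truncating them.
    static String decodeWithCodePoint(int value) {
        StringBuilder str = new StringBuilder();
        str.appendCodePoint(value);
        return str.toString();
    }

    public static void main(String[] args) {
        int value = 0x2003C; // the same code point as decimal 131132
        System.out.println(decodeWithCast(value)); // prints "<"
        // prints "true": the result is the surrogate pair from this report
        System.out.println(decodeWithCodePoint(value).equals("\ud840\udc3c"));
    }
}
```

Any fix along these lines would also need to decide what to do with values above 0x10FFFF, which appendCodePoint rejects with an IllegalArgumentException.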
Original comment by danr...@gmail.com
on 30 Jan 2012 at 9:41
Any progress on this issue?
I believe stripNonValidXMLCharacters is definitely broken:
http://stackoverflow.com/questions/6893749/detecting-high-surrogates-in-a-string-using-regular-expressions
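
For comparison, here is a minimal sketch of a filter that walks the string by code point instead of by UTF-16 unit, so a valid surrogate pair is kept whole and only lone surrogates are dropped. This is not AntiSamy's stripNonValidXMLCharacters. The class and method names are hypothetical, and the ranges are the Char production from XML 1.0.

```java
final class XmlCharFilterSketch {
    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Iterates by code point. codePointAt returns the bare surrogate value
    // for an unpaired surrogate, which falls outside the valid ranges.
    static String stripInvalid(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXmlCodePoint(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a\ud840\udc3cb";
        // prints "true": the supplementary character survives intact
        System.out.println(stripInvalid(input).equals(input));
    }
}
```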
Original comment by Shchekl...@gmail.com
on 13 Jun 2012 at 2:18
With AntiSamy 1.5.1, the actual character appears to work, with both DOM and
SAX. Numeric character references still appear to be broken (with both DOM and
SAX), but I think it's a defect in NekoHTML. I've filed a bug against NekoHTML:
https://sourceforge.net/tracker/index.php?func=detail&aid=3609978&group_id=195122&atid=952178#
Original comment by danr...@gmail.com
on 4 Apr 2013 at 6:38
NekoHTML has accepted my patch, and released version 1.9.19. Once AntiSamy is
validated against NekoHTML 1.9.19, I think you can close this bug.
Original comment by danr...@gmail.com
on 30 Oct 2013 at 4:27
Original issue reported on code.google.com by
danr...@gmail.com
on 6 Jan 2012 at 5:37