Serialized representation of character U+FFFD causes exception on deserialization

GoogleCodeExporter commented 9 years ago

This produces an exception:
Yaml yaml = new Yaml();
yaml.load(yaml.dump("\uFFFD"));

unacceptable character '�' (0xFFFD) special characters are not allowed
in "<string>", position 0
    at org.yaml.snakeyaml.reader.StreamReader.checkPrintable(StreamReader.java:69)
    at org.yaml.snakeyaml.reader.StreamReader.<init>(StreamReader.java:49)
    at org.yaml.snakeyaml.Yaml.load(Yaml.java:399)

snakeyaml version 1.10, Apple Java 1.6.0_31 on OS X 10.6.8.

Original issue reported on code.google.com by johnk...@gmail.com on 1 Jun 2012 at 9:51

GoogleCodeExporter commented 9 years ago

This is according to the spec: http://yaml.org/spec/1.1/#id868518

The character you want to use is not printable.

Original comment by py4fun@gmail.com on 4 Jun 2012 at 9:26

GoogleCodeExporter commented 9 years ago

I think the "#xE000-#xFFFD" character range given in the YAML spec as being 
printable is intended to be inclusive of the upper bound.

While the YAML spec doesn't seem to specify fully which variant of BNF they are 
using to describe the syntax, in RFC 4234 ABNF, value range alternatives are 
inclusive.
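
To make the inclusive reading concrete, here is a minimal sketch of a printable check that follows the ranges listed in the YAML 1.1 spec. The class and method names are hypothetical and this is not SnakeYAML's actual code; it only illustrates that treating the upper bound as inclusive admits U+FFFD while still rejecting U+FFFE.

```java
// Sketch of the YAML 1.1 printable-character ranges with inclusive bounds.
// YamlPrintable and isPrintable are illustrative names, not SnakeYAML API.
final class YamlPrintable {
    private YamlPrintable() {}

    /** Returns true if the code point lies in one of the printable ranges. */
    static boolean isPrintable(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0x7E)
                || cp == 0x85
                || (cp >= 0xA0 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)      // upper bound included
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isPrintable(0xFFFD)); // prints "true"
        System.out.println(isPrintable(0xFFFE)); // prints "false"
    }
}
```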

Original comment by johnk...@gmail.com on 4 Jun 2012 at 4:40

GoogleCodeExporter commented 9 years ago

I think you are right.

Original comment by aso...@gmail.com on 4 Jun 2012 at 6:21

Changed state: Started

GoogleCodeExporter commented 9 years ago

I had forgotten why \uFFFD was excluded. The reason is that Java returns this 
character when it hits a decoding error, and there is no way to distinguish 
such an error from the character itself. I shall put this info into the 
documentation to make it clear.
See the source here: 
http://code.google.com/p/snakeyaml/source/browse/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java#31

// NON_PRINTABLE changed from PyYAML: \uFFFD excluded because Java returns
// it in case of data corruption

Original comment by py4fun@gmail.com on 5 Jun 2012 at 10:15

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

I can't find any specific reference to U+FFFD in the Java documentation. But 
from what I understand, the general idea, not at all specific to Java, is that 
it gets inserted into Unicode text wherever a process is unable to convert a 
character between encodings correctly. It is, however, a valid, printable 
Unicode code point, and there's nothing malformed about strings that contain 
it, and the YAML spec reflects this.
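
A small sketch of why the two cases look identical after decoding, assuming the JDK's default lenient behavior of the String(byte[], Charset) constructor. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Shows that lenient decoding of corrupt bytes and decoding of a genuine
// U+FFFD produce the same Java string. Names here are illustrative only.
final class ReplacementCharDemo {
    private ReplacementCharDemo() {}

    /** Decodes UTF-8 leniently: invalid bytes are replaced with U+FFFD. */
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        String fromCorruptBytes = decodeLeniently(new byte[] {(byte) 0xFF});
        // EF BF BD is the valid UTF-8 encoding of U+FFFD itself.
        String fromValidBytes =
                decodeLeniently("\uFFFD".getBytes(StandardCharsets.UTF_8));
        System.out.println(fromCorruptBytes.equals(fromValidBytes)); // prints "true"
    }
}
```

Once the text is a String, the information about which case occurred is gone, so any distinction has to be made at decoding time.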

IMO, libraries generally shouldn't take special action on this character, 
because applications which accept arbitrary unicode input need to be able to 
work with this character, and the proper handling of it is 
application-specific. (The most common behavior I've seen in editors and web 
browsers is to have no special handling whatsoever, meaning they display the 
character's glyph from the font, same as any other printable character.)

If you're unwilling to consider changing the behavior of the deserializer, 
would you consider changing the behavior of the serializer to escape this 
character? The deserializer handles this character correctly when it is 
escaped. Then at least I'd know that round-tripping would work consistently, 
without having to preprocess all the strings I feed to snakeyaml.
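
As an illustration of the escaping idea, here is a rough sketch of a helper that writes a string as a double-quoted YAML scalar with U+FFFD emitted as the \uFFFD escape. The helper is hypothetical, handles only backslash, double quote and U+FFFD, and is not how SnakeYAML's emitter is implemented.

```java
// Hypothetical helper: renders a string as a double-quoted YAML scalar,
// escaping U+FFFD so that the text never contains the raw character.
// Only backslash, double quote and U+FFFD are handled in this sketch.
final class ReplacementCharEscaper {
    private ReplacementCharEscaper() {}

    static String toDoubleQuotedScalar(String value) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"': out.append("\\\""); break;
                case '\uFFFD': out.append("\\uFFFD"); break;
                default: out.append(c);
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // The output contains a backslash, 'u' and four hex digits
        // in place of the raw replacement character.
        System.out.println(toDoubleQuotedScalar("a\uFFFDb"));
    }
}
```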

Original comment by johnk...@gmail.com on 5 Jun 2012 at 5:08

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

It is not about willing/unwilling. It is a Java-specific problem. Feel free to 
propose a solution. If you can find a way to implement your requirement when 
_ALL_ the tests stay green, your solution will be taken for the next release. 

The problem is similar to the UTF-8 BOM mark. Java IO is broken and it does not 
ignore the UTF-8 BOM mark at the beginning of the stream. This is the only 
reason why SnakeYAML has 
http://code.google.com/p/snakeyaml/source/browse/src/main/java/org/yaml/snakeyaml/reader/UnicodeReader.java
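
For readers unfamiliar with the BOM problem, a minimal sketch of the general workaround follows: peek at the first three bytes and drop them if they are EF BB BF. It uses only JDK classes; the class and method names are invented for this example and the code is not a copy of UnicodeReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch of skipping a leading UTF-8 BOM before decoding.
// BomSkippingDemo and decodeSkippingBom are illustrative names.
final class BomSkippingDemo {
    private BomSkippingDemo() {}

    static String decodeSkippingBom(byte[] bytes) {
        try {
            PushbackInputStream in =
                    new PushbackInputStream(new ByteArrayInputStream(bytes), 3);
            byte[] head = new byte[3];
            int n = 0;
            while (n < 3) {                       // read up to three bytes
                int r = in.read(head, n, 3 - n);
                if (r < 0) break;
                n += r;
            }
            boolean hasBom = n == 3 && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
            if (!hasBom && n > 0) {
                in.unread(head, 0, n);            // not a BOM: put the bytes back
            }
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(decodeSkippingBom(withBom)); // prints "a"
    }
}
```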

Original comment by py4fun@gmail.com on 5 Jun 2012 at 9:03

GoogleCodeExporter commented 9 years ago

The two unit tests that fail for me after making U+FFFD printable are:
org.yaml.snakeyaml.issues.issue68.NonAsciiCharacterTest.testLoadFromFileWithWrongEncoding
org.pyyaml.PyReaderTest.testReaderUnicodeErrors

org.yaml.snakeyaml.issues.issue68.NonAsciiCharacterTest.testLoadFromFileWithWrongEncoding() 
isn't actually configuring the Reader to report encoding errors. 
I'd change how it sets up the InputStreamReader:

CharsetDecoder decoder = Charset.forName("Cp1252").newDecoder();
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
Object text = yaml.load(new InputStreamReader(input, decoder));

Then when issue68.txt is passed through the Reader, it throws 
java.nio.charset.UnmappableCharacterException. Of course, then the test isn't 
really testing snakeyaml, it's testing the behavior of java.io.Reader. So this 
unit test doesn't need to exist at all; issue 68 could have been resolved by 
informing the user that snakeyaml was working as designed, and if they want 
their Reader to throw exceptions on encoding errors, they can configure it to 
do so.
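
A self-contained sketch of that configuration, for anyone who wants to try it outside the snakeyaml test suite. It assumes UTF-8 input and a single invalid byte instead of issue68.txt, and the class and method names are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Sketch: a Reader that reports encoding errors instead of silently
// substituting U+FFFD. StrictReaderDemo is an illustrative name.
final class StrictReaderDemo {
    private StrictReaderDemo() {}

    static Reader strictReader(InputStream in, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new InputStreamReader(in, decoder);
    }

    /** Returns false if the bytes cannot be decoded without errors. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try (Reader reader = strictReader(new ByteArrayInputStream(bytes), charset)) {
            while (reader.read() != -1) {
                // drain the stream; decoding errors surface as exceptions
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] corrupt = {(byte) 0xFF};
        byte[] genuine = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodesCleanly(corrupt, StandardCharsets.UTF_8)); // prints "false"
        System.out.println(decodesCleanly(genuine, StandardCharsets.UTF_8)); // prints "true"
    }
}
```

With a Reader set up this way, any U+FFFD that reaches the parser was really present in the input, so the parser has no need to reject it.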

To fix org.pyyaml.PyReaderTest.testReaderUnicodeErrors, UnicodeReader.init() 
needs to be changed:
// Use given encoding
CharsetDecoder decoder = 
encoding.newDecoder().onUnmappableCharacter(CodingErrorAction.REPORT);
internalIn2 = new InputStreamReader(internalIn, decoder);

Then in org.pyyaml.PyReaderTest.testReaderUnicodeErrors there needs to be an 
additional catch block to get the new type of exception that will get thrown:
} catch (YAMLException e) {
     assertTrue(e.toString(),
         e.toString().contains("MalformedInputException"));
} finally {

Original comment by johnk...@gmail.com on 5 Jun 2012 at 11:41

GoogleCodeExporter commented 9 years ago

Fixed. It will be delivered in version 1.11
Thank you.

http://code.google.com/p/snakeyaml/source/browse/src/test/java/org/yaml/snakeyaml/issues/issue147/PrintableTest.java

Original comment by py4fun@gmail.com on 6 Jun 2012 at 12:49

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Any schedule for getting a 1.11 release out? There's at least one reported 
JRuby bug related to 0xFFFD being rejected.

It's not a critical thing for us, but we're pushing a new JRuby 1.7 preview 
release next week, and it would be nice to get SnakeYAML 1.11 in it.

FWIW, the JRuby issue was filed by me, because of the divergence from the 
YAML specification. Nobody has reported a real-world JRuby issue due to 0xFFFD 
rejection.

Original comment by head...@headius.com on 26 Jul 2012 at 5:01

GoogleCodeExporter commented 9 years ago

The JRuby issue in question:

http://jira.codehaus.org/browse/JRUBY-6317

Original comment by head...@headius.com on 26 Jul 2012 at 5:01

GoogleCodeExporter commented 9 years ago

Dear JRuby developers, 
SnakeYAML has implemented a few fixes/features exclusively for JRuby. 
Unfortunately, the feedback from JRuby developers gets the lowest priority.
We have a couple of places where we expect some info from JRuby:
http://jira.codehaus.org/browse/JRUBY-6067
http://code.google.com/p/snakeyaml/issues/detail?id=146

Once we get the feedback, we can close the corresponding issues and release 
SnakeYAML.
(version 1.11 will be released in August 2012)

Original comment by aso...@gmail.com on 27 Jul 2012 at 9:42

GoogleCodeExporter commented 9 years ago

We apologize for not being more responsive; I think these updates were getting 
funneled into my mail archive, and it has been a very busy summer.

I have commented on the bugs in question, including issue 132 that was 
connected to JRUBY-6067.

Original comment by head...@headius.com on 28 Sep 2012 at 9:35
