yeison / snakeyaml

Automatically exported from code.google.com/p/snakeyaml
Apache License 2.0

Encoding of unicode characters varies depending on file.encoding #160

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
From JRUBY-6930: http://jira.codehaus.org/browse/JRUBY-6930

On systems where file.encoding defaults to UTF-8, SnakeYAML does the right 
thing, only encoding non-printable characters when emitting YAML.

However, if file.encoding is something other than UTF-8 (like MacRoman on OS X 
or Windows-1252 on Windows) it will encode printable unicode characters as well 
using an invalid escape format.

An example script from JRuby:

require 'yaml'
s = "\u{00F6}"
yml = YAML.dump s
p yml

On JRuby, the resulting YAML string is "--- \xF6\n", whereas on MRI it is "--- 
ö\n...\n". Even if we were supposed to encode the character here, it should use 
the \u or \U formats specified in the YAML specification.

The relevant API calls involve the emitter; the logic in JRuby is handling a 
scalar event and passing the Java String for "ö" as the value of the event to 
SnakeYAML. A simple way to reproduce this would be to emit "ö" with SnakeYAML 
with file.encoding varying as described above.

My guess is that SnakeYAML is somewhere using a bare String#getBytes to get the 
encoded bytes of the string, which would not produce valid UTF-8 bytes when 
file.encoding is not UTF-8 (String#getBytes uses file.encoding for target 
Charset).
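
To illustrate the hypothesis above, here is a minimal Java sketch of how the bytes produced for "ö" depend on the charset passed to String#getBytes. The class and method names (GetBytesDemo, hex) are made up for this example and are not part of SnakeYAML or JRuby; ISO-8859-1 stands in for a non-UTF-8 platform default such as Windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    // Hypothetical helper: hex dump of a string encoded with an explicit charset.
    public static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00F6"; // "ö"
        // A bare s.getBytes() would use this charset instead of an explicit one.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8:      " + hex(s, StandardCharsets.UTF_8));      // C3B6
        System.out.println("ISO-8859-1: " + hex(s, StandardCharsets.ISO_8859_1)); // F6
    }
}
```

The single byte F6 from the Latin-1 encoding matches the "\xF6" seen in the JRuby output, which is consistent with a non-UTF-8 charset being applied somewhere in the pipeline, though it does not by itself show where.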

Original issue reported on code.google.com by head...@headius.com on 15 Oct 2012 at 10:31

GoogleCodeExporter commented 9 years ago
A YAML document can only be UTF-8 or UTF-16 encoded:

http://yaml.org/spec/1.1/#id868742

Original comment by py4fun@gmail.com on 16 Oct 2012 at 1:41

GoogleCodeExporter commented 9 years ago
That is indeed correct, but I'm not sure how it is relevant. In this case, we 
have a unicode String being encoded by SnakeYAML into YAML. However, SnakeYAML 
is encoding it differently based on whether file.encoding is set to UTF-8 or 
not. I believe this indicates it is using a bare String#getBytes somewhere, 
which will default to file.encoding.

Original comment by head...@headius.com on 17 Oct 2012 at 4:47

GoogleCodeExporter commented 9 years ago
SnakeYAML does not use default platform encoding. Feel free to prove the 
opposite with a test.
If you use Emitter, please be sure that you provide the properly configured 
Writer in the constructor:
public Emitter(Writer stream, DumperOptions opts) { ... }
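
As a rough sketch of what "properly configured Writer" could mean here: a Writer built over an OutputStream should name its charset explicitly, otherwise OutputStreamWriter falls back to the platform default. The example below uses only the JDK so that it runs on its own; WriterCharsetDemo and write are hypothetical names, and in real code the Writer created this way would be the one passed to the Emitter constructor quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterCharsetDemo {
    // Hypothetical helper: write text through a Writer bound to an explicit
    // charset and return the raw bytes that reached the underlying stream.
    public static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\u00F6"; // "ö"
        System.out.println("UTF-8 bytes:      " + write(s, StandardCharsets.UTF_8).length);      // 2
        System.out.println("ISO-8859-1 bytes: " + write(s, StandardCharsets.ISO_8859_1).length); // 1
    }
}
```

The same String yields different byte sequences depending only on how the Writer was constructed, which is why the encoding has to be fixed on the caller's side.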

Original comment by py4fun@gmail.com on 17 Oct 2012 at 8:27

GoogleCodeExporter commented 9 years ago
I believe the bug is in Base64Coder, lines 68-70:

    public static String encodeString(String s) {
        return new String(encode(s.getBytes()));
    }

This should be getBytes("UTF-8").
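
A sketch of what the suggested change could look like. This is not the actual Base64Coder source: java.util.Base64 (Java 8+) is swapped in for the library's own encode method so the example is self-contained, and the class name Base64Utf8Demo is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Utf8Demo {
    // Encode the string's UTF-8 bytes, independent of the platform default charset.
    public static String encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    public static void main(String[] args) {
        System.out.println(encodeString("\u00F6")); // w7Y=
    }
}
```

Using the StandardCharsets.UTF_8 constant rather than the string "UTF-8" also avoids the checked UnsupportedEncodingException that getBytes(String) declares.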

Original comment by head...@headius.com on 17 Oct 2012 at 8:29

GoogleCodeExporter commented 9 years ago
Hmm, actually I see that method isn't even used anymore (could be removed). 
I'll try to come up with a test case, but it's tricky to force the issue.

Original comment by head...@headius.com on 17 Oct 2012 at 10:15

GoogleCodeExporter commented 9 years ago
You may be right about the writer. I realize now we're not specifying an 
encoding for it. Thanks for that tip...I'll see if I can fix what we have, and 
if I'm able then we can close this issue. Back in a flash.

Original comment by head...@headius.com on 17 Oct 2012 at 10:45

GoogleCodeExporter commented 9 years ago
Ok, I think you were right. I modified our writer to specify an encoding, and 
the cases I had before work correctly now.

You may close this bug. Sorry for the noise!

Original comment by head...@headius.com on 17 Oct 2012 at 10:58

GoogleCodeExporter commented 9 years ago
As you might have noticed, the problem you mentioned in comment #4 is not 
in SnakeYAML's code. I agree that it is a bug. But we do not use this method. 
It is still present because the author of the code contacted us asking to 
restore the original version of his code.

Original comment by py4fun@gmail.com on 18 Oct 2012 at 9:53