youxinren / snakeyaml

Automatically exported from code.google.com/p/snakeyaml
Apache License 2.0

It is possible to dump string data that cannot be loaded back #155

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. create a String from a UTF-8 byte[] interpreted as ISO-8859-1 but with 
special characters
2. use Yaml.dump()
3. try to use Yaml.load()
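
Step 1 can be sketched with the JDK alone. This is only an illustration of how such a String comes about, not the Spock specification from the gist: the class and method names are hypothetical, and U+00D9 is just one example of a character whose UTF-8 encoding contains the byte 0x99.

```java
import java.nio.charset.StandardCharsets;

final class MisdecodedString {

    // Encode as UTF-8, then decode the bytes with the wrong charset
    // (ISO-8859-1), as described in step 1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00D9 is encoded in UTF-8 as the two bytes 0xC3 0x99.
        String broken = misdecode("\u00D9");
        for (char c : broken.toCharArray()) {
            // Print each resulting code unit; the second one is a control character.
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The resulting String is what would then be passed to Yaml.dump() in step 2. The dump and load calls are left out so that the sketch has no dependency on the library.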

What is the expected output? What do you see instead?
Although I'd expect the special character data to be corrupted, as we are 
interpreting it using the wrong charset, Yaml should either a) refuse to dump 
the data or b) be able to read back anything it can dump.

What version of SnakeYAML are you using? On what Java version?
SnakeYAML 1.10
Java 1.6.0_33

Please provide any additional information below. (Often a failing test is
the best way to describe the problem.)
There is a Spock specification that reproduces the problem here: 
https://gist.github.com/3551913#comments
The original issue that led me to investigate this is here: 
https://github.com/robfletcher/betamax/issues/52

Original issue reported on code.google.com by robert.w...@gmail.com on 31 Aug 2012 at 4:37

GoogleCodeExporter commented 9 years ago
Please check that the test meets your expectations.

http://code.google.com/p/snakeyaml/source/browse/src/test/java/org/yaml/snakeyaml/issues/issue155/BinaryTest.java

Original comment by py4fun@gmail.com on 1 Sep 2012 at 7:47

GoogleCodeExporter commented 9 years ago
A few questions are on the way...

1) According to http://en.wikipedia.org/wiki/ISO/IEC_8859-1, 0x99 is unused 
and has no visual representation (see the attachment). I could not find a way 
to detect such a character in the Java API (java.lang.Character). If such a 
character can be detected in a String, it can be dumped as binary data.

2) I am afraid that even if we can somehow detect a 'strange' character and 
make the value binary, the problem will still not be solved. The String will 
become binary (!!binary) and it will be read back as byte[] (which is not what 
you want). A YAML document has no way to indicate the encoding of the binary 
data, which makes it impossible for SnakeYAML to create characters out of the 
bytes.

3) Feel free to take the source and experiment with the tests.
You can change SafeRepresenter.BINARY_PATTERN to make the test for issue 155 
green.
This is an example:
public static Pattern BINARY_PATTERN = 
Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\u0085\u00A0-\uD7FF\uE000-\uFFFD]");

Any feedback is welcome.

Original comment by py4fun@gmail.com on 2 Sep 2012 at 8:33
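
On question 1, java.lang.Character does provide isISOControl(char), which returns true for U+0000 to U+001F and U+007F to U+009F, so it covers 0x99. Whether the eventual SnakeYAML fix uses it is not stated in this thread. The following is only a sketch of how such a check might look; the class and method names are hypothetical, and treating tab, line feed and carriage return as acceptable is an assumption.

```java
final class ControlCharCheck {

    // Hypothetical helper: true if the string holds a control character
    // other than tab, line feed or carriage return.
    static boolean hasUnprintableControl(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A string containing the C1 control character 0x99.
        System.out.println(hasUnprintableControl("\u00C3\u0099"));
        // Ordinary text with a trailing line feed.
        System.out.println(hasUnprintableControl("plain text\n"));
    }
}
```

A check like this only answers whether a String contains such a character. It does not address point 2, that a value dumped as binary comes back as byte[].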

GoogleCodeExporter commented 9 years ago
Hi, sorry I didn't get back sooner, I was away all weekend.

The test looks right, yes.

For my use case I think binary would work. Betamax 'records' HTTP traffic the 
first time a request is made & 'plays it back' subsequent times; sending the 
original byte[] would be the right thing to do as it would guarantee the 
'recorded' response is the same as the real one. That's not to say it's the 
right thing to do in the general case, though.

Original comment by robert.w...@gmail.com on 3 Sep 2012 at 8:16

GoogleCodeExporter commented 9 years ago
How are you going to detect whether the parsed data is a String or a byte[]? 
Do you mean that you wish to check the class of the returned object at runtime?
Please be aware that the binary representation will be completely unreadable 
and uneditable by humans. Why do you need YAML then?
In general, YAML is not supposed to transfer binary data. The very same byte 
sequence for one person means binary abracadabra, while for another person it 
may mean a beautiful German or Russian character.
As I already said, you can change the source to make the test work (and build 
the version that works for you). But it breaks the expectations of other users 
(as you can clearly see when you run all the tests).
If we cannot find a general solution for everyone, then I would rather keep 
things as they are, because it is more consistent.

And of course, SnakeYAML must always eat its own food. There must be no 
exception when SnakeYAML loads its own output. We still have to find a solution 
for it.

Original comment by py4fun@gmail.com on 3 Sep 2012 at 8:48
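
The runtime check asked about above, and the point that the same bytes mean different characters under different charsets, can be illustrated without SnakeYAML. This is a minimal sketch, assuming a loaded scalar arrives as either a String or a byte[]; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LoadedValue {

    // Hypothetical helper: normalise a loaded scalar to text.
    // For byte[] the caller must supply the charset, because the
    // bytes themselves do not record one.
    static String asText(Object loaded, Charset charsetForBinary) {
        if (loaded instanceof String) {
            return (String) loaded;
        }
        if (loaded instanceof byte[]) {
            return new String((byte[]) loaded, charsetForBinary);
        }
        throw new IllegalArgumentException("Unexpected value: " + loaded);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00DC.
        byte[] raw = {(byte) 0xC3, (byte) 0x9C};
        // Decoded as UTF-8 this is one character; as ISO-8859-1 it is two.
        System.out.println(asText(raw, StandardCharsets.UTF_8).length());
        System.out.println(asText(raw, StandardCharsets.ISO_8859_1).length());
    }
}
```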

GoogleCodeExporter commented 9 years ago
The library already supports binary HTTP response data so if invalid strings 
were binary encoded it should work OK. You're right that would mean they aren't 
user editable but it's quite an edge case to get into this situation in the 
first place.

Original comment by robert.w...@gmail.com on 3 Sep 2012 at 4:12

GoogleCodeExporter commented 9 years ago
The issue should be fixed now. Try the latest source or the SNAPSHOT.
http://code.google.com/p/snakeyaml/source/detail?r=a03784312f5beb3031fd8a08b47c16c6bff1f404

Please give it a try.

This change has also affected issue 137. I think it has become more 
consistent. Hopefully the reporter of issue 137 can give some feedback on 
this change.

Original comment by py4fun@gmail.com on 5 Sep 2012 at 8:09

GoogleCodeExporter commented 9 years ago
Interestingly I'm finding that if I set my HTTP response body to be the 
incorrectly decoded String then SnakeYAML 1.11-SNAPSHOT will dump it as binary. 
When I read it back the special character doesn't match but it no longer falls 
over.

If I set the HTTP response body as bytes then SnakeYAML is dumping it as a 
string (probably because I'm doing some stuff in my Representer implementation).

I need to do some more investigation as I'm also getting inconsistent results 
between running tests from the terminal using Gradle and running them within 
IntelliJ IDEA. Presumably there's a different default charset I'm not 
accounting for.

Original comment by robert.w...@gmail.com on 5 Sep 2012 at 8:47
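
One way to check the default-charset suspicion is to print Charset.defaultCharset() under both launchers and to pass an explicit charset wherever bytes are converted. The sketch below is not taken from Betamax or SnakeYAML, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetCheck {

    // Encode with an explicit charset so the result does not depend on
    // the JVM default, which can differ between launchers.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        String text = "\u00DC";
        // getBytes() without arguments uses the platform default charset,
        // so this line may print different bytes in different environments.
        System.out.println(Arrays.toString(text.getBytes()));
        // The explicit variant always yields the UTF-8 bytes.
        System.out.println(Arrays.toString(encode(text)));
    }
}
```

If the first line of output differs between the terminal and the IDE, any conversion that relies on the default would explain the inconsistent results.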

GoogleCodeExporter commented 9 years ago
If you think there is still something to do for SnakeYAML, please let us know.
Otherwise, we can close the issue and release version 1.11.

Original comment by py4fun@gmail.com on 5 Sep 2012 at 3:10

GoogleCodeExporter commented 9 years ago
I think this can be closed. SnakeYAML is doing the right thing.

Original comment by robert.w...@gmail.com on 5 Sep 2012 at 3:16

GoogleCodeExporter commented 9 years ago
The fix will be provided in version 1.11

Original comment by py4fun@gmail.com on 5 Sep 2012 at 5:05