Closed by mbj 8 years ago.
Interesting. Older versions of Ruby error out on this file.
Sorry, I was wrong. Even the newest version of Ruby errors out if the encoding comment is missing. With the encoding comment, everything back to 1.9.2 can execute this file.
The message comes from sym_check_asciionly.
I'm just compiling a patched Ruby which prints the name of the invalid encoding, rather than just saying invalid symbol encoding.
OK, with patched Ruby:
7:50 ~sw/ruby % ./ruby ~prog/Ruby/parser/test3.rb [7393bf6...]
/home/alex/Programming/Ruby/parser/test3.rb: invalid encoding for symbol: UTF-8 (EncodingError)
The real problem here is that parser's lexer does not respect encoding comments. If there is an encoding comment, it could transcode all string literals to that encoding. This would provide closer compatibility with the behavior of the Ruby interpreter.
@whitequark Whaddya think?
BTW just sent a patch to the Ruby core team to make the error message @mbj shows above a bit more informative. https://bugs.ruby-lang.org/issues/12016
@alexdowad Of course the lexer respects the encoding comment... see Source::Buffer.recognize_encoding. This must be a missing reencode_string somewhere.
Ooh. OK. I'll see if I can find where it is.
The problem is on lines 104-106 of buffer.rb:
input.
force_encoding(detected_encoding).
encode(Encoding::UTF_8)
detected_encoding is Encoding::US_ASCII, as expected. As it turns out, "\xC7" is invalid even in US-ASCII. And when it is passed through encode(Encoding::UTF_8), the result is also invalid UTF-8.
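This is easy to check in isolation. A minimal illustration of just the byte in question (this is not parser's actual code):

```ruby
# The raw byte 0xC7 is valid in neither encoding involved here.
byte = "\xC7".dup

# US-ASCII only covers 0x00..0x7F, so 0xC7 is out of range:
byte.force_encoding(Encoding::US_ASCII).valid_encoding?  # => false

# In UTF-8, 0xC7 is a lead byte that requires a continuation byte,
# so a lone 0xC7 is invalid there as well:
byte.force_encoding(Encoding::UTF_8).valid_encoding?     # => false
```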
Yup, that's a Ruby bug then.
I guess str.encode(enc).valid_encoding?
should always be true? And if it's not a valid encoding, an exception should be raised?
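For reference, String#encode only guarantees a valid result when told how to handle bad input. A small sketch:

```ruby
bad = "\xC7".dup.force_encoding(Encoding::US_ASCII)
bad.valid_encoding?   # => false

# With explicit replacement, the result is guaranteed to be valid in the
# target encoding (invalid bytes become the replacement character):
good = bad.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
good.valid_encoding?  # => true
```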
No, the characters "\"\\xC7\"" are valid UTF-8.
It's not "\"\\xC7\"", but simply "\xC7". Just one character. Its most significant bit is set, so it shouldn't be valid UTF-8 at the end of a string. Right?
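To spell that out with a quick check (the second literal is just an arbitrary valid two-byte sequence, chosen for illustration):

```ruby
# A lone 0xC7 is a truncated UTF-8 sequence; followed by a continuation
# byte (0x80..0xBF) it decodes to a real character.
"\xC7".b.force_encoding(Encoding::UTF_8).valid_encoding?      # => false
"\xC7\x87".b.force_encoding(Encoding::UTF_8).valid_encoding?  # => true (U+01C7)
```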
Uhh. Let me clarify here. The source itself is valid. But the string literal which is lexed from the source isn't.
Although Source::Buffer can be said to respect encoding comments (in a way), in that it interprets the source file as having the indicated encoding, it transcodes the source from the indicated encoding to UTF-8. Then, the lexer treats all string literals as UTF-8.
Sorry for some mild confusion in my previous comment.
Correct, and then lexer.rl will call encode_escape on the just-lexed \xc7:
# \xff
| 'x' xdigit{1,2}
%{ @escape = encode_escape(tok(@escape_s + 1, p).to_i(16)) }
Or at least it's supposed to.
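Roughly, that action parses the hex digits after \x and packs them into a single byte. A sketch with illustrative variable names (this is not the lexer's actual code):

```ruby
# Hypothetical stand-in for the matched escape token, e.g. from "\xC7".
escape_text = "xC7"

# Skip the leading 'x' and interpret the rest as hexadecimal,
# mirroring tok(@escape_s + 1, p).to_i(16) above.
byte_value = escape_text[1..-1].to_i(16)  # => 199 (0xC7)

# Integer#chr packs the value into a one-byte string.
byte_value.chr                            # a string containing byte 0xC7
```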
encode_escape doesn't help us, because lexer.encoding is not US-ASCII. The lexer has no idea that the source was originally US-ASCII.
I have a fix, not sure if you'll like it.
Well, @lexer.encoding should be US-ASCII; that's what it's supposed to be.
Incidentally, I've missed that your tok optimizations were invalid, since this...
def tok(s = @ts, e = @te)
source = @source[s...e]
return source unless @need_encode
source.encode(@encoding)
end
is in no way equivalent to @source[s...e].
Well, @lexer.encoding should be US-ASCII, that's what it's supposed to do.
It doesn't. Have a look at the patch which I am just pushing, though. Don't merge it too quickly... we need to think carefully before making a change like this.
...is in no way equivalent to @source[s...e].
I saw that, but there was some way I had reasoned out that the conversion was actually safe. Let me look again and see if I can remember what it was.
It doesn't.
Actually, now I remember why and how I implemented that. I tried emitting string literals in their "proper" encoding, but that would just cause downstream headache. Tooling does not want to deal with non-ASCII-compatible (US-ASCII-compatible in Ruby parlance, not ASCII-8BIT which is an extension of ASCII) encodings, so we do not emit that.
I think the fact that this file cannot be parsed is intended behavior, which we should document. @mbj?
@need_encode is only ever true if the source has been transcoded to UTF-32LE (to make indexed retrieval of characters faster).
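(For context on why UTF-32LE: it is a fixed-width encoding, four bytes per character, so character indexing needs no variable-width decoding. A quick illustration:)

```ruby
# Five characters, one of them multi-byte in UTF-8, become a
# fixed four bytes each in UTF-32LE.
s = "héllo".encode(Encoding::UTF_32LE)
s.length    # => 5
s.bytesize  # => 20, i.e. 5 characters x 4 bytes
```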
Are the changes which I made safe if the source has been converted to UTF-32LE? I can't tell for sure without testing.
I think what happened is that I had done some performance testing, hadn't found a case where converting to UTF-32LE was actually beneficial, and had a commit where the conversion to UTF-32LE was ripped out. With that gone, the use of tok in some of those places was really and truly redundant.
You said that you have found cases where converting to UTF-32LE saves a lot of time, so I removed that commit from the PR. But the commit which eliminated tok in some places stayed.
I'll test to see if this actually breaks anything.
Tooling does not want to deal with non-ASCII-compatible (US-ASCII-compatible in Ruby parlance, not ASCII-8BIT which is an extension of ASCII) encodings, so we do not emit that.
Hmm. But US-ASCII is ASCII compatible.
[5] pry(main)> Encoding.list.select(&:ascii_compatible?)
=> [#<Encoding:ASCII-8BIT>,
#<Encoding:UTF-8>,
#<Encoding:US-ASCII>,
#<Encoding:UTF8-MAC>,
#<Encoding:EUC-JP>,
#<Encoding:Windows-31J>,
#<Encoding:Big5>,
#<Encoding:Big5-HKSCS>,
#<Encoding:Big5-UAO>,
#<Encoding:CP949>,
#<Encoding:Emacs-Mule>,
#<Encoding:EUC-KR>,
#<Encoding:EUC-TW>,
#<Encoding:GB2312>,
#<Encoding:GB18030>,
#<Encoding:GBK>,
#<Encoding:ISO-8859-1>,
#<Encoding:ISO-8859-2>,
#<Encoding:ISO-8859-3>,
#<Encoding:ISO-8859-4>,
#<Encoding:ISO-8859-5>,
#<Encoding:ISO-8859-6>,
#<Encoding:ISO-8859-7>,
#<Encoding:ISO-8859-8>,
#<Encoding:ISO-8859-9>,
#<Encoding:ISO-8859-10>,
#<Encoding:ISO-8859-11>,
#<Encoding:ISO-8859-13>,
#<Encoding:ISO-8859-14>,
#<Encoding:ISO-8859-15>,
#<Encoding:ISO-8859-16>,
#<Encoding:KOI8-R>,
#<Encoding:KOI8-U>,
#<Encoding:Shift_JIS>,
#<Encoding:Windows-1251>,
#<Encoding:IBM437>,
#<Encoding:IBM737>,
#<Encoding:IBM775>,
#<Encoding:CP850>,
#<Encoding:IBM852>,
#<Encoding:CP852>,
#<Encoding:IBM855>,
#<Encoding:CP855>,
#<Encoding:IBM857>,
#<Encoding:IBM860>,
#<Encoding:IBM861>,
#<Encoding:IBM862>,
#<Encoding:IBM863>,
#<Encoding:IBM864>,
#<Encoding:IBM865>,
#<Encoding:IBM866>,
#<Encoding:IBM869>,
#<Encoding:Windows-1258>,
#<Encoding:GB1988>,
#<Encoding:macCentEuro>,
#<Encoding:macCroatian>,
#<Encoding:macCyrillic>,
#<Encoding:macGreek>,
#<Encoding:macIceland>,
#<Encoding:macRoman>,
#<Encoding:macRomania>,
#<Encoding:macThai>,
#<Encoding:macTurkish>,
#<Encoding:macUkraine>,
#<Encoding:CP950>,
#<Encoding:CP951>,
#<Encoding:stateless-ISO-2022-JP>,
#<Encoding:eucJP-ms>,
#<Encoding:CP51932>,
#<Encoding:EUC-JIS-2004>,
#<Encoding:GB12345>,
#<Encoding:Windows-1252>,
#<Encoding:Windows-1250>,
#<Encoding:Windows-1256>,
#<Encoding:Windows-1253>,
#<Encoding:Windows-1255>,
#<Encoding:Windows-1254>,
#<Encoding:TIS-620>,
#<Encoding:Windows-874>,
#<Encoding:Windows-1257>,
#<Encoding:MacJapanese>,
#<Encoding:UTF8-DoCoMo>,
#<Encoding:SJIS-DoCoMo>,
#<Encoding:UTF8-KDDI>,
#<Encoding:SJIS-KDDI>,
#<Encoding:stateless-ISO-2022-JP-KDDI>,
#<Encoding:UTF8-SoftBank>,
#<Encoding:SJIS-SoftBank>]
US-ASCII is; ASCII-8BIT isn't for any practical purposes, despite Ruby claiming that it is.
Well, I guess ASCII-compatible isn't the right term here. What I'm looking for is "embedded in Unicode space". ASCII-8BIT isn't.
ASCII-8BIT isn't for any practical purposes
Will ASCII-8BIT literals (strings, regexps, etc.) cause "incompatible encoding" errors downstream?
I am just trying to see if UTF-8 regexps work on ASCII-8BIT text and vice versa.
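A quick check along those lines (my own sketch, not exhaustive):

```ruby
# An ASCII-only pattern matches ASCII-8BIT text without complaint:
/abc/ =~ "abc\xC7".b   # => 0

# But a UTF-8 pattern containing non-ASCII characters raises when
# matched against non-ASCII binary data:
begin
  /é/ =~ "\xC7".b
rescue Encoding::CompatibilityError
  :incompatible
end
```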
Yep, they can't be concatenated with UTF-8 strings for example, so trying to display messages will break.
Ruby 1.9 did well to add support for various text encodings to Ruby, but the way they did it sure seems to lead to a lot of pain.
Well this is just peachy.
:fire:
It sure sucks that valid Ruby files can contain string literals which we can't represent in any way.
What if we emit strings/symbols as Encoding::BINARY if they are not representable in UTF-8, and warn the users that they should expect those dreaded binary strings?
And what do you do with them then?
(Insert flippant answer here)
The fact is that the Ruby language includes string literals which are not UTF-8. If someone wants to process Ruby code, they need to face up to that fact.
If you just use them in a comparison (#==), you are OK. Nothing bad will happen.
If you need to use a regex on them, that is also fine. Nothing bad will happen.
If you want to print them out directly, that will work fine. But probably you want to concatenate them with something first? Hmm.
I've just noticed that concatenating ASCII-8BIT onto UTF-8 seems to work fine, but not binary onto UTF-8. Weird. Don't know why.
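Poking at it a bit more, the deciding factor seems to be the bytes rather than the encoding label: concatenation succeeds whenever at least one side is 7-bit clean, and fails only when both sides contain non-ASCII bytes. A sketch of what I am seeing:

```ruby
utf8   = "résumé"   # UTF-8, contains non-ASCII characters
binary = "\xC7".b   # ASCII-8BIT with a high byte

# A 7-bit-clean binary string is compatible; the result stays UTF-8:
("ok".b + utf8).encoding   # => Encoding::UTF_8

# A high-byte binary string plus ASCII-only UTF-8 also works,
# but the result takes the ASCII-8BIT label:
("abc" + binary).encoding  # => Encoding::ASCII_8BIT

# Only when both sides have non-ASCII bytes does it blow up:
begin
  utf8 + binary
rescue Encoding::CompatibilityError
  :incompatible
end
```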
If you need to use a regex on them
(I haven't checked this thoroughly -- I've just tried regexes which are checking for alphanumeric characters.)
@mbj Your opinion here? I'm inclined to bail on this ASCII-8BIT case as EWONTFIX. Utility seems extremely marginal and we already do not support non-ASCII-compatible encodings, for instance.
It appears that @whitequark somehow forgot to ask for my opinion too. But that's fine, you don't have to ask me... I'm just too helpful and nice to withhold my (long and detailed) opinions from people!
@whitequark, one important point for you:
There is a very big difference between a parser which can handle 100% of valid Ruby source code, and one which can handle 99.9%, or any other quantity less than 100%.
If parser can handle absolutely everything that Ruby's parser can, then all users of parser get the ability to handle all Ruby code, almost for free. That is very valuable. That is why we do crazy things to support backslash-delimited strings (for example).
If it's almost 100%, but not quite, that is a different thing altogether.
It appears that ASCII-8BIT is infectious; you can combine ASCII-8BIT strings with UTF-8 any way you want, and the UTF-8 will be quietly converted to ASCII-8BIT. (If this is wrong, I would like to know.) So there is no special burden on the users to support this case. Generally, their code should "just work".
@alexdowad I'm aware of your opinion already:
It sure sucks that valid Ruby files can contain string literals which we can't represent in any way.
Anyway, regarding the following:
There is a very big difference between a parser which can handle 100% of valid Ruby source code, and one which can handle 99.9%, or any other quantity less than 100%.
We don't handle 100%, nor will we ever. https://github.com/whitequark/parser#compatibility-with-ruby-mri.
Sorry, wrong link copied, that should've been https://github.com/whitequark/parser#known-issues.
It's quite telling that two out of five of those are related to encodings, too.
Out of the 5, 2 of them break compatibility with old versions of Ruby, but are compatible with the newest versions. That leaves 3.
So? That doesn't really matter. In your own words, either we support 100% of Rubies we claim to support, or not.
@mbj Your opinion here? I'm inclined to bail on this ASCII-8BIT case as EWONTFIX. Utility seems extremely marginal and we already do not support non-ASCII-compatible encodings, for instance.
First: Sorry for the long reaction time, the question was from February and I simply missed the notification.
We have encountered situations like this before. And as this project is well aware, 100% Ruby compatibility is unachievable, not least because the definition of "100% Ruby compatibility" does not exist and is in steady flux.
This steady flux, plus the "Ruby is defined by its implementation" mantra, creates lots of unintended edge cases, many of which upstream might not even consider desired behavior, that a tool like parser could try to support (a.k.a. bug compatibility).
As development time is very limited, a trade-off needs to be made. I do think this case should fall under: not in the subset parser implements.
I do think that it's much better to invest the limited development resources on parser to handle a significant subset of Ruby correctly, instead of hunting something we cannot reach (as the definition of done does not exist!).
What's important is that parser is aware of its limitations and does not crash with "arbitrary exceptions" like encoding errors; Parser::UnsupportedSyntaxError (a nice version of LanguageToUglyError) should be raised in cases that are decided to be out of scope.
This would ease the job for downstream authors (like me) of adjusting their tools that expect to handle "arbitrary Ruby inputs", as we could then be explicitly sure whether we hit an out-of-scope input or an input where parser is bugged.
Naturally, I'd expect that parser reduces the instances of "explicitly out of scope" to zero over time (or that Ruby reduces its definition of "valid" upstream, but I have no hope for this); this reduction to zero can be handled by people interested, like @alexdowad. But we always have to have a built-in, first-class assumption that parser will not be able to handle all input MRI can handle.
OT: Other implementations of Ruby also decided to choose a "sane subset" of Ruby to implement, mostly feature-wise. I do think a parser is allowed to reduce syntax support to a "sane subset", as parsing syntax is its feature.
@whitequark So, to be more explicit about my wall of text / rantish thing above: can we detect this case in a way that a dedicated exception from the Parser::* namespace can be raised, instead of the generic EncodingError? If parser were changed to do so, and the issue were documented, I would consider this issue closed.
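The shape of that suggestion might look like the following sketch (the wrapper method and message are illustrative; Parser::UnsupportedSyntaxError is the name proposed above, not an existing class in parser):

```ruby
module Parser
  # Proposed in this thread; not (yet) part of parser's API.
  class UnsupportedSyntaxError < StandardError; end
end

# Illustrative wrapper: translate low-level encoding failures into a
# dedicated, documented exception that downstream tools can rescue.
def reencode_source(input, detected_encoding)
  input.dup.force_encoding(detected_encoding).encode(Encoding::UTF_8)
rescue EncodingError => e
  raise Parser::UnsupportedSyntaxError,
        "source encoding not supported by parser: #{e.message}"
end
```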
I do not want to prevent @alexdowad from trying to fix this issue, but as I understand it that would take a long time, and the class of syntax "not explicitly handled by parser" needs to be made explicit in the API.
@mbj This case is different from plain "Parser cannot handle that". In fact, Parser has no problems handling string literals in weird encodings! Instead, it's the downstream tools I'm concerned about:
Tooling does not want to deal with non-ASCII-compatible (US-ASCII-compatible in Ruby parlance, not ASCII-8BIT which is an extension of ASCII) encodings, so we do not emit that.
So in principle we can add a default-off switch "give me literals in file encoding". However, this adds some testing headache.
So in principle we can add a default-off switch "give me literals in file encoding". However, this adds some testing headache.
I'd also be fine if the switch did not exist and parser bailed out with "I do not want to support this Ruby". Reasons above.
Good points from @mbj above.
However, fixing this problem does not require adding a switch or anything like that. I previously raised a key point, which has not been addressed in this discussion:
It appears that ASCII-8BIT is infectious; you can combine ASCII-8BIT strings with UTF-8 any way you want, and the UTF-8 will be quietly converted to ASCII-8BIT. (If this is wrong, I would like to know.) So there is no special burden on the users to support this case. Generally, their code should "just work".
@whitequark says that "Tooling does not want to deal with... ASCII-8BIT", but if the above paragraph is true, there is no reason why tooling doesn't want ASCII-8BIT. Tooling doesn't want BINARY, sure.
The following file (reduced from: https://github.com/ruby/spec/blob/master/core/symbol/casecmp_spec.rb) crashes parser:
Backtrace:
Ruby 2.3.0-p0 accepts it and prints: