Crash when parsing symbols

mbj commented 8 years ago

The following file (reduced from: https://github.com/ruby/spec/blob/master/core/symbol/casecmp_spec.rb) crashes parser:

# -*- encoding: us-ascii -*-

p :"\xC3"

Backtrace:

/home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/builders/default.rb:172:in `to_sym': invalid encoding symbol (EncodingError)
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/builders/default.rb:172:in `symbol_compose'
        from ruby23.y:1839:in `_reduce_467'
        from /home/mbj/.rubies/ruby-2.3.0/lib/ruby/2.3.0/racc/parser.rb:259:in `_racc_do_parse_c'
        from /home/mbj/.rubies/ruby-2.3.0/lib/ruby/2.3.0/racc/parser.rb:259:in `do_parse'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/base.rb:162:in `parse'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner/ruby_parse.rb:132:in `process'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:202:in `process_buffer'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:195:in `block in process_files'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:181:in `each'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:181:in `process_files'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:159:in `block in process_all_input'
        from /home/mbj/.rubies/ruby-2.3.0/lib/ruby/2.3.0/benchmark.rb:293:in `measure'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:157:in `process_all_input'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner/ruby_parse.rb:128:in `process_all_input'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:33:in `execute'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:11:in `go'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/bin/ruby-parse:6:in `<top (required)>'
        from /home/mbj/.gem/ruby/2.3.0/bin/ruby-parse:23:in `load'
        from /home/mbj/.gem/ruby/2.3.0/bin/ruby-parse:23:in `<main>'

Ruby 2.3.0-p0 accepts it and prints:

:"\xC3"

alexdowad commented 8 years ago

Interesting. Older versions of Ruby error out on this file.

alexdowad commented 8 years ago

Sorry, I was wrong. Even the newest version of Ruby errors out, if the encoding comment is missing. With the encoding comment, everything back to 1.9.2 can execute this file.

alexdowad commented 8 years ago

The message comes from sym_check_asciionly.

I'm just compiling a patched Ruby which prints the name of the invalid encoding, rather than just saying invalid symbol encoding.

alexdowad commented 8 years ago

OK, with patched Ruby:

7:50 ~sw/ruby % ./ruby ~prog/Ruby/parser/test3.rb                          [7393bf6...]
/home/alex/Programming/Ruby/parser/test3.rb: invalid encoding for symbol: UTF-8 (EncodingError)

alexdowad commented 8 years ago

The real problem here is that parser's lexer does not respect encoding comments. If there is an encoding comment, it could transcode all string literals to that encoding. This would provide closer compatibility with the behavior of the Ruby interpreter.

@whitequark Whaddya think?

alexdowad commented 8 years ago

BTW just sent a patch to the Ruby core team to make the error message @mbj shows above a bit more informative. https://bugs.ruby-lang.org/issues/12016

whitequark commented 8 years ago

@alexdowad Of course the lexer respects the encoding comment... see Source::Buffer.recognize_encoding. This must be a missing reencode_string somewhere.

alexdowad commented 8 years ago

Ooh. OK. I'll see if I can find where it is. On Jan 25, 2016 11:56 AM, "whitequark" notifications@github.com wrote:

@alexdowad https://github.com/alexdowad Of course the lexer respects the encoding comment... see Source::Buffer.recognize_encoding. This must be a missing reencode_string somewhere.

— Reply to this email directly or view it on GitHub https://github.com/whitequark/parser/issues/252#issuecomment-174455340.

alexdowad commented 8 years ago

The problem is on lines 104-106 of buffer.rb:

          input.
            force_encoding(detected_encoding).
            encode(Encoding::UTF_8)

detected_encoding is Encoding::US_ASCII, as expected. As it turns out, "\xC7" is invalid even in US-ASCII. And when it is passed through encode(Encoding::UTF_8), the result is also invalid UTF-8.

whitequark commented 8 years ago

Yup, that's a Ruby bug then.

alexdowad commented 8 years ago

I guess str.encode(enc).valid_encoding? should always be true? And if it's not a valid encoding, an exception should be raised?

whitequark commented 8 years ago

No, the characters "\"\\xC7\"" are valid UTF-8.

alexdowad commented 8 years ago

It's not "\"\\xC7\"", but simply "\xC7". Just one character. That has a 1 bit in most-significant position, so it shouldn't be valid UTF-8 at the end of a string. Right?

alexdowad commented 8 years ago

Uhh. Let me clarify here. The source itself is valid. But the string literal which is lexed from the source isn't.

Although Source::Buffer can be said to respect encoding comments (in a way), in that it interprets the source file as having the indicated encoding, it transcodes the source from the indicated encoding to UTF-8. Then, the lexer treats all string literals as UTF-8.

alexdowad commented 8 years ago

Sorry for some mild confusion in my previous comment.

whitequark commented 8 years ago

Correct--and then lexer.rl will call encode_escape on the just-lexed \xc7:

      # \xff
    | 'x' xdigit{1,2}
        % { @escape = encode_escape(tok(@escape_s + 1, p).to_i(16)) }

Or at least it's supposed to.

alexdowad commented 8 years ago

encode_escape doesn't help us, because lexer.encoding is not US-ASCII. The lexer has no idea that the source was originally US-ASCII.

alexdowad commented 8 years ago

I have a fix, not sure if you'll like it.

whitequark commented 8 years ago

Well, @lexer.encoding should be US-ASCII, that's what it's supposed to do.

whitequark commented 8 years ago

Incidentally, I've missed that your tok optimizations were invalid, since this...

    def tok(s = @ts, e = @te)
      source = @source[s...e]
      return source unless @need_encode
      source.encode(@encoding)
    end

is in no way equivalent to @source[s...e].

alexdowad commented 8 years ago

Well, @lexer.encoding should be US-ASCII, that's what it's supposed to do.

It doesn't. Have a look at the patch which I am just pushing, though. Don't merge it too quickly... we need to think carefully before making a change like this.

alexdowad commented 8 years ago

...is in no way equivalent to @source[s...e].

I saw that, but there was some way I had reasoned out that the conversion was actually safe. Let me look again and see if I can remember what it was.

whitequark commented 8 years ago

It doesn't.

Actually, now I remember why and how I implemented that. I tried emitting string literals in their "proper" encoding, but that would just cause downstream headache. Tooling does not want to deal with non-ASCII-compatible (US-ASCII-compatible in Ruby parlance, not ASCII-8BIT which is an extension of ASCII) encodings, so we do not emit that.

I think the fact that this file cannot be parsed is intended behavior, which we should document. @mbj?

alexdowad commented 8 years ago

@needs_encode is only ever true if the source has been transcoded to UTF-32LE (to make indexed retrieval of characters faster).

Are the changes which I made safe if the source has been converted to UTF-32LE? I can't tell for sure without testing.

I think what happened is that I had done some performance testing, hadn't found a case where converting to UTF-32LE was actually beneficial, and had a commit where the conversion to UTF-32LE was ripped out. With that gone, the use of tok in some of those places was really and truly redundant.

You said that you have found cases where converting to UTF-32LE saves a lot of time, so I removed that commit from the PR. But the commit which eliminated tok in some places stayed.

I'll test to see if this actually breaks anything.

alexdowad commented 8 years ago

Tooling does not want to deal with non-ASCII-compatible (US-ASCII-compatible in Ruby parlance, not ASCII-8BIT which is an extension of ASCII) encodings, so we do not emit that.

Hmm. But US-ASCII is ASCII compatible.

alexdowad commented 8 years ago

[5] pry(main)> Encoding.list.select(&:ascii_compatible?)
=> [#<Encoding:ASCII-8BIT>,                                                     [0/1876]
 #<Encoding:UTF-8>,
 #<Encoding:US-ASCII>,
 #<Encoding:UTF8-MAC>,
 #<Encoding:EUC-JP>,
 #<Encoding:Windows-31J>,
 #<Encoding:Big5>,
 #<Encoding:Big5-HKSCS>,
 #<Encoding:Big5-UAO>,
 #<Encoding:CP949>,
 #<Encoding:Emacs-Mule>,
 #<Encoding:EUC-KR>,
 #<Encoding:EUC-TW>,
 #<Encoding:GB2312>,
 #<Encoding:GB18030>,
 #<Encoding:GBK>,
 #<Encoding:ISO-8859-1>,
 #<Encoding:ISO-8859-2>,
 #<Encoding:ISO-8859-3>,
 #<Encoding:ISO-8859-4>,
 #<Encoding:ISO-8859-5>,
 #<Encoding:ISO-8859-6>,
 #<Encoding:ISO-8859-7>,
 #<Encoding:ISO-8859-8>,
 #<Encoding:ISO-8859-9>,
 #<Encoding:ISO-8859-10>,
 #<Encoding:ISO-8859-11>,
 #<Encoding:ISO-8859-13>,
 #<Encoding:ISO-8859-14>,
 #<Encoding:ISO-8859-15>,
 #<Encoding:ISO-8859-16>,
 #<Encoding:KOI8-R>,
 #<Encoding:KOI8-U>,
 #<Encoding:Shift_JIS>,
 #<Encoding:Windows-1251>,
 #<Encoding:IBM437>,
 #<Encoding:IBM737>,
 #<Encoding:IBM775>,
 #<Encoding:CP850>,
 #<Encoding:IBM852>,
 #<Encoding:CP852>,
 #<Encoding:IBM855>,
 #<Encoding:CP855>,
 #<Encoding:IBM857>,
 #<Encoding:IBM860>,
#<Encoding:IBM861>,
 #<Encoding:IBM862>,
 #<Encoding:IBM863>,
 #<Encoding:IBM864>,
 #<Encoding:IBM865>,
 #<Encoding:IBM866>,
 #<Encoding:IBM869>,
 #<Encoding:Windows-1258>,
 #<Encoding:GB1988>,
 #<Encoding:macCentEuro>,
 #<Encoding:macCroatian>,
 #<Encoding:macCyrillic>,
 #<Encoding:macGreek>,
 #<Encoding:macIceland>,
 #<Encoding:macRoman>,
 #<Encoding:macRomania>,
 #<Encoding:macThai>,
 #<Encoding:macTurkish>,
 #<Encoding:macUkraine>,
 #<Encoding:CP950>,
 #<Encoding:CP951>,
 #<Encoding:stateless-ISO-2022-JP>,
 #<Encoding:eucJP-ms>,
 #<Encoding:CP51932>,
 #<Encoding:EUC-JIS-2004>,
 #<Encoding:GB12345>,
 #<Encoding:Windows-1252>,
 #<Encoding:Windows-1250>,
 #<Encoding:Windows-1256>,
 #<Encoding:Windows-1253>,
 #<Encoding:Windows-1255>,
 #<Encoding:Windows-1254>,
 #<Encoding:TIS-620>,
 #<Encoding:Windows-874>,
 #<Encoding:Windows-1257>,
 #<Encoding:MacJapanese>,
 #<Encoding:UTF8-DoCoMo>,
 #<Encoding:SJIS-DoCoMo>,
 #<Encoding:UTF8-KDDI>,
 #<Encoding:SJIS-KDDI>,
 #<Encoding:stateless-ISO-2022-JP-KDDI>,
 #<Encoding:UTF8-SoftBank>,
 #<Encoding:SJIS-SoftBank>]

whitequark commented 8 years ago

US-ASCII is, ASCII-8BIT isn't for any practical purposes, despite Ruby claiming that it is.

whitequark commented 8 years ago

Well, I guess ASCII-compatible isn't the right term here. What I'm looking for is "embedded in Unicode space". ASCII-8BIT isn't.

alexdowad commented 8 years ago

ASCII-8BIT isn't for any practical purposes

Will ASCII-8BIT literals (strings, regexps, etc.) cause "incompatible encoding" errors downstream?

alexdowad commented 8 years ago

I am just trying to see if UTF-8 regexps work on ASCII-8BIT text and vice versa.

whitequark commented 8 years ago

Yep, they can't be concatenated with UTF-8 strings for example, so trying to display messages will break.

alexdowad commented 8 years ago

Ruby 1.9 did well to add support for various text encodings to Ruby, but they way they did it sure seems to lead to a lot of pain.

alexdowad commented 8 years ago

Well this is just peachy.

whitequark commented 8 years ago

:fire:

alexdowad commented 8 years ago

It sure sucks that valid Ruby files can contain string literals which we can't represent in any way.

What if we emit strings/symbols as Encoding::BINARY if they are not representable in UTF-8, and warn the users that they should expect those dreaded binary strings?

whitequark commented 8 years ago

And what do you do with them then?

alexdowad commented 8 years ago

(Insert flippant answer here)

The fact is that the Ruby language includes string literals which are not UTF-8. If someone wants to process Ruby code, they need to face up to that fact.

If you just use them in a comparison (#==), you are OK. Nothing bad will happen.

If you need to use a regex on them, that is also fine. Nothing bad will happen.

If you want to print them out directly, that will work fine. But probably you want to concatenate them with something first? Hmm.

I've just noticed that concatenating ASCII-8BIT onto UTF-8 seems to work fine, but not binary onto UTF-8. Weird. Don't know why.

alexdowad commented 8 years ago

If you need to use a regex on them

(I haven't checked this thoroughly -- I've just tried regexes which are checking for alphanumeric characters.)

whitequark commented 8 years ago

@mbj Your opinion here? I'm inclined to bail on this ASCII-8BIT case as EWONTFIX. Utility seems extremely marginal and we already do not support non-ASCII-compatible encodings, for instance.

alexdowad commented 8 years ago

It appears that @whitequark somehow forgot to ask for my opinion too. But that's fine, you don't have to ask me... I'm just too helpful and nice to withhold my (long and detailed) opinions from people!

@whitequark, one important point for you:

There is a very big difference between a parser which can handle 100% of valid Ruby source code, and one which can handle 99.9%, or any other quantity less than 100%.

If parser can handle absolutely everything that Ruby's parser can, then all users of parser get the ability to handle all Ruby code, almost for free. That is very valuable. That is why we do crazy things to support backslash-delimited strings (for example).

If it's almost 100%, but not quite, that is a different thing altogether.

It appears that ASCII-8BIT is infectious; you can combine ASCII-8BIT strings with UTF-8 any way you want, and the UTF-8 will be quietly converted to ASCII-8BIT. (If this is wrong, I would like to know.) So there is no special burden on the users to support this case. Generally, their code should "just work".

whitequark commented 8 years ago

@alexdowad I'm aware of your opinion already:

It sure sucks that valid Ruby files can contain string literals which we can't represent in any way.

Anyway, regarding the following:

There is a very big difference between a parser which can handle 100% of valid Ruby source code, and one which can handle 99.9%, or any other quantity less than 100%.

We don't handle 100% nor we ever will. https://github.com/whitequark/parser#compatibility-with-ruby-mri.

whitequark commented 8 years ago

Sorry, wrong link copied, that should've been https://github.com/whitequark/parser#known-issues.

whitequark commented 8 years ago

It's quite telling that two out of five of those are related to encodings, too.

alexdowad commented 8 years ago

Out of the 5, 2 of them break compatibility with old versions of Ruby, but are compatible with the newest versions. That leaves 3.

whitequark commented 8 years ago

So? That doesn't really matter. In your own words, either we support 100% of Rubies we claim to support, or not.

mbj commented 8 years ago

@mbj Your opinion here? I'm inclined to bail on this ASCII-8BIT case as EWONTFIX. Utility seems extremely marginal and we already do not support non-ASCII-compatible encodings, for instance.

First: Sorry for the long reaction time, the question was from February and I simply missed the notification.

We encountered situations like this before. And as this project is well aware 100% ruby compatibility is unarchievable, also because the definition of "100% ruby compatibility" is not existent and under steady flux.

This steady flux + the "ruby is defined by its implementation mantra" creates lots of unintended edge cases, many of these upstream might not even consider desired behavior, a tool like parser could try to support. (AKA bug compatiblity).

As development time is very limited, a trade-off needs to be made. I do think this case should fall under: Not in the subset parser implements.

I do think that its much better to invest the limited development resources on parser to handle a significant subset of ruby correct instead to hunt something we cannot reach (As the definition of done does not exist!).

Whats important that parser is aware of its limitations, and does not crash with "arbitrary exceptions" like encoding errors, Parser::UnsupportedSyntaxError (nice version of: LanguageToUglyError) should be raised in cases that are decided to be out of scope.

This would ease the job for downstream authors (like me) to adjust their tools, that expect to handle "arbitrary ruby inputs", as we now can be explicitly sure on when hitting an out of scope input, or an input where parser is bugged.

Naturally I'd expect that parser over the time reduces the instances of "explicitly out of scope" to 0 (or ruby reduces its definition of "valid" in upstream, but I've no hope for this), this reduction to 0 can be handled by people interested like @alexdowad. But we always have to have a build in first class assumption of: parser will not be able to handle all input MRI can handle.

OT: Other implementations of ruby also decided to choose a "sane subset" of ruby to implement, mostly feature wise. I do think a parser is allowed to reduce syntax support to a "sane subset", as parsing syntax is its feature.

mbj commented 8 years ago

@whitequark So to be more explicit for my wall of text / rantish thing above: Can we detect this case in a way a dedicated exception from the Parser::* namespace can be raised instead of the generic EncodingError? If parser would be changed to do so, and the issue is documented: I think this issue is closed.

I do not prevent @alexdowad to try to fix this issue, but as I understand it takes a long time, and the class of "not explicitly handled by parser" syntax needs to be made explicit in the API.

whitequark commented 8 years ago

@mbj This case is different from plain "Parser cannot handle that". In fact, Parser has no problems handling string literals in weird encodings! Instead, it's the downstream tools I'm concerned about:

Tooling does not want to deal with non-ASCII-compatible (US-ASCII-compatible in Ruby parlance, not ASCII-8BIT which is an extension of ASCII) encodings, so we do not emit that.

So in principle we can add a default-off switch "give me literals in file encoding". However, this adds some testing headache.

mbj commented 8 years ago

So in principle we can add a default-off switch "give me literals in file encoding". However, this adds some testing headache.

I'd be also fine the switch would not exists and parser would bail out with "I do not want to support this ruby". Reasons above.

alexdowad commented 8 years ago

Good points from @mbj above.

However, fixing this problem does not require adding a switch or anything like that. I previously raised a key point, which has not been addressed in this discussion:

It appears that ASCII-8BIT is infectious; you can combine ASCII-8BIT strings with UTF-8 any way you want, and the UTF-8 will be quietly converted to ASCII-8BIT. (If this is wrong, I would like to know.) So there is no special burden on the users to support this case. Generally, their code should "just work".

@whitequark says that "Tooling does not want to deal with... ASCII-8BIT", but if the above paragraph is true, there is no reason why tooling doesn't want ASCII-8BIT. Tooling doesn't want BINARY, sure.

whitequark / parser

Crash when parsing symbols #252