Crash when parsing symbols

mbj commented 8 years ago

The following file (reduced from: https://github.com/ruby/spec/blob/master/core/symbol/casecmp_spec.rb) crashes parser:

# -*- encoding: us-ascii -*-

p :"\xC3"

Backtrace:

/home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/builders/default.rb:172:in `to_sym': invalid encoding symbol (EncodingError)
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/builders/default.rb:172:in `symbol_compose'
        from ruby23.y:1839:in `_reduce_467'
        from /home/mbj/.rubies/ruby-2.3.0/lib/ruby/2.3.0/racc/parser.rb:259:in `_racc_do_parse_c'
        from /home/mbj/.rubies/ruby-2.3.0/lib/ruby/2.3.0/racc/parser.rb:259:in `do_parse'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/base.rb:162:in `parse'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner/ruby_parse.rb:132:in `process'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:202:in `process_buffer'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:195:in `block in process_files'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:181:in `each'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:181:in `process_files'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:159:in `block in process_all_input'
        from /home/mbj/.rubies/ruby-2.3.0/lib/ruby/2.3.0/benchmark.rb:293:in `measure'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:157:in `process_all_input'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner/ruby_parse.rb:128:in `process_all_input'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:33:in `execute'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/lib/parser/runner.rb:11:in `go'
        from /home/mbj/.gem/ruby/2.3.0/gems/parser-2.3.0.2/bin/ruby-parse:6:in `<top (required)>'
        from /home/mbj/.gem/ruby/2.3.0/bin/ruby-parse:23:in `load'
        from /home/mbj/.gem/ruby/2.3.0/bin/ruby-parse:23:in `<main>'

Ruby 2.3.0-p0 accepts it and prints:

:"\xC3"

alexdowad commented 8 years ago

One thing which has come out of this issue: ruby/ruby@dfca38e

An aside: getting up close and personal with Ruby like this makes it less and less interesting to me as a language for general development. For little scripts which process text files and such (AWK/Perl replacement), great. For little Rack apps which serve up 1 or 2 dynamic pages, great. For bigger projects, not great. Too messy.

mbj commented 8 years ago

> An aside: getting up close and personal with Ruby like this makes it less and less interesting to me as a language for general development.

Same here.

> For little Rack apps which serve up 1 or 2 dynamic pages, great. For bigger projects, not great. Too messy.

Same feelings, but:

There is a market of clients that already have "big" Ruby projects and cannot switch away from the language quickly. As we know, the "big rewrite" always fails. Most of my commercial time is spent on such clients: first fixing their Ruby to be the least painful to manage at scale (which IMO breaks a lot with typical Ruby mantras), and then, when needed (AKA when the ROI is close enough, time-wise, for business targets), slowly migrating away.

Sorry for hijacking this thread with that post.

mbj commented 8 years ago

@whitequark says that "Tooling does not want to deal with... ASCII-8BIT", but if the above paragraph is true, there is no reason why tooling doesn't want ASCII-8BIT. Tooling doesn't want BINARY, sure.

Hence I think tooling should not have to choose at all, and parser should simply raise its own exception, because it does not support this case. This is easiest for tooling: explicitly saying "no" to one in 10k files is better than failing with a random exception or, worse, having a switch that only affects this specific file.

Subsets are fine, supersets (bugs) not.

whitequark commented 8 years ago

I decided to reject such literals by default. If downstream tooling actually wants to handle them, it can opt-in by using a custom AST builder.
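
Roughly, the shape of that opt-in looks like the sketch below. Every class and method name here (`StrictBuilder`, `LenientBuilder`, `symbol_value`, `InvalidLiteralError`) is a hypothetical stand-in for illustration; none of them is parser's actual builder API, so check the parser documentation for the real hook before relying on this.

```ruby
# Hypothetical stand-ins, NOT parser's real classes or method names.
class InvalidLiteralError < StandardError; end

class StrictBuilder
  # Default behaviour: refuse literals whose bytes do not match their
  # encoding, using a dedicated error instead of a stray EncodingError.
  def symbol_value(str)
    unless str.valid_encoding?
      raise InvalidLiteralError,
            "literal has invalid #{str.encoding.name} bytes: #{str.inspect}"
    end
    str.to_sym
  end
end

class LenientBuilder < StrictBuilder
  # Opt-in behaviour: keep the raw bytes by retagging them as BINARY.
  def symbol_value(str)
    return super if str.valid_encoding?
    str.b.to_sym
  end
end

# The byte from the report, tagged with the file's declared encoding.
literal = "\xC3".dup.force_encoding(Encoding::US_ASCII)

strict_result = begin
  StrictBuilder.new.symbol_value(literal)
rescue InvalidLiteralError => e
  e
end

lenient_result = LenientBuilder.new.symbol_value(literal)

puts "strict:  #{strict_result.class}"
puts "lenient: #{lenient_result.inspect}"
```

The point of the split is that the default path fails loudly and predictably, while a tool that really wants the raw bytes makes that choice in one overridable place.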

mbj commented 8 years ago

> I decided to reject such literals by default. If downstream tooling actually wants to handle them, it can opt-in by using a custom AST builder.

+1

riking commented 6 years ago

> Tooling does not want to deal with non-ASCII-compatible (US-ASCII-compatible in Ruby parlance, not ASCII-8BIT which is an extension of ASCII) encodings, so we do not emit that.

I saw this misconception a lot in the thread and wanted to comment, even though it's about 2 years old by now.

In Ruby, ASCII-8BIT actually means "I have no idea, but it's probably text". You cannot do anything useful with it other than declare what encoding it actually is in and hope that's right. I really hate that name for it. It confuses everyone.
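
A small sketch of what "declare what encoding it actually is" means in practice. The choice of UTF-8 below is an assumption made for the example; with real data you would have to know or guess the encoding.

```ruby
# Two raw bytes with no declared text encoding (BINARY / ASCII-8BIT).
raw = "\xC3\xA9".b

# Any byte sequence counts as "valid" while it is tagged BINARY,
# and length is simply the number of bytes.
binary_ok = raw.valid_encoding?
byte_len  = raw.length

# Declare what the bytes are (assumed here to be UTF-8).
# force_encoding only changes the tag; the bytes are untouched.
text    = raw.dup.force_encoding(Encoding::UTF_8)
utf8_ok = text.valid_encoding?
char_len = text.length

# The same bytes declared as US-ASCII are invalid: both have the high bit set.
ascii    = raw.dup.force_encoding(Encoding::US_ASCII)
ascii_ok = ascii.valid_encoding?

puts "BINARY:   valid=#{binary_ok} length=#{byte_len}"
puts "UTF-8:    valid=#{utf8_ok} length=#{char_len}"
puts "US-ASCII: valid=#{ascii_ok}"
```

Nothing about the bytes changes between the three cases; only the label does, which is why guessing the wrong label gives you either garbage or an invalid string.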

whitequark commented 6 years ago

Huh, TIL. Thanks.

whitequark / parser

Crash when parsing symbols #252