whitequark / parser

A Ruby parser.

bad UTF-8 string literal with double quote raises Parser::SyntaxError #854

Closed ko1 closed 2 years ago

ko1 commented 2 years ago

Is it intentional?

require 'parser/current'
p Parser::CurrentRuby.parse('p "bad-utf8-\xF1"')

$ ruby -v ~/src/rb/t.rb
ruby 2.7.6p219 (2022-04-12 revision c9c2245c0a) [x86_64-linux]
(string):1:3: error: literal contains escape sequences incompatible with UTF-8
(string):1: p "bad-utf8-\xF1"
(string):1:   ^~~~~~~~~~~~~~~
Traceback (most recent call last):
        9: from /home/ko1/src/rb/t.rb:2:in `<main>'
        8: from /home/ko1/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/parser-3.1.2.0/lib/parser/base.rb:33:in `parse'
        7: from /home/ko1/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/parser-3.1.2.0/lib/parser/base.rb:190:in `parse'
        6: from (eval):3:in `do_parse'
        5: from (eval):3:in `_racc_do_parse_c'
        4: from /home/ko1/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/parser-3.1.2.0/lib/parser/ruby27.rb:6946:in `_reduce_546'
        3: from /home/ko1/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/parser-3.1.2.0/lib/parser/builders/default.rb:320:in `string'
        2: from /home/ko1/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/parser-3.1.2.0/lib/parser/builders/default.rb:2262:in `string_value'
        1: from /home/ko1/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/parser-3.1.2.0/lib/parser/builders/default.rb:2274:in `diagnostic'
/home/ko1/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/parser-3.1.2.0/lib/parser/diagnostic/engine.rb:72:in `process': literal contains escape sequences incompatible with UTF-8 (Parser::SyntaxError)

ko1 commented 2 years ago

Does the phrase "including the expanded escape sequences" in https://github.com/whitequark/parser#invalid-characters-inside-comments-and-literals cover this case?

iliabylich commented 2 years ago

Yes, this is intentional. Accepting this code means returning strings with invalid encoding to downstream code.
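
To illustrate what "strings with invalid encoding" means here, the following sketch uses plain Ruby only (no parser gem) and evaluates the same literal as in the report. The variable names and the use of `scrub` as a downstream defence are illustrative assumptions, not something taken from the parser codebase.

```ruby
# In a UTF-8 source file, "\xF1" expands to the single byte 0xF1.
# That byte is the lead byte of a multi-byte UTF-8 sequence, and no
# continuation bytes follow it, so the resulting string is malformed.
s = "bad-utf8-\xF1"

p s.encoding         # the source encoding, UTF-8 by default
p s.valid_encoding?  # false: the byte sequence is not valid UTF-8

# Code that needs real characters can fail on such a string,
# for example when transcoding it.
begin
  s.encode("UTF-16LE")
rescue Encoding::InvalidByteSequenceError => e
  puts "transcoding failed: #{e.class}"
end

# Illustrative defence for downstream code: replace the bad bytes.
cleaned = s.scrub("?")
p cleaned
p cleaned.valid_encoding?
```

Ruby itself accepts the literal and hands back the malformed string; the parser gem instead reports it as an error so that consumers of the AST never see such a value.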

This change was introduced in https://github.com/whitequark/parser/commit/95401a20e8f4532e32f6361da3918ac8e4bd18c7. It has been reported (and rejected) multiple times. If you really want to handle these invalid strings, the workaround is quite simple.

ko1 commented 2 years ago

Thank you so much for the very quick response!