whitequark / parser

A Ruby parser.

Crashes during escaped Unicode surrogate pairs parsing #855

Open RazrFalcon opened 2 years ago

RazrFalcon commented 2 years ago
> ruby-parse -v
ruby-parse based on parser version 3.1.2.0

> ruby-parse --32 -E -e '"\\u{D800}"'
Failed on: (fragment:0)
/Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/lexer.rb:17506:in `chr': invalid codepoint 0xD800 in UTF-8 (RangeError)
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/lexer.rb:17506:in `block in advance'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/lexer.rb:17494:in `each'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/lexer.rb:17494:in `advance'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/lexer/explanation.rb:19:in `advance'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/base.rb:252:in `next_token'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/racc/parser.rb:259:in `_racc_do_parse_c'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/racc/parser.rb:259:in `do_parse'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/base.rb:190:in `parse'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner/ruby_parse.rb:141:in `process'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:254:in `process_buffer'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:231:in `block in process_fragments'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:225:in `each'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:225:in `each_with_index'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:225:in `process_fragments'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:215:in `block in process_all_input'
    from /System/Library/Frameworks/Ruby.framework/Versions/2.6/usr/lib/ruby/2.6.0/benchmark.rb:293:in `measure'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:214:in `process_all_input'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner/ruby_parse.rb:137:in `process_all_input'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:35:in `execute'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/lib/parser/runner.rb:13:in `go'
    from /Library/Ruby/Gems/2.6.0/gems/parser-3.1.2.0/bin/ruby-parse:7:in `<top (required)>'
    from /usr/local/bin/ruby-parse:23:in `load'
    from /usr/local/bin/ruby-parse:23:in `<main>'

> ruby -v
ruby 2.6.8p205 (2021-07-07 revision 67951) [universal.arm64e-darwin21]

> ruby -e '"\\u{D800}"'
-e:1: invalid Unicode codepoint
"\u{D800}"

I would assume that U+D800...U+DFFF should be ignored.

iliabylich commented 2 years ago

I can't reproduce it locally:

$ ruby -v bin/ruby-parse --32 -E -e '"\\u{D800}"'
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
"\\u{D800}"
^~~~~~~~~~~ tSTRING "\\u{D800}"                 expr_end     [0 <= cond] [0 <= cmdarg]
"\\u{D800}"
           ^ false "$eof"                       expr_end     [0 <= cond] [0 <= cmdarg]
(str "\\u{D800}")

$ ruby -ve 'p "\\u{D800}"'
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
"\\u{D800}"

Is it related to an old version of Ruby? Could you try it on a version of Ruby that is still supported (i.e. at least 2.7)?

My hunch is that old Ruby has old Unicode support that doesn't know about these codepoints.

RazrFalcon commented 2 years ago

This is the default Ruby on macOS. I'm not sure whether you support it.

iliabylich commented 2 years ago

No, Ruby 2.6 has been deprecated since 2022-04-12. We do run tests for 2.6.10 on CI, and at least that version works well. You can use rbenv/RVM or whatever is popular these days to install a newer version of Ruby.

I'm closing it, but feel free to reopen it if the error appears again for you with maintained Ruby versions (>= 2.7).

RazrFalcon commented 2 years ago

Am I still doing something wrong?

> ruby -v
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
> /opt/homebrew/lib/ruby/gems/3.1.0/gems/parser-3.1.2.0/bin/ruby-parse --32 -E -e '"\\u{D800}"'
Failed on: (fragment:0)
/opt/homebrew/lib/ruby/gems/3.1.0/gems/parser-3.1.2.0/lib/parser/lexer.rb:17506:in `chr': invalid codepoint 0xD800 in UTF-8 (RangeError)
...

RazrFalcon commented 2 years ago

Same, but using current master:

> ruby -v bin/ruby-parse --32 -E -e '"\\u{D800}"'
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [arm64-darwin21]
Failed on: (fragment:0)
/opt/homebrew/lib/ruby/gems/3.1.0/gems/parser-3.1.2.0/lib/parser/lexer.rb:17506:in `chr': invalid codepoint 0xD800 in UTF-8 (RangeError)

iliabylich commented 2 years ago

Sorry, bash escaping issue, I should've checked this code in a separate file. My bad.

$ /bin/cat test.rb
"\u{D800}"

$ ruby -v test.rb
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
test.rb:1: invalid Unicode codepoint
"\u{D800}"

$ ruby -v bin/ruby-parse --32 test.rb
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin19]
Failed on: test.rb
/Users/ilyabylich/Work/parser/lib/parser/lexer.rb:17506:in `chr': invalid codepoint 0xD800 in UTF-8 (RangeError)
...
stacktrace
...

This is a bug and it should be fixed, reopening.

The error comes from this line (lexer.rb:17506 in the trace above). The codepoint is "D800".to_i(16) == 55296, so Ruby raises an error when converting that codepoint to a character:

=> "D800".to_i(16).chr(Encoding::UTF_8)
RangeError (invalid codepoint 0xD800 in UTF-8)

I'm pretty sure we need to catch the RangeError and emit an :invalid_unicode_escape diagnostic instead (that's what Ruby's own parser does).

I'll fix it next week, thanks for reporting.
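
For illustration, the rescue-and-report idea described above could look roughly like the sketch below. This is not the parser gem's actual code: the helper name codepoint_to_char and the InvalidUnicodeEscape error class are hypothetical stand-ins for the lexer's conversion step and its diagnostic machinery.

```ruby
# Hypothetical sketch, not the parser gem's implementation.
# InvalidUnicodeEscape stands in for an :invalid_unicode_escape diagnostic.
InvalidUnicodeEscape = Class.new(StandardError)

# Convert the hex digits of a \u{...} escape into a UTF-8 character.
# Integer#chr raises RangeError for codepoints that are invalid in UTF-8
# (such as the surrogate 0xD800), so that error is translated into a
# parser-level error instead of crashing the caller.
def codepoint_to_char(hex)
  hex.to_i(16).chr(Encoding::UTF_8)
rescue RangeError
  raise InvalidUnicodeEscape, "invalid Unicode codepoint: U+#{hex.upcase}"
end

puts codepoint_to_char("41") # a valid codepoint converts normally

begin
  codepoint_to_char("D800")
rescue InvalidUnicodeEscape => e
  puts e.message # reported as a diagnostic-style message, no RangeError
end
```

In a real lexer the rescue branch would presumably hand the error to whatever diagnostic reporting the parser already uses, rather than raising a new exception.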

RazrFalcon commented 2 years ago

Sure, no problem. I was running it in Fish and didn't even think about shell escaping differences.
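
To make the escaping difference concrete, here is a small Ruby sketch of the two source texts that the same command line can produce. It assumes that bash passes the single-quoted argument through unchanged (two backslashes), while fish collapses \\ inside single quotes to one backslash. The evaluate helper is hypothetical and exists only for this illustration.

```ruby
# What Ruby receives when bash passes the argument unchanged:
# a string literal containing an escaped backslash followed by u{D800}.
bash_source = '"\\\\u{D800}"'

# What Ruby receives when the shell collapses \\ to a single backslash
# (assumed fish behaviour): a real \u{D800} escape, which is invalid.
fish_source = '"\\u{D800}"'

# Hypothetical helper: evaluate a source fragment and report a syntax
# error as a symbol instead of letting it propagate.
def evaluate(source)
  eval(source)
rescue SyntaxError
  :syntax_error
end

puts bash_source                    # source text with two backslashes
puts evaluate(bash_source).inspect  # a plain string, no Unicode escape involved
puts fish_source                    # source text with one backslash
puts evaluate(fish_source).inspect  # Ruby rejects the surrogate codepoint
```

Only the second form exercises the \u escape handling, which would explain why the crash did not reproduce under bash.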