whitequark / parser

A Ruby parser.
Other
1.57k stars 197 forks source link

Parser raises for certain invalid encoding regexp #1032

Closed Earlopain closed 1 month ago

Earlopain commented 1 month ago
Parser::CurrentRuby.parse("/あ/n")
# => .../gems/parser-3.3.3.0/lib/parser/builders/default.rb:2261:in 'String#encode': U+3042 from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)

Offending source: https://github.com/whitequark/parser/blob/570e06520b81a107948d10fadaea89bd612b9a8d/lib/parser/builders/default.rb#L2249-L2268

Here's what prism has to say about this:

Prism.parse("/あ/n")
=> 
#<Prism::ParseResult:0x00007fb448c09438
 @comments=[],
 @data_loc=nil,
 @errors=
  [#<Prism::ParseError @type=:regexp_encoding_option_mismatch @message="regexp encoding option 'n' differs from source encoding 'UTF-8'" @location=#<Prism::Location @start_offset=6 @length=0 start_line=1> @level=:syntax>,
   #<Prism::ParseError @type=:regexp_non_escaped_mbc @message="/.../n has a non escaped non ASCII character in non ASCII-8BIT script: /あ/" @location=#<Prism::Location @start_offset=6 @length=0 start_line=1> @level=:syntax>],
 @magic_comments=[],
 @source=#<Prism::Source:0x00007fb448b865d8 @offsets=[0], @source="/あ/n", @start_line=1>,
 @value=
  @ ProgramNode (location: (1,0)-(1,6))
  ├── locals: []
  └── statements:
      @ StatementsNode (location: (1,0)-(1,6))
      └── body: (length: 1)
          └── @ RegularExpressionNode (location: (1,0)-(1,6))
              ├── flags: ascii_8bit
              ├── opening_loc: (1,0)-(1,1) = "/"
              ├── content_loc: (1,1)-(1,4) = "あ"
              ├── closing_loc: (1,4)-(1,6) = "/n"
              └── unescaped: "あ",
 @warnings=
  [#<Prism::ParseWarning @type=:void_statement @message="possibly useless use of a literal in void context" @location=#<Prism::Location @start_offset=0 @length=6 start_line=1> @level=:verbose>]>

(Originally raised at https://github.com/ruby/prism/issues/2957 but I think it makes more sense here)

I think, a pretty simple solution would be to simply catch the encoding error and wrap it into a diagnostic. WDYT?

iliabylich commented 1 month ago

a pretty simple solution would be to simply catch the encoding error and wrap it into a diagnostic

Yes, it makes sense. Would you like to send a PR? If not I'll do it myself in a few days.