whitequark / parser

A Ruby parser.
Other
1.57k stars 197 forks source link

Parsing code with an invalid encoding raises #998

Closed Earlopain closed 4 months ago

Earlopain commented 4 months ago
Parser::CurrentRuby.parse("# encoding: utf")

# /home/earlopain/.rbenv/versions/3.3.0/lib/ruby/gems/3.3.0/gems/parser-3.3.0.5/lib/parser/source/buffer.rb:69:in `find': unknown encoding name - utf (ArgumentError)
#
#          Encoding.find(result[3] || result[4] || result[6])

Ruby does raise an error when trying to execute the file:

[earlopain@DESKTOP-PC whitequark-parser]$ ruby test.rb
test.rb:1: unknown encoding name: utf (ArgumentError)

It doesn't look like a SyntaxError, not quite sure what this actually is, where it's coming from. However I believe the parser gem here shouldn't raise an ArgumentError in this case. This makes tools like RuboCop raise as well when trying to analyze these invalid files.

iliabylich commented 4 months ago

What Ruby does also looks wrong to me but parser simply follows this behaviour. Ripper also throws ArgumentError:

> Ripper.sexp("# encoding: utf\n")
-:1: unknown encoding name: utf (ArgumentError)

To me this error looks very special, a set of encodings that is supported by your Ruby implementation/parser is not defined by any spec, and if it's not well-defined then it can't be a SyntaxError (because it's not a part of the syntax itself).

What should be thrown/returned in your opinion?

Earlopain commented 4 months ago

I'm not quite sure what the right thing to do would be to be honest.

This is what Prism returns:

irb(main):002> Prism.parse("# encoding: utf\n")
=> 
#<Prism::ParseResult:0x000075467d799cb0
 @comments=[#<Prism::InlineComment @location=#<Prism::Location @start_offset=0 @length=15 start_line=1>>],
 @data_loc=nil,
 @errors=
  [#<Prism::ParseError @message="unknown or invalid encoding in the magic comment" @location=#<Prism::Location @start_offset=12 @length=3 start_line=1> @level=:argument>],
 @magic_comments=[#<Prism::MagicComment @key="encoding" @value="utf">],
 @source=#<Prism::Source:0x000075467d8a2af8 @offsets=[0, 16], @source="# encoding: utf\n", @start_line=1>,
 @value=
  @ ProgramNode (location: (1,0)-(1,0))
  ├── locals: []
  └── statements:
      @ StatementsNode (location: (1,0)-(1,0))
      └── body: (length: 0),
 @warnings=[]>

They appear to have the luxury of defining each encoding themselves and since they only support one Ruby version currently it's not that complicated https://github.com/ruby/prism/blob/a09fc1f913fed81f820c23534674f3d1c1695aa2/src/encoding.c#L5011-L5198

I don't think that doing this here is worth the effort.

Before reporting I made this commit https://github.com/Earlopain/whitequark-parser/commit/ce9f99ac1a0acea4a5f1feec62196b365d8ff585 which just returns nil when an unrecognized encoding is encountered. This results in the default encoding of the ruby version to be used. Does that sound like an acceptable solution to you?

iliabylich commented 4 months ago

This results in the default encoding of the ruby version to be used. Does that sound like an acceptable solution to you?

No, it makes parser swallow the error (which IMO is the worst of "return something special"/"throw something"/"return nothing" options). Invalid encoding should result in an error of some kind, but I'm not sure that we can find a way that is reasonable and backward-compatible at the same time (even the former is not easy by itself).

Earlopain commented 4 months ago

A subclass of ArgumentError, like Parser::UnknownEncodingInMagicComment? If there are consumers that currently handle this behaviour they should continue to work with this change.

This would make the error more actionable. You can currently catch the ArgumentError of course but having something concrete like that is just nicer.

iliabylich commented 4 months ago

Agree, makes sense to me. It doesn't make this error "less special" but at least it'll be trivial to catch it.

Feel free to send a PR.

Earlopain commented 4 months ago

Will do, thank you 👍

koic commented 4 months ago

Note, Prism performs error-tolerant parsing. It would be good to keep in mind that there can be differences due to encoding:

Invalid Encoding

$ ruby -rprism -e 'p Prism.parse_success?("# encoding: utf")'
false

$ ruby -rprism -e 'p Prism.parse("# encoding: utf")'
#<Prism::ParseResult:0x00007fe565ab7c08 @value=@ ProgramNode (location: (1,0)-(1,0))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,0))
    └── body: (length: 0)
, @comments=[#<Prism::InlineComment @location=#<Prism::Location @start_offset=0 @length=15 start_line=1>>], @magic_comments=[#<Prism::MagicComment @key="encoding" @value="utf">], @data_loc=nil, @errors=[#<Prism::ParseError @message="unknown or invalid encoding in the magic comment" @location=#<Prism::Location @start_offset=12 @length=3 start_line=1> @level=:argument>], @warnings=[], @source=#<Prism::Source:0x00007fe565abc140 @source="# encoding: utf", @start_line=1, @offsets=[0]>>

Valid Encoding

$ ruby -rprism -e 'p Prism.parse_success?("# encoding: utf-8")'
true

$ ruby -rprism -e 'p Prism.parse("# encoding: utf-8")'
#<Prism::ParseResult:0x00007fd2ff857c88 @value=@ ProgramNode (location: (1,0)-(1,0))
├── locals: []
└── statements:
    @ StatementsNode (location: (1,0)-(1,0))
    └── body: (length: 0)
, @comments=[#<Prism::InlineComment @location=#<Prism::Location @start_offset=0 @length=17 start_line=1>>], @magic_comments=[#<Prism::MagicComment @key="encoding" @value="utf-8">], @data_loc=nil, @errors=[], @warnings=[], @source=#<Prism::Source:0x00007fd2ff85c080 @source="# encoding: utf-8", @start_line=1, @offsets=[0]>>
iliabylich commented 4 months ago

@koic Of course I fully support the idea of returning as much data in case of an error as possible, but what prism does can be unsafe, especially this part from your output above:

@value="utf-8"

There are no guarantees that what comes after # encoding is a valid UTF-8 string and as a consequence it may return a broken string which may trigger an error later. Ideally an AST that is returned from the parser must e 100% safe, or otherwise you'll need a ton of rescue EncodingError wrappers in the downstream code.

IMO the only safe context of the error are source maps which are somewhat obvious, and like I said before I think this error is very special, the whole parsing output should look like this:

type Output =
  | OK(<ast>, <syntax errors/warnings>, ...)
  | EncodingError(<location>)

which at the end makes no sense. This error is so unusual that it becomes a leaky abstraction that infects types and contracts, probably it can only be "fixed" by deprecating it.