zesterer / chumsky

Write expressive, high-performance parsers with ease.
https://crates.io/crates/chumsky
MIT License
3.63k stars 155 forks source link

Incorrect error messages when using or() #397

Open TGMM opened 1 year ago

TGMM commented 1 year ago

I'm working on a parser for a typescript-like language. I initially used nom, but I switched to chumsky because I really like the error generating and the synergy with ariadne. But I noticed just today that the parser gives incorrect error messages when using the or combinator. I should mention I'm using logos for the lexer and I'm using the latest commit of the chumsky crate.

The error is mainly here and here.

To give a quick explanation, this parser is trying to parse something like let x = 10;. The issue arises when the statement end token is missing (which is ;). When using the first option in the or combinator (which tries to parse an int), the error is correct, as it displays: image

But when the value is something other than the first option (be it a bool or a float) it displays an incorrect message: image image The message is incorrect because false is a valid value and it should not be replaced by ;, but instead be followed by it. The parser should "know" that it correctly parsed a primitive_val (which is the bool or float val) and display the error message regarding the missing ;. I'd expect a message such as: image

You can try it yourself by cloning the repo and running cargo run, only changing the input from z = 10.0 to z = 10 to see the incorrectly and correctly displayed message respectively.

Am I doing something wrong or is this intended behavior?

zesterer commented 1 year ago

This is interesting. Do you get this with other error types such as Simple?

TGMM commented 1 year ago

Yes, when using simple I get the following errors When not using the first variant of the or: image When using the first variant of the or: image

zesterer commented 1 year ago

Do you think you could try to reduce the example parser / input to a minimum failing case?

TGMM commented 1 year ago

Yes, I think this is the example with the least code that still produces the bug. It seems related to Streams, since I couldn't reproduce it with regular string parsers. To give you a quick rundown, this parser just tries to parse either "f;" or "t;". We force the error by intentionally omitting the ';', I included both the correct and incorrect errors. I should note that if you change the order of the or combinator, only the one that's not the first variant gets the incorrect error.

TGMM commented 1 year ago

After analyzing this for some time, or should I say a LOT (this library is extremely complicated, props to you for maintaining it). I realized that not only does this work with non-streaming parsers, but it's not related to the or() combinator at all.

I noticed the choice combinator (which is used by or under the hood) saves the errors on each alternative (which is totally intended), and replaces them or merges them depending on the position of the next found error.

This issue occurs because None (which is the end of input) shares the same priority as the last character of the input. Thus, when comparing the error, the parser decides that, since they share the same priority, they should be merged. This all happens here.

The solution I thought about is to add a conditional to the Ordering:Equal pattern to give priority to the incoming error if the found Token is None, which would look something like this.

This solution passes all tests, except for one on the pratt parsing module. This can be fixed by changing the 1..2 range into 2..2 (which I believe should be the correct range since EOF doesn't technically occupy any space) like such.

Since I don't know if this is the best solution, I didn't open a PR, but please let me know if you would want me to. I hope this info is useful to you, and many thanks for creating and maintaining this library.

zesterer commented 1 year ago

This seems like a reasonable approach! Error prioritisation is one of those things I've put on the back foot for 1.0 in favour of sorting out the API (exact semantics of errors can be changed later without breaking semver), so I'm sure your improvement does the job. I'd be happy to review a PR for this :)