zevv / npeg

PEGs for Nim, another take
MIT License

More/larger examples of Error reporting #43

Closed rishavs closed 1 year ago

rishavs commented 2 years ago

Hi

I am really not able to wrap my head around error reporting. I understand how to use it in small fragments. But I am not sure how to structure the grammar so that the error reporting is effective. Any longer example code here or general tips would be helpful.

thanks Rishav

zevv commented 2 years ago

Hi Rishav,

Yes, error handling in PEGs is not trivial, and I do not have any complete examples at hand. Basically, there are two options to handle errors in a PEG parser:

The simple solution is just to let it fail; NPeg will report this by setting the ok field of the match result to false. It will also fill in matchLen and matchMax for you; the latter is a pretty good indication of where your error occurred, that is, where your parser was not able to continue. From that you could generate a generic error message telling the user there is a syntax error, and probably provide a line and column number and a little snippet of the subject string to show the user where the error was.
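As a rough sketch of this first option, the snippet below turns matchMax into a line and column number. The lineCol helper, the toy "list" grammar and the sample subject are made up for illustration and are not part of NPeg; only match, ok and matchMax come from the library.

```nim
import npeg

# Hypothetical helper (not part of NPeg): convert a 0-based offset into
# the subject string to a 1-based (line, column) pair.
proc lineCol(subject: string, offset: int): tuple[line, col: int] =
  result = (line: 1, col: 1)
  for i in 0 ..< min(offset, subject.len):
    if subject[i] == '\n':
      inc result.line
      result.col = 1
    else:
      inc result.col

# Toy grammar for illustration: a comma separated list of numbers.
let parser = peg "list":
  number <- +Digit
  list   <- number * *(',' * number) * !1

let subject = "1,2,x,4"
let r = parser.match(subject)
if not r.ok:
  # matchMax is the furthest position the parser reached before giving up.
  let (line, col) = lineCol(subject, r.matchMax)
  echo "syntax error near line ", line, ", column ", col
```

The reported position is only as precise as matchMax itself, so the message is best phrased as "near" rather than "at".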

There is another mechanism allowing a bit more control, and likely better error messages, but this requires more work when writing the grammar: for this you can use the E atom as described in the manual at https://github.com/zevv/npeg#parsing-error-handling. The E atom will abort parsing with an exception and pass you the string as error message; this is typically used in conjunction with the ordered choice operator |, for example:

number <- +{'0'..'9'} | E"number"

The above rule says: match one or more occurrences of any character from the set 0..9, or, when this fails to match, generate an error saying Parsing error at #14: expected "number".

The E atom allows you to generate error messages that are more helpful and can hint to your users what is wrong in their text: saying "expected a number" is nicer than "syntax error at line 30 char 22".
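A minimal sketch of catching the error raised by the E atom is shown below. The "assignment" grammar and the tryParse helper are invented for this example. It assumes the exception can be caught as NPegException, which is the base type of NPeg's exceptions in the versions I am aware of; check the manual for the exact type in your version.

```nim
import npeg

# Toy grammar for illustration: "<name>=<number>".
let parser = peg "assignment":
  number     <- +Digit | E"number"
  assignment <- +Alpha * '=' * number * !1

# Hypothetical helper: run the parser and fold the three possible
# outcomes (match, plain failure, E error) into one string.
proc tryParse(s: string): string =
  try:
    if parser.match(s).ok:
      result = "ok"
    else:
      result = "no match"
  except NPegException as e:
    # Assumption: E raises an exception derived from NPegException.
    result = "error: " & e.msg

echo tryParse("x=12")   # number matches, no error is raised
echo tryParse("x=abc")  # E"number" is reached and raises
echo tryParse("=12")    # fails before E is reached: plain non-match
```

The last call shows that E only fires when the parser actually reaches it; input that fails earlier still ends up as an ordinary failed match.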

I'm afraid the above does not really add anything to the manual, let me know if this is helpful and if there is any way we can improve the documentation on this.

zevv commented 2 years ago

People much smarter than me have also been thinking very hard about this problem:

http://www.inf.puc-rio.br/~roberto/docs/sblp2013-1.pdf

rishavs commented 2 years ago

The E atom allows you to generate error messages that are more helpful and can hint to your users what is wrong in their text: saying "expected a number" is nicer than "syntax error at line 30 char 22".

I understand how to use the E atom - my struggle is simply not being able to understand how to structure my grammar so that I can use the E atom effectively.

For example this is my grammar (just handles var declarations and Type declarations)

let parser = peg "program":

    # Tokens
    tkTypeOperator              <- ':' * *Space
    tkEqualsOperator            <- '=' * *Space
    tkCommaOperator             <- ',' * *Space
    tkDotOperator               <- '.'
    tkOptionsOperator           <- '|' * *Space
    tkDummyOperator             <- '_'

    # Keywords
    kwVar                       <- "var " * *Space
    kwConst                     <- "const " * *Space
    kwPrimaryTypes              <- ("Num" | "Bool" | "Void" | "Any") * *Space
    keywords                    <- kwVar | kwConst | kwPrimaryTypes

    # Literals
    litInteger                  <- Digit * *(Digit | tkDummyOperator)
    litDecimal                  <- litInteger * tkDotOperator * litInteger 
    litNumber                   <- litDecimal | litInteger
    litBooleanValues            <- "true" | "false"
    literal                     <- litNumber | litBooleanValues

    # Expressions
    expression                  <- literal * *Space

    # Types
    typeOptions                 <- kwPrimaryTypes * *(tkOptionsOperator * kwPrimaryTypes)
    typeDef                     <- typeOptions 

    identifier                  <- !keywords * (+Lower * *(Alnum | tkDummyOperator)) * *Space
    typeIdentifier              <- +Upper * *(Alnum | tkDummyOperator) * *Space

    identDeclaration            <- identifier * ?(tkTypeOperator * typeDef)
    varDeclarationList          <- kwVar * identDeclaration * *(tkCommaOperator * identDeclaration)
    constDeclarationList        <- kwConst * identDeclaration * *(tkCommaOperator * identDeclaration)

    assignmentList              <- identifier * *(tkCommaOperator * identifier) * tkEqualsOperator * expression * *(tkCommaOperator * expression)
    varDeclareAndAssignList     <- varDeclarationList * tkEqualsOperator * expression * *(tkCommaOperator * expression)
    constDeclareAndAssignList   <- constDeclarationList * tkEqualsOperator * expression * *(tkCommaOperator * expression)

    eof                         <- !1
    statement                   <- varDeclareAndAssignList | constDeclareAndAssignList | varDeclarationList | constDeclarationList | assignmentList
    program                     <- *Space * +statement * *Space * eof

Clearly, adding the error atom on statement will not be effective. Instead, I should probably look at adding it at a "literal"-like level, but the way I have my literals right now, I am not sure it will be much help either.

I assume that I should be looking at creating option chains (x | y | z) in the parts where I want the error to be handled. But I just don't know how to restructure my grammar for that without having to rewrite the entire thing every time I discover that my hierarchy definition is not efficient.

Essentially, what would help me is understanding the best practices for structuring the grammar, or seeing some examples of well-done grammars with error handling, to understand how I should go about it.

Anyway, thank you for weighing in. Maybe the only solution is to just jump in with both feet.

zevv commented 2 years ago

Hmm, I think I see your problem: for example, you want to get a literal, which is defined as litNumber | litBooleanValues. If you were to add an error to litNumber like litDecimal | litInteger | E"number", this error would be thrown before you get a chance to try a boolean.

In this case, your error would need to be at the literal level; you could add an intermediate rule that implements the error without having to rewrite every call site of the literal rule:

literal2                    <- litNumber | litBooleanValues
literal                     <- literal2 | E"literal"

I'm not good at naming things, thus the literal2
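To make the intermediate-rule idea concrete, here is a cut-down sketch built from the literal rules of the grammar above. The name literalInner (instead of literal2), the tryParse helper and the test inputs are assumptions for illustration, and the exception is again assumed to be catchable as NPegException.

```nim
import npeg

let parser = peg "expression":
  litInteger       <- Digit * *(Digit | '_')
  litDecimal       <- litInteger * '.' * litInteger
  litNumber        <- litDecimal | litInteger
  litBooleanValues <- "true" | "false"

  # All alternatives are tried first; E is only reached if none match.
  literalInner     <- litNumber | litBooleanValues
  literal          <- literalInner | E"literal"

  expression       <- literal * *Space * !1

# Hypothetical helper: report either success or the error message.
proc tryParse(s: string): string =
  try:
    if parser.match(s).ok:
      result = "ok"
    else:
      result = "no match"
  except NPegException as e:
    # Assumption: E raises an exception derived from NPegException.
    result = "error: " & e.msg

echo tryParse("42")     # matched by litNumber
echo tryParse("true")   # matched by litBooleanValues, no error raised
echo tryParse("maybe")  # neither alternative matches, so E"literal" fires
```

Because the error sits behind the complete ordered choice, a boolean is still accepted, and call sites such as expression keep referring to literal unchanged.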

Would something like that work for you?

zevv commented 1 year ago

Closing this because of inactivity, feel free to reopen if needed.