zesterer / chumsky

Write expressive, high-performance parsers with ease.
https://crates.io/crates/chumsky
MIT License

Better documentation of function parsers, pretty please #403

Open jamesyoungman opened 1 year ago

jamesyoungman commented 1 year ago

The examples all (AFAICS) use let to define parsers. But there are use-cases where this isn't convenient:

  1. Unit testing - for more complex languages, it's helpful to expose parsers for parts of the language so that they can be unit-tested independently.
  2. Beginners with chumsky - I'm trying to convert from nom to chumsky and the most obvious way to proceed is bottom-up; to implement chumsky parsers for the simplest parts of the grammar first, and then build upward.

Neither of these things seems very easy, because constructing a correct impl Parser<...> declaration for the parser function's return type is entirely unobvious. I mean, the parser itself seems straightforward (obviously I picked a trivial parser for the sake of the example):

/// Accept either Unicode Dot Operator U+22C5 ("⋅") or Unicode Middle Dot U+00B7 ("·").
fn maybe_normal_dot<'srcbody, I: Input<'srcbody>, E>() -> impl Parser<'srcbody, I, bool, E> {
    one_of("\u{22C5}\u{00B7}").to(true).or_not()
}

but that parser can't be used anywhere (with chumsky 1.0.0-alpha.3) because the declared return type is missing a lot of trait bounds. Some guidance from the examples or the documentation would be very welcome.
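
For concreteness, here is one way the signature can be pinned down so the parser is actually usable, a sketch against a chumsky 1.0 release that fixes the input and error types rather than keeping them generic (and maps the Option produced by or_not back to a bool):

use chumsky::prelude::*;

/// Accept either Unicode Dot Operator U+22C5 ("⋅") or Unicode Middle Dot U+00B7 ("·").
fn maybe_normal_dot<'src>() -> impl Parser<'src, &'src str, bool, extra::Err<Rich<'src, char>>> {
    one_of("\u{22C5}\u{00B7}").or_not().map(|dot| dot.is_some())
}

#[test]
fn accepts_middle_dot() {
    assert_eq!(maybe_normal_dot().parse("\u{00B7}").into_result(), Ok(true));
}

Keeping the token and extra types generic is possible too, but then the ValueInput and ParserExtra bounds have to be spelled out explicitly, which is exactly the part the documentation currently leaves unstated.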

zesterer commented 1 year ago

Do you know what sort of documentation would help alleviate this issue, out of interest?

jamesyoungman commented 1 year ago

I think the first step would be to better document the parameters of the Parser trait. The documentation of the Input token type and the "extra" trait parameter seems to be very minimal.

Beyond that, an example parser in which many of the non-terminals are parsed by functions that have unit tests.

If you want to be more ambitious, a tutorial showing the incremental migration of a non-trivial parser to Chumsky from other frameworks (though my interest is primarily nom).

willothy commented 1 year ago

I just migrated from nom and I've been using the State type / with_state functions to insert AST nodes into an arena, returning a key instead of the node itself. I would be open to contributing some documentation relating to this use case if that would be useful.
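
As a rough illustration of that idea (the toy Expr, Arena, and Key types here are just for the sketch, assuming the map_with / parse_with_state API of the 1.0 releases):

use chumsky::prelude::*;

#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
}

/// A toy arena: nodes live in a Vec and parsers hand back indices ("keys").
#[derive(Default)]
struct Arena {
    nodes: Vec<Expr>,
}

type Key = usize;

// The arena is threaded through as parser state, so the parser outputs a Key rather than the node.
fn num<'src>() -> impl Parser<'src, &'src str, Key, extra::Full<Rich<'src, char>, Arena, ()>> {
    text::int(10).map_with(|digits: &str, e| {
        let arena: &mut Arena = e.state();
        arena.nodes.push(Expr::Num(digits.parse().unwrap()));
        arena.nodes.len() - 1
    })
}

#[test]
fn interns_into_arena() {
    let mut arena = Arena::default();
    let key = num().parse_with_state("42", &mut arena).into_result().unwrap();
    assert_eq!(arena.nodes[key], Expr::Num(42));
}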

zesterer commented 1 year ago

I would be open to contributing some documentation relating to this use case if that would be useful.

That would be very much appreciated!

faassen commented 1 year ago

I've been using a function-based approach with Chumsky 0.9, exactly because I wanted to make sure I could unit test. But this leads to some issues when recursion is involved, I think (I will explore more tomorrow). And I now understand that the function signature will become more difficult in the Chumsky 1.0 world.

So let me rephrase the question: how to unit test sub-parsers? If we have an answer to that, we know what documentation to provide.

Or is the whole question somehow misguided? How would one construct parsers in a bottom-up manner otherwise? If there are alternative strategies that avoid bottom-up construction with unit testing, I'd love to understand that too, and an explanation of that in the docs would be appropriate.

I read the comment by @willothy but while storing AST nodes into an arena seems cool, I'm unclear how it would help with unit testing parts of a larger parser.

jamesyoungman commented 1 year ago

There is an example of a function-based parser with recursion at https://github.com/TX-2/TX-2-simulator/blob/eb32020298309e9a631c5ef1fd4a89c52bf8f4f8/assembler/src/asmlib/parser.rs#L77


CraftSpider commented 1 year ago

What james linked is a good example - I commonly use a pattern kind of like the following (in 1.0):

type Input<'a> = ...;
type Extra<'a> = ...;

macro_rules! Parser {
    ($lt:lifetime, $ty:ty) => { impl Parser<$lt, Input<$lt>, $ty, Extra<$lt>> + Clone + $lt }
}

struct SubExpr {}
struct Expr {}

impl Expr {
    fn parser<'a>() -> Parser!['a, Self] {
        recursive(|expr| SubExpr::parser(expr).whatever(...))
    }
}

impl SubExpr {
    fn parser<'a>(expr: Parser!['a, Expr]) -> Parser!['a, Self] {
        whatever().then(expr)
    }
}

With this pattern, you can easily test sub-parsers by just invoking Expr::parser().parse(), or even handle testing SubExpr::parser(...) in limited ways by feeding it some stub that just emits an expression, like SubExpr::parser(empty().to(Expr::Error)) or such.

I've been using this same pattern since before 0.9 and adapted it to 1.0 as it was developed; it has worked amazingly well for me in both versions.
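
To make the shape of that pattern concrete, here is a small self-contained sketch (the type aliases, names, and toy grammar are illustrative, assuming &str input and Rich errors on a 1.0 release):

use chumsky::prelude::*;

type TInput<'a> = &'a str;
type TExtra<'a> = extra::Err<Rich<'a, char>>;

// The macro hides the repetitive `impl Parser<...>` return type.
macro_rules! Parser {
    ($lt:lifetime, $ty:ty) => {
        impl Parser<$lt, TInput<$lt>, $ty, TExtra<$lt>> + Clone + $lt
    };
}

#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Num(u32),
    Paren(Box<Expr>),
}

fn num<'a>() -> Parser!['a, Expr] {
    text::int(10).map(|digits: &str| Expr::Num(digits.parse().unwrap()))
}

fn expr<'a>() -> Parser!['a, Expr] {
    recursive(|expr| {
        num().or(expr
            .delimited_by(just('('), just(')'))
            .map(|e| Expr::Paren(Box::new(e))))
    })
}

#[test]
fn parses_nested_parens() {
    assert_eq!(
        expr().parse("((7))").into_result(),
        Ok(Expr::Paren(Box::new(Expr::Paren(Box::new(Expr::Num(7))))))
    );
}

Each non-terminal gets its own function with a readable signature, and each can be driven directly from a unit test.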

faassen commented 1 year ago

Thank you both for these answers, I'll take a look!

After I wrote the above, I realized I have another use case for sub-parsers in my code besides unit testing. I use a small subset of the language I'm implementing in a Rust macro to help define types for functions I plug into its library.

Right now I'm using Pest in my project, and it has served me well for the last few months, but the code you need to write to turn the Pest parse result into an AST is of the same magnitude as what you'd need with Chumsky, and it's messier without combinators.

faassen commented 1 year ago

I got things to work in Chumsky 0.9.

Chumsky 1.0a, however, was much more challenging, and I want to give a brief experience report on using functions that return parsers (useful for unit testing and for reusing part of the parser), and on the Chumsky 1.0 changes in general. Hopefully it can help inform documentation, or perhaps APIs. I realize that some of these points are not directly actionable, but I thought I'd throw them in anyway as food for discussion.

I like that there's a structured way to put in state now through Extra; it reduces parameter passing for that use case.

I like the idea that 1.0 offers performance improvements and increased flexibility, but at least for the use case of reusable parser functions, Chumsky 0.9 was a lot easier to handle than Chumsky 1.0a.

Anyway, I did get things to work with Chumsky 1.0a too, but since putting the pieces together was more complicated I thought I'd let you all know. Thanks for Chumsky!

jamesyoungman commented 1 year ago

My experience (with 1.0.0-alpha.4) closely mirrors that described by Martijn.


faassen commented 1 year ago

I must say that my comments are mostly from the perspective of someone upgrading; I see the API docs go into quite a bit of the traits.

That said, repeated/collect (containers?) and all could both do with some more attention.
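
For what it's worth, the change there is roughly this: in 0.9, repeated() produced a Vec directly, while in 1.0 it yields an IterParser and the output container is chosen with collect (a minimal sketch, assuming &str input and Rich errors):

use chumsky::prelude::*;

// In 1.0, `.repeated()` yields an IterParser; the container is picked via `.collect()`.
fn letters<'a>() -> impl Parser<'a, &'a str, String, extra::Err<Rich<'a, char>>> {
    just('a').repeated().at_least(1).collect::<String>()
}

// letters().parse("aaa").into_result() == Ok("aaa".to_string())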

willothy commented 1 year ago

I read the comment by @willothy but while storing AST nodes into an arena seems cool, I'm unclear how it would help with unit testing parts of a larger parser.

That comment wasn't intended to be related to unit testing, just documentation :)

faassen commented 1 year ago

Chumsky really wants you to define parsers inside one giant function to get the maximum benefit of the type checker. Using multiple functions creates quite a bit of complexity in signatures and goes against the flow of Chumsky.

So I figured out another way to use (and test) sub-parsers: instead of returning a single parser from the one giant parser function, return multiple parsers.

This can be accomplished by liberal use of boxing (which is also necessary to get tolerable compile times anyway, and which according to the docs has virtually no run-time impact).

Here's a sketch of the struct you return (in 1.0a4):

struct ParserOutput<'a, I> where I: ... your input goes here... {
   sub_parser_a: Boxed<'a, 'a, I, TypeA>,
   sub_parser_b: Boxed<'a, 'a, I, TypeB>
}

Then in your test (or wherever), you can extract parsers like this:

let sub_parser_a = parser().sub_parser_a;

To ensure parsers are boxed, call .boxed() on them. Remember to do this for the output of the recursive() function as well if you want to write recursive parsers.

There are still cases where you want to factor out functions: reusable code that takes parsers as arguments and constructs a new parser, used to build many parsers. I tried this with closures, but the type inference wasn't up to snuff, so there was little benefit. Boxed parsers are again helpful there in simplifying things; you're likely going to need them anyway to get Clone support.
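
Sketched out a bit more concretely (again illustrative rather than exact code; it assumes &str input, Rich errors, and 1.0's Boxed type with the extra type as its final parameter):

use chumsky::prelude::*;
use chumsky::Boxed;

type Extra<'a> = extra::Err<Rich<'a, char>>;

// The one big parser-building function returns several boxed sub-parsers
// instead of a single parser.
struct Parsers<'a> {
    number: Boxed<'a, 'a, &'a str, u32, Extra<'a>>,
    pair: Boxed<'a, 'a, &'a str, (u32, u32), Extra<'a>>,
}

fn parsers<'a>() -> Parsers<'a> {
    let number = text::int(10)
        .map(|digits: &str| digits.parse::<u32>().unwrap())
        .boxed();

    // Boxed parsers are cheap to clone, so sub-parsers can be reused freely.
    let pair = number
        .clone()
        .then_ignore(just(','))
        .then(number.clone())
        .boxed();

    Parsers { number, pair }
}

#[test]
fn pair_parses() {
    let pair = parsers().pair;
    assert_eq!(pair.parse("3,4").into_result(), Ok((3, 4)));
}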

faassen commented 1 year ago

Here's another report of further progress, in the hope it informs anyone after me reading this (or can serve as eventual documentation). I went with the approach described in the previous comment and ended up with an absolutely huge function (it's a big grammar). It worked well, but eventually vscode's clippy and/or rust analyzer really struggled, and took a long time. I boxed everything I could, but it wasn't enough, which made development more frustrating.

So I went back and split my mega function into multiple functions. There are big sub-sections of the grammar defined in each function, and I'm still returning multiple boxed parsers from each function, but putting some boundaries in the code appears to have helped and while the vscode 'clippy' phase still takes a while, it's now tolerable. And it's nicer to have organized separate things into separate functions, too.

CraftSpider commented 1 year ago

Quick note: #481 would make SimpleSpan hashable, since I figured that made sense to add while I was working on it.

airstrike commented 2 months ago

Here's another report of further progress, in the hope it informs anyone after me reading this (or can serve as eventual documentation). I went with the approach described in the previous comment and ended up with an absolutely huge function (it's a big grammar). It worked well, but eventually vscode's clippy and/or rust analyzer really struggled, and took a long time. I boxed everything I could, but it wasn't enough, which made development more frustrating.

So I went back and split my mega function into multiple functions. There are big sub-sections of the grammar defined in each function, and I'm still returning multiple boxed parsers from each function, but putting some boundaries in the code appears to have helped and while the vscode 'clippy' phase still takes a while, it's now tolerable. And it's nicer to have organized separate things into separate functions, too.

Thank you for the guidance. If by any chance you ever have the time and inclination to contribute examples of this to the crate, it would be immensely appreciated by yours truly and certainly many others too.