jamesyoungman opened 1 year ago
Do you know what sort of documentation would help alleviate this issue, out of interest?
I think the first step would be to better document the parameters of the Parser trait. The documentation of the Input token type and the "extra" trait parameter seems to be very minimal.
Beyond that, an example parser in which many of the non-terminals are parsed by functions that have unit tests.
If you want to be more ambitious, a tutorial showing the incremental migration of a non-trivial parser to Chumsky from other frameworks (though my interest is primarily nom).
I just migrated from nom and I've been using the `State` type / `with_state` functions to insert AST nodes into an arena, returning a key instead of the node itself. I would be open to contributing some documentation relating to this use case if that would be useful.
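A minimal sketch of the arena-with-keys idea in plain Rust (no chumsky involved; the types and names here are illustrative, not from my actual code):

```rust
// A minimal arena: nodes live in a Vec, and parser actions return cheap,
// Copy keys (indices) instead of owned AST nodes.
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(ExprKey, ExprKey),
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct ExprKey(usize);

#[derive(Default)]
struct Arena {
    nodes: Vec<Expr>,
}

impl Arena {
    // Insert a node and hand back its key.
    fn alloc(&mut self, node: Expr) -> ExprKey {
        let key = ExprKey(self.nodes.len());
        self.nodes.push(node);
        key
    }

    fn get(&self, key: ExprKey) -> &Expr {
        &self.nodes[key.0]
    }
}

fn main() {
    let mut arena = Arena::default();
    // In the real parser, these allocations happen inside map_with/with_state
    // actions; here we just exercise the arena directly.
    let lhs = arena.alloc(Expr::Num(1));
    let rhs = arena.alloc(Expr::Num(2));
    let sum = arena.alloc(Expr::Add(lhs, rhs));
    assert_eq!(arena.get(sum), &Expr::Add(ExprKey(0), ExprKey(1)));
}
```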
> I would be open to contributing some documentation relating to this use case if that would be useful.
That would be very much appreciated!
I've been using a function-based approach with Chumsky 0.9, exactly because I wanted to make sure I could unit test. But this leads to some issues when recursion is involved, I think (I will explore more tomorrow). And I now understand that the function signature will become more difficult in the Chumsky 1.0 world.
So let me rephrase the question: how to unit test sub-parsers? If we have an answer to that, we know what documentation to provide.
Or is the whole question somehow misguided? How would one construct parsers in a bottom-up manner otherwise? If there are alternative strategies that avoid bottom-up construction with unit testing, I'd love to understand that too, and an explanation of that in the docs would be appropriate.
I read the comment by @willothy but while storing AST nodes into an arena seems cool, I'm unclear how it would help with unit testing parts of a larger parser.
There is an example of a function-based parser with recursion at https://github.com/TX-2/TX-2-simulator/blob/eb32020298309e9a631c5ef1fd4a89c52bf8f4f8/assembler/src/asmlib/parser.rs#L77
What james linked is a good example - I commonly use a pattern kind of like the following (in 1.0):
```rust
type Input<'a> = ...;
type Extra<'a> = ...;

// Note: the output type is the macro's $ty parameter.
macro_rules! Parser {
    ($lt:lifetime, $ty:ty) => { impl Parser<$lt, Input<$lt>, $ty, Extra<$lt>> + Clone + $lt }
}

struct SubExpr {}
struct Expr {}

impl Expr {
    fn parser<'a>() -> Parser!['a, Self] {
        recursive(|expr| SubExpr::parser(expr).whatever(...))
    }
}

impl SubExpr {
    fn parser<'a>(expr: Parser!['a, Expr]) -> Parser!['a, Self] {
        whatever().then(expr)
    }
}
```
With this pattern, you can easily test sub-parsers by just invoking `Expr::parser().parse()`, or even handle testing `SubExpr::parser(...)` in limited ways by feeding it some stub that just emits an expression, like `SubExpr::parser(empty().to(Expr::Error))` or such.
I've been using this same pattern since before 0.9, and adapted it to 1.0 when it was developed, and I've found it works amazingly in both versions for me.
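To make the stub-injection idea concrete without depending on chumsky's API, here is a hand-rolled miniature: closures stand in for parsers, the sub-parser takes the expression parser as an argument, and a test feeds it a stub (all names are illustrative):

```rust
// A toy "parser" is a function from input to Option<(value, remaining input)>.
type Parser<'a, T> = Box<dyn Fn(&'a str) -> Option<(T, &'a str)> + 'a>;

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Num(i64),
    Neg(Box<Expr>),
    Error,
}

// Parses an integer literal.
fn num<'a>() -> Parser<'a, Expr> {
    Box::new(|input: &'a str| {
        let digits: String = input.chars().take_while(|c| c.is_ascii_digit()).collect();
        if digits.is_empty() {
            return None;
        }
        let value: i64 = digits.parse().ok()?;
        Some((Expr::Num(value), &input[digits.len()..]))
    })
}

// A sub-parser parameterized over the expression parser: parses `-<expr>`.
fn neg<'a>(expr: Parser<'a, Expr>) -> Parser<'a, Expr> {
    Box::new(move |input: &'a str| {
        let rest = input.strip_prefix('-')?;
        let (inner, rest) = expr(rest)?;
        Some((Expr::Neg(Box::new(inner)), rest))
    })
}

fn main() {
    // Test `neg` in isolation with a stub that always succeeds with
    // Expr::Error, analogous to SubExpr::parser(empty().to(Expr::Error)).
    let stub: Parser<'_, Expr> = Box::new(|input| Some((Expr::Error, input)));
    let p = neg(stub);
    assert_eq!(p("-x"), Some((Expr::Neg(Box::new(Expr::Error)), "x")));

    // Or test with a real sub-parser plugged in.
    let p = neg(num());
    assert_eq!(p("-42"), Some((Expr::Neg(Box::new(Expr::Num(42))), "")));
}
```

The same shape carries over to chumsky: because the sub-parser receives its dependency as a parameter, the test controls exactly what that dependency produces.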
Thank you both for these answers, I'll take a look!
After I wrote the above, I realized I have another use case for sub-parsers in my code besides unit testing. I use a small subset of the language I'm implementing in a Rust macro to help define types for functions I plug into its library.
Right now I'm using Pest in my project, and it served me well for the last few months, but the code you need to write to turn the Pest parsed result into an AST is of the same magnitude as you'd need to do with Chumsky, and it's messier without combinators.
I got things to work in Chumsky 0.9.
Chumsky 1.0a however was much more challenging, and I want to give a brief experience report on using functions that return parsers (useful for unit testing and reusing a part of the parser), and on the Chumsky 1.0 changes in general. Hopefully it can help inform documentation, or perhaps APIs. I realize that some of these are not directly actionable, but I thought I'd throw them in anyway as food for discussion.
- `filter()` now requires `any`, and the second generic type in `any` has to be the `Extra` type you define; otherwise it won't work, and the error messages aren't clear at all. This eluded me for a long time.
- `repeated` then `map` doesn't work anymore: `map` gets the unit type as an argument. That was very confusing. I have to use `collect` after `repeated` instead. Is there a way to avoid that? Couldn't `repeated` collect automatically? I need to study the docs to see what I missed. There's a common pattern where I want to parse something, then parse a repeated thing, and collect all of them in a vec; maybe there's a better way.
- I like that there's a structured way to put in state now through `Extra`; it reduces parameter passing for that use case.
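The `repeated`-then-`collect` requirement mirrors how Rust's own iterators work: the adapter is lazy and produces no container until you ask for one. A pure-Rust analogy (my analogy, not anything from the chumsky docs):

```rust
fn main() {
    // Like chumsky 1.0's `repeated()`, an iterator adapter is lazy:
    // nothing is materialized until `collect` picks a container.
    let digits = "123abc".chars().take_while(|c| c.is_ascii_digit());
    // The analogue of `p.repeated().collect::<Vec<_>>()`:
    let collected: Vec<char> = digits.collect();
    assert_eq!(collected, vec!['1', '2', '3']);
}
```

Making the output container explicit is what lets the same combinator fill a `Vec`, a `String`, or be counted without building anything.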
I like the idea that 1.0 offers performance improvements and increased flexibility, but at least for the use case of reusable parser functions, Chumsky 0.9 was a lot easier to handle than Chumsky 1.0a.
Anyway, I did get things to work with Chumsky 1.0a too, but since putting the pieces together was more complicated I thought I'd let you all know. Thanks for Chumsky!
My experience (with 1.0.0-alpha.4) closely mirrors that described by Martijn.
I must say that my comments are mostly from the perspective of someone upgrading; I see the API docs go into quite a bit of detail on the traits. That said, `repeated`/`collect` (containers?) and `all` could both do with some more attention.
> I read the comment by @willothy but while storing AST nodes into an arena seems cool, I'm unclear how it would help with unit testing parts of a larger parser.
That comment wasn't intended to be related to unit testing, just documentation :)
Chumsky really wants you to define parsers inside of a giant function to get the maximum benefit of the type checker. Using multiple functions creates quite a bit of complexity in signatures and goes against the flow of Chumsky.
So I figured out another way to be able to use (and test) sub-parsers: instead of returning a single `Parser` from the single giant `parser` function, return multiple parsers. This can be accomplished by liberal use of boxing (which is also necessary to get tolerable compile times anyway and, according to the docs, has virtually no run-time impact):
Here's a sketch of the struct you return (in 1.0a4):
```rust
struct ParserOutput<'a, I> where I: /* ...your input goes here... */ {
    sub_parser_a: Boxed<'a, 'a, I, TypeA>,
    sub_parser_b: Boxed<'a, 'a, I, TypeB>,
}
```
Then in your test (or wherever), you can extract parsers like this:

```rust
let sub_parser_a = parser().sub_parser_a;
```
To ensure parsers are boxed, call `.boxed()` on them. Remember to do this for the output of the `recursive()` function as well if you want to write recursive parsers.
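The shape of this pattern can be sketched without chumsky at all, using plain boxed closures as stand-in parsers (all names here are illustrative):

```rust
// Stand-in for a boxed parser: input in, Option<output> out.
type P<T> = Box<dyn Fn(&str) -> Option<T>>;

// The function returns several sub-parsers, not just the top-level one.
struct ParserOutput {
    digits: P<u32>,
    word: P<String>,
}

fn parsers() -> ParserOutput {
    // Both sub-parsers are built in one place (so they could share helpers
    // or recursive definitions), but each is returned separately, which
    // lets a test grab just the one it needs.
    let digits: P<u32> = Box::new(|s| s.parse().ok());
    let word: P<String> = Box::new(|s| {
        s.chars().all(|c| c.is_alphabetic()).then(|| s.to_string())
    });
    ParserOutput { digits, word }
}

fn main() {
    // Extract one sub-parser, as in `parser().sub_parser_a` above.
    let sub = parsers().digits;
    assert_eq!(sub("42"), Some(42));
    assert_eq!(sub("4x"), None);
    assert_eq!((parsers().word)("abc"), Some("abc".to_string()));
}
```

Boxing erases the concrete closure types, which is exactly what makes it possible to name the fields of the returned struct.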
There are still cases where you want to factor out functions: when you want to create reusable code that's used to construct a lot of parsers, i.e. code that takes parsers as arguments and constructs a new parser. I tried this with closures, but the type inference wasn't up to snuff, so there was little benefit. Boxed parsers are again helpful there in simplifying things; you're likely going to need this anyway to get `Clone` support.
Here's another report of further progress, in the hope it informs anyone after me reading this (or can serve as eventual documentation). I went with the approach described in the previous comment and ended up with an absolutely huge function (it's a big grammar). It worked well, but eventually vscode's clippy and/or rust analyzer really struggled, and took a long time. I boxed everything I could, but it wasn't enough, which made development more frustrating.
So I went back and split my mega function into multiple functions. There are big sub-sections of the grammar defined in each function, and I'm still returning multiple boxed parsers from each function, but putting some boundaries in the code appears to have helped and while the vscode 'clippy' phase still takes a while, it's now tolerable. And it's nicer to have organized separate things into separate functions, too.
Quick note: #481 would make `SimpleSpan` hashable; I figured that made sense to add while I was working on it.
Thank you for the guidance. If by any chance you ever have the time and inclination to contribute examples of this to the crate, it would be immensely appreciated by yours truly, and certainly many others too.
The examples all (AFAICS) use `let` to define parsers. But there are use-cases where this isn't convenient. Neither of these things seems very easy, because constructing a correct `impl Parser<...>` declaration for the parser function's return type is entirely unobvious. I mean, the parser itself seems straightforward (obviously I picked a trivial parser for the sake of the example):
but that parser can't be used anywhere (with chumsky 1.0.0-alpha.3) because the declared return type is missing a lot of trait bounds. Some guidance from the examples or the documentation would be very welcome.
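As a generic (non-chumsky) illustration of why the declared return type matters: with `impl Trait`, callers can only rely on the bounds the signature states, so a missing bound such as `Clone` makes an otherwise-fine value unusable in the ways you need. A minimal sketch with a plain closure:

```rust
// The signature promises Fn(&str) -> bool AND Clone; drop `+ Clone`
// and the `m.clone()` below stops compiling, even though the
// underlying closure is itself Clone.
fn matcher(prefix: &'static str) -> impl Fn(&str) -> bool + Clone {
    move |s| s.starts_with(prefix)
}

fn main() {
    let m = matcher("he");
    // Only possible because the signature declares `+ Clone`.
    let m2 = m.clone();
    assert!(m("hello"));
    assert!(m2("hey"));
    assert!(!m("world"));
}
```

Chumsky parsers are the same story writ large: every combinator you want to apply to the returned parser (cloning it, using it in `recursive`, etc.) must be backed by a bound in the `impl Parser<...> + ...` declaration.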