zesterer / chumsky

Write expressive, high-performance parsers with ease.
https://crates.io/crates/chumsky
MIT License
3.63k stars 155 forks source link

Pratt parser still overflows stack #660

Closed xldenis closed 3 months ago

xldenis commented 3 months ago

Even when using pratt parsers its possible to get stack overflow, which is unexpected... See the following minimal test case:

fn main() {
    use chumsky::pratt::*;
    use chumsky::prelude::*;
    let mut to_parse = "true==>".repeat(8000);
    to_parse.push_str("true");
    let parser = parser().parse(&to_parse);
    for err in parser.errors(){
        eprintln!("{err}");
    }
    assert!(!parser.has_errors());
}
use chumsky::extra::*;
use chumsky::prelude::*;
fn parser<'a>() -> impl Parser<'a, &'a str, (), Err<Simple<'a, char, SimpleSpan>>> + Clone {
    use chumsky::pratt::*;
    just("true").map(|_| ()).pratt((
        infix(right(2), just("==>"), |l, r| ()),
    ))
}

Granted this example is artificial, but in real grammars it occurs with a nesting level far less than 8000.

zesterer commented 3 months ago

Pratt parsers don't avoid the need for recursion. Any non-regular context-free grammar still theoretically requires unbounded recursion (and hence infinite memory) to parse: that's just the reality of things. If you want to avoid stack overflow happening until much later, make sure you've got the spill-stack feature enabled.

xldenis commented 3 months ago

requires unbounded recursion (and hence infinite memory)

Yea, I just figured it wouldn't be done as literal recursion on the stack, it makes sense than an arbitrarily nested set of operators would require an arbitrarily large amount of memory to parse (if only to store the results).

make sure you've got the spill-stack feature enabled.

Ah i didn't know about that feature.

xldenis commented 3 months ago

Hmm actually since spill-stack is enabled by default, this is still an issue for me.

Granted I don't know much about the implementations of PEG parsers, but I would have figured that the precedence parsing would first parse a sequence of factors / operators and then perform a fold over that to accumulate the results. In that case there would not be an explicit arbitrary recursion, no?

zesterer commented 3 months ago

But that's not how Pratt parsers work. You can definitely avoid the recursion for this case by parsing this specific pattern iteratively using the traditional combinators, but the Pratt parsing algorithm explicitly uses recursion to handle cases like this because branch behaviour can occur at each precedence level. There is nothing here that is behaving as it shouldn't.

If your goal, for unclear reasons, really is to parse the pattern true==> repeated 8,000 times, I would suggest framing the syntax differently and using different combinators (just("true").then(just("==>")).repeated() will do).

Here, try passing a similarly absurd input to rustc, a compiler that's undoubtedly of production quality:

error: rustc interrupted by SIGSEGV, printing backtrace

/playground/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/librustc_driver-a30ca400d2857f33.so(+0x3031f56)[0x7f7a2df09f56]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f7a2ad89420]
/playground/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/librustc_driver-a30ca400d2857f33.so(+0x464fba9)[0x7f7a2f527ba9]

### cycle encountered after 3 frames with period 4
/playground/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/librustc_driver-a30ca400d2857f33.so(+0x50ec657)[0x7f7a2ffc4657]
/playground/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/librustc_driver-a30ca400d2857f33.so(+0x464fe83)[0x7f7a2f527e83]
/playground/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/librustc_driver-a30ca400d2857f33.so(+0x50ec657)[0x7f7a2ffc4657]
/playground/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/librustc_driver-a30ca400d2857f33.so(+0x464fe83)[0x7f7a2f527e83]
### recursed 63 times

/playground/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/librustc_driver-a30ca400d2857f33.so(+0x50ec657)[0x7f7a2ffc4657]

note: rustc unexpectedly overflowed its stack! this is a bug
note: maximum backtrace depth reached, frames may have been lost
note: we would appreciate a report at https://github.com/rust-lang/rust
help: you can increase rustc's stack size by setting RUST_MIN_STACK=16777216
error: could not compile `playground` (bin "playground")
xldenis commented 3 months ago

In my actual grammar i'm parsing a real programming language with a complex operator precedence hierarchy, the real grammar crashes on expressions of ~100 operators since the stack usage of the pratt parser grows with the number of operators as well.

I suppose I'll have to do what you suggested and manually implement the operator parsing.

zesterer commented 3 months ago

It sounds like your requirements here are quite unusual. If it helps, the custom operator is a useful way to drop out into Rust code in the middle of chumsky to perform parsing, you will probably be able to find a more effective way of parsing the patterns you're seeing with it.

xldenis commented 3 months ago

It sounds like your requirements here are quite unusual.

I'm writing a parser for the language Boogie, which has ~20-30 binary/unary operators across 10 precedence levels, the official test suite contains examples of expressions with extremely deep nesting.

zesterer commented 3 months ago

I'll be honest, this seems like an unusual case. A grammar like that necessarily requires unbounded recursion (and hence unbounded memory) to parse in the extreme, so I think the only reasonable complaint to be made of chumsky specifically is that perhaps it uses too much stack memory per level of recursion, for which spill-stack is the solution. I'm not sure what else there is to be done to resolve this. If you really, really want to parse inputs with absurd levels of nesting, you're almost certainly better off hand-writing a parser that is optimised for that extremely specific use-case.