zesterer / chumsky

Write expressive, high-performance parsers with ease.
https://crates.io/crates/chumsky
MIT License
3.46k stars 143 forks source link

Memory leak in `recursive` when using `define` function #486

Open mattnenterprise opened 1 year ago

mattnenterprise commented 1 year ago

The recursive implementation leaks memory if it references itself and the parser definition is defined using the define function. See the following example that creates millions of parsers, but the memory is never released.

use chumsky::prelude::*;

#[derive(Debug, PartialEq)]
enum Chain {
    End,
    Link(char, Box<Chain>),
}

fn parser() -> impl Parser<char, Chain, Error = Simple<char>> {
    let mut chain = Recursive::<_, _, Simple<char>>::declare();

    chain.define(just('+')
        .then(chain.clone())
        .map(|(c, chain)| Chain::Link(c, Box::new(chain)))
        .or_not()
        .map(|chain| chain.unwrap_or(Chain::End)));

    chain
}

fn main() {
    for _n in 1..100_000_000 {
        parser();
    }
}

I originally found this issue in the jaq filter parser here.

zesterer commented 1 year ago

Hi,

It seems like jaq is using 0.9. The 1.0 release is a from-scratch rewrite of chumsky. Do you experience the same issue with 1.0?

mattnenterprise commented 1 year ago

jaq generates lots of compile errors when I try to upgrade from 0.9 to 1.0, and I'm not knowledgeable enough about the code base to upgrade them all.

I was able to reproduce the memory issue with the most current commit on main 6bb31a5f070e1289a11ea4a0193c7ccb236c81fa. I did so with the following code.

use chumsky::prelude::*;

#[derive(Debug, PartialEq)]
enum Chain {
    End,
    Link(char, Box<Chain>),
}

fn parser<'a>() -> impl Parser<'a, &'a str, Chain, extra::Err<Simple<'a, char>>> {
    let mut chain = Recursive::declare();

    chain.define(just::<_, _, extra::Err<Simple<char>>>('+')
        .then(chain.clone())
        .map(|(c, chain)| Chain::Link(c, Box::new(chain)))
        .or_not()
        .map(|chain| chain.unwrap_or(Chain::End)));

    chain
}

fn main() {
    for _n in 1..100_000_000 {
        parser();
    }
}

I actually based this example on an example in the recursive documentation here

mattnenterprise commented 1 year ago

So I installed miri to get more info about the memory leak. Both of the following were ran on main 6bb31a5f070e1289a11ea4a0193c7ccb236c81fa

Here is the example code I ran with miri.

use chumsky::prelude::*;

#[derive(Debug, PartialEq)]
enum Chain {
    End,
    Link(char, Box<Chain>),
}

fn parser<'a>() -> impl Parser<'a, &'a str, Chain, extra::Err<Simple<'a, char>>> {
    let mut chain = Recursive::declare();

    chain.define(just::<_, _, extra::Err<Simple<char>>>('+')
        .then(chain.clone())
        .map(|(c, chain)| Chain::Link(c, Box::new(chain)))
        .or_not()
        .map(|chain| chain.unwrap_or(Chain::End)));

    chain
}

fn main() {
    parser();
}

Here is the output from miri when running this example.

error: memory leaked: alloc819 (Rust heap, size: 32, align: 8), allocated here:
   --> /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:98:9
    |
98  |         __rust_alloc(layout.size(), layout.align())
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: inside `std::alloc::alloc` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:98:9: 98:52
    = note: inside `std::alloc::Global::alloc_impl` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:181:73: 181:86
    = note: inside `<std::alloc::Global as std::alloc::Allocator>::allocate` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:241:9: 241:39
    = note: inside `alloc::alloc::exchange_malloc` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:330:11: 330:34
    = note: inside `std::boxed::Box::<std::rc::RcBox<chumsky::recursive::Indirect<'_, '_, &str, Chain, chumsky::extra::Full<chumsky::error::Simple<'_, char>, (), ()>>>>::new` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:217:9: 217:20
    = note: inside `std::rc::Rc::<chumsky::recursive::Indirect<'_, '_, &str, Chain, chumsky::extra::Full<chumsky::error::Simple<'_, char>, (), ()>>>::new` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/rc.rs:397:27: 397:94
note: inside `chumsky::recursive::Recursive::<chumsky::recursive::Indirect<'_, '_, &str, Chain, chumsky::extra::Full<chumsky::error::Simple<'_, char>, (), ()>>>::declare`
   --> /home/mmccoy/code/chumsky/src/recursive.rs:128:42
    |
128 |               inner: RecursiveInner::Owned(RefC::new(Indirect {
    |  __________________________________________^
129 | |                 inner: OnceCell::new(),
130 | |             })),
    | |______________^
note: inside `parser::<'_>`
   --> examples/memory_leak.rs:10:21
    |
10  |     let mut chain = Recursive::declare();
    |                     ^^^^^^^^^^^^^^^^^^^^
note: inside `main`
   --> examples/memory_leak.rs:22:5
    |
22  |     parser();
    |     ^^^^^^^^

error: memory leaked: alloc881 (Rust heap, size: 24, align: 8), allocated here:
  --> /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:98:9
   |
98 |         __rust_alloc(layout.size(), layout.align())
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: inside `std::alloc::alloc` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:98:9: 98:52
   = note: inside `std::alloc::Global::alloc_impl` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:181:73: 181:86
   = note: inside `<std::alloc::Global as std::alloc::Allocator>::allocate` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:241:9: 241:39
   = note: inside `alloc::alloc::exchange_malloc` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:330:11: 330:34
   = note: inside `std::boxed::Box::<chumsky::combinator::Map<chumsky::combinator::OrNot<chumsky::combinator::Map<chumsky::combinator::Then<chumsky::primitive::Just<char, &str, chumsky::extra::Full<chumsky::error::Simple<'_, char>, (), ()>>, chumsky::recursive::Recursive<chumsky::recursive::Indirect<'_, '_, &str, Chain, chumsky::extra::Full<chumsky::error::Simple<'_, char>, (), ()>>>, char, Chain, chumsky::extra::Full<chumsky::error::Simple<'_, char>, (), ()>>, (char, Chain), [closure@examples/memory_leak.rs:14:14: 14:26]>>, std::option::Option<Chain>, [closure@examples/memory_leak.rs:16:14: 16:21]>>::new` at /home/mmccoy/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:217:9: 217:20
note: inside `parser::<'_>`
  --> examples/memory_leak.rs:12:5
   |
12 | /     chain.define(just::<_, _, extra::Err<Simple<char>>>('+')
13 | |         .then(chain.clone())
14 | |         .map(|(c, chain)| Chain::Link(c, Box::new(chain)))
15 | |         .or_not()
16 | |         .map(|chain| chain.unwrap_or(Chain::End)));
   | |__________________________________________________^
note: inside `main`
  --> examples/memory_leak.rs:22:5
   |
22 |     parser();
   |     ^^^^^^^^

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

note: the evaluated program leaked memory, pass `-Zmiri-ignore-leaks` to disable this check

error: aborting due to 2 previous errors

If I define the parser with the recursive function like so there is no memory leak reported by miri.

use chumsky::prelude::*;

#[derive(Debug, PartialEq)]
enum Chain {
    End,
    Link(char, Box<Chain>),
}

fn parser<'a>() -> impl Parser<'a, &'a str, Chain, extra::Err<Simple<'a, char>>> {
    recursive(|chain|
        just::<_, _, extra::Err<Simple<char>>>('+')
        .then(chain.clone())
        .map(|(c, chain)| Chain::Link(c, Box::new(chain)))
        .or_not()
        .map(|chain| chain.unwrap_or(Chain::End))
    )
}

fn main() {
    parser();
}
CraftSpider commented 1 year ago

I feel like I encountered this issue before and determined it would be pretty hard to fix, since clones of a Recursive::declare, which is strong, are themselves strong, and changing this could make it easy to accidentally drop your recursive parser:

let expr = Recursive::declare();

// Current Behavior: Strong clone - and since you now have an `Rc` containing itself, it leaks
let a = foo(expr.clone());

expr.define(a);
// If the clone was weak instead: Haha, whoops, you drop the original `expr` here and since the clone is weak it panics when you use the parser
expr.clone()
zesterer commented 1 year ago

This feels like it's effectively equivalent to full-on garbage collection in the general case, which is not really possible to solve in Rust while hiding it at an API level.

CraftSpider commented 1 year ago

Honestly, the easiest answer may be a weak_clone method that gets you a version of the parser to use inside itself.

Zij-IT commented 1 year ago

Give me a couple hours (:sweat_smile:), but I think I have a solution for this!

Zij-IT commented 1 year ago

Alright, so this is meant to be a rough-draft of a solution, but I feel like it helps to solve the original problem, and prevents the user from accidentally calling the wrong clone (weak_clone), which I feel would make it too easy to accidentally (and silently) create a memory leak.

jkylling commented 4 weeks ago

Hey! I've also encountered a memory leak with jaq because of this. I'm wondering if we can solve this use case by restricting and extending Chomsky's interface for recursive functions?

Within jaq there is some code which creates two mutually recursive parsers. The code is roughly

fn filter() -> impl Parser<Token, Spanned<Filter>, Error = Simple<Token>> + Clone {
  let parser1 = Recursive::declare();
  let parser2 = Recursive::declare();

  parser1.define(transform1(parser1.clone(), parser2.clone()));
  parser2.define(transform2(parser1.clone(), parser2.clone()));

  // At this point parser1 has incremented the strong reference count of parser2, and parser1
  // has incremented the strong reference count of parser2, so none of them will ever be dropped.
  return parser1
}

The issue is that parser1.clone() and parser2.clone() increment the strong counts of the OnceCells of the parsers. So if we change the code to use weak references when cloning instead we should not have any trouble. But not so quick. Let's say parser2.clone() in the above instead incremented the weak count of the OnceCell of parser2. Then when we get to the end of the filter function we drop parser2, since there is only one strong reference to it. When we then try to use parser1 we get a panic, as the weak reference to parser2 is no longer valid.

This suggests that parser1 and parser2 can only be dropped together. Maybe we can change the API to enforce this?

Let's add a guarding struct EitherParser which ensures that the parsers are always dropped together, and let's add recursive_bind of the following form:

#[allow(missing_docs)]
pub fn recursive_bind<
    'a,
    I: Clone,
    O1,
    P1: Parser<I, O1, Error=E> + 'a,
    P2,
    R: Into<EitherParser<P1, P2>>,
    F: FnOnce(Recursive<'a, I, O1, E>) -> R,
    E: Error<I>,
>(
    f: F,
) -> EitherParser<Recursive<'a, I, O1, E>, P2> {
    let mut left_parser = Recursive::declare();
    let either = f(
        Recursive(match &left_parser.0 {
            RecursiveInner::Owned(x) => RecursiveInner::Unowned(Rc::downgrade(x)),
            RecursiveInner::Unowned(_) => unreachable!(),
        })
    ).into();
    left_parser.define(either.0);
    EitherParser(left_parser, either.1)
}

#[allow(missing_docs)]
#[derive(Clone)]
pub struct EitherParser<P1, P2>(P1, P2);

impl <P1, P2> From<(P1, P2)> for EitherParser<P1, P2> {
    fn from(value: (P1, P2)) -> Self {
        EitherParser(value.0, value.1)
    }
}

impl <I, O, P1, P2> Parser<I, O> for  EitherParser<P1, P2>
where
    I: Clone,
    P1: Parser<I, O>,
{
    type Error = P1::Error;

    fn parse_inner<D: Debugger>(&self, debugger: &mut D, stream: &mut StreamOf<I, Self::Error>) -> PResult<I, O, Self::Error>
    where
        Self: Sized
    {
        self.0.parse_inner(debugger, stream)
    }

    fn parse_inner_verbose(&self, d: &mut Verbose, s: &mut StreamOf<I, Self::Error>) -> PResult<I, O, Self::Error> {
        self.0.parse_inner_verbose(d, s)
    }

    fn parse_inner_silent(&self, d: &mut Silent, s: &mut StreamOf<I, Self::Error>) -> PResult<I, O, Self::Error> {
        self.0.parse_inner_silent(d, s)
    }
}

impl <P1, P2> EitherParser<P1, P2> {
    fn flip(self) -> EitherParser<P2, P1> {
        EitherParser(self.1, self.0)
    }

    fn left(&self) -> &P1 {
        &self.0
    }

    fn right(&self) -> &P2 {
        &self.1
    }
}

Then we can define recursive and recursive_2 as

pub fn recursive<
    'a,
    I: Clone,
    O,
    P: Parser<I, O, Error = E> + 'a,
    F: FnOnce(Recursive<'a, I, O, E>) -> P,
    E: Error<I>,
>(
    f: F,
) -> Recursive<'a, I, O, E> {
    recursive_bind(|recursive| (f(recursive), ())).0
}

#[allow(missing_docs)]
pub fn recursive_2<
    'a,
    I: Clone,
    O1,
    O2,
    P1: Parser<I, O1, Error=E> + 'a,
    P2: Parser<I, O2, Error=E> + 'a,
    R: Into<EitherParser<P1, P2>>,
    F: FnOnce(Recursive<'a, I, O1, E>, Recursive<'a, I, O2, E>) -> R,
    E: Error<I>,
>(
    f: F,
) -> EitherParser<Recursive<'a, I, O1, E>, Recursive<'a, I, O2, E>> {
    recursive_bind(|left| recursive_bind(|right| f(left, right).into().flip()).flip())
}
...

We should also make Recursive::declare non-public, to prevent accidents with the API.

I've tested a modified version of jaq with chomsky 0.9.3 and this function, and the memory leak of jaq was gone (loop { jaq_parse::defs() } was sufficient to reproduce).