zesterer / chumsky

Write expressive, high-performance parsers with ease.
https://crates.io/crates/chumsky
MIT License
3.63k stars, 155 forks

Question: How to reduce symbol length? #672

Closed: eternal-flame-AD closed this issue 2 months ago

eternal-flame-AD commented 2 months ago

Would appreciate some advice! I read the advice section in the docs, used boxing in each function, and used choice wherever there were three or more ors. My compile time is okay (~10 seconds on an LTO'ed build, excluding dependencies), but I get giant symbol names, which feels wasteful. For example, this 29-byte function gets a 30 KB+ name. Is stripping the binary the only option here?

Some naive grepping gives me this histogram. I attached the full name of the longest symbol. Attachments: symbol.txt, image

Example of my code:

pub fn class<'tokens, 'src: 'tokens>() -> impl Parser<
    'tokens,
    ParserInput<'tokens, 'src>,
    Class<'src>,
    extra::Err<Rich<'tokens, Token<'src>, Span>>,
> + Clone {
    let tags = tags();

    let kw = just(Token::Keyword(Keyword::Class));

    let name = select! {
        Token::Ident(name) => name
    };

    let maybe_extends = just(Token::Keyword(Keyword::Extends))
        .ignore_then(path())
        .map(Some)
        .or(empty().map(|_| None));

    let body = just(Token::Ctrl('{'))
        .ignore_then(
            choice((
                var_decl()
                    .then_ignore(just(Token::Ctrl(';')))
                    .map(|vardecl| (Some(vardecl), None, None, None)),
                function().map(|func| (None, Some(func), None, None)),
                typedef().map(|typedef| (None, None, Some(typedef), None)),
                import().map(|import| (None, None, None, Some(import))),
            ))
            .repeated()
            .collect::<Vec<_>>(),
        )
        .then_ignore(just(Token::Ctrl('}')))
        .map(|body| {
            let mut members = Vec::new();
            let mut functions = Vec::new();
            let mut typedefs = Vec::new();
            let mut imports = Vec::new();
            for item in body {
                match item {
                    (Some(vardecl), None, None, None) => members.push(vardecl),
                    (None, Some(func), None, None) => functions.push(func),
                    (None, None, Some(typedef), None) => typedefs.push(typedef),
                    (None, None, None, Some(import)) => imports.push(import),
                    _ => unreachable!(),
                }
            }
            (members, functions, typedefs, imports)
        });

    tags.then_ignore(kw)
        .then(name)
        .then(maybe_extends)
        .then(body)
        .map(
            |(((tags, name), extends), (members, functions, typedefs, imports))| Class {
                tags,
                name,
                extends,
                typedefs,
                imports,
                members,
                functions,
            },
        )
}

zesterer commented 2 months ago

You can use .boxed() to switch to dynamic dispatch, which avoids large types. See here. Note that this doesn't always come with a performance hit: LLVM is often able to devirtualise the parsers and still statically knit them together. Sometimes, performance can even improve!
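
To illustrate why boxing shortens type names, here is a minimal std-only sketch. It does not use chumsky: the `Parser` trait and the `Just`, `Then`, and `Boxed` types below are hypothetical toy stand-ins, and the use of `Rc` for the erased parser is an assumption of this sketch. `std::any::type_name` is also only a best-effort approximation of what ends up in a mangled symbol, but both grow with the nesting of generic types.

```rust
use std::any::type_name;
use std::rc::Rc;

// Toy parser trait (hypothetical, not chumsky's real `Parser`).
trait Parser {
    type Output;

    // Returns the parsed value and the remaining input, or None on failure.
    fn parse<'a>(&self, input: &'a str) -> Option<(Self::Output, &'a str)>;

    // Sequencing: the result type nests one level per call.
    fn then<B: Parser>(self, other: B) -> Then<Self, B>
    where
        Self: Sized,
    {
        Then(self, other)
    }

    // Type erasure: the concrete combinator type is hidden behind `dyn`.
    fn boxed(self) -> Boxed<Self::Output>
    where
        Self: Sized + 'static,
    {
        Boxed(Rc::new(self))
    }
}

// Matches exactly one given character.
struct Just(char);

impl Parser for Just {
    type Output = char;
    fn parse<'a>(&self, input: &'a str) -> Option<(char, &'a str)> {
        let mut chars = input.chars();
        match chars.next() {
            Some(c) if c == self.0 => Some((c, chars.as_str())),
            _ => None,
        }
    }
}

// Runs `A`, then `B` on the remaining input.
struct Then<A, B>(A, B);

impl<A: Parser, B: Parser> Parser for Then<A, B> {
    type Output = (A::Output, B::Output);
    fn parse<'a>(&self, input: &'a str) -> Option<(Self::Output, &'a str)> {
        let (a, rest) = self.0.parse(input)?;
        let (b, rest) = self.1.parse(rest)?;
        Some(((a, b), rest))
    }
}

// Dynamically dispatched parser: only the output type stays visible.
struct Boxed<O>(Rc<dyn Parser<Output = O>>);

impl<O> Parser for Boxed<O> {
    type Output = O;
    fn parse<'a>(&self, input: &'a str) -> Option<(O, &'a str)> {
        self.0.parse(input)
    }
}

// Name of the static type of a value.
fn name_of<T>(_: &T) -> &'static str {
    type_name::<T>()
}

fn abcd() -> impl Parser<Output = (((char, char), char), char)> {
    Just('a').then(Just('b')).then(Just('c')).then(Just('d'))
}

fn abcd_boxed() -> Boxed<(((char, char), char), char)> {
    abcd().boxed()
}

fn main() {
    let (unboxed, boxed) = (abcd(), abcd_boxed());
    // Both parsers behave identically; only their type names differ.
    assert_eq!(unboxed.parse("abcd!"), boxed.parse("abcd!"));
    println!("unboxed: {}", name_of(&unboxed));
    println!("boxed:   {}", name_of(&boxed));
}
```

In this sketch the unboxed name repeats `Then<...>` once per combinator, while the boxed name mentions only the output type, so the savings grow with the depth of the chain.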

eternal-flame-AD commented 2 months ago

I only boxed expressions, statements, and primitives like ifs. I will try boxing them all tomorrow and report back. Thanks!

eternal-flame-AD commented 2 months ago

I boxed everything that is more than 3 levels deep and doesn't require Send (which excludes the lexer part). The longest symbol got down to about 25 KB, and the next largest is 14 KB. My build time actually got noticeably slower (I suppose LTO is trying to see through the box?), but it's still acceptable. I attached the new longest symbol here in case there is anything to see. Thanks for the help! If there is nothing more to do, feel free to close this. Attachments: symbol.txt, image

zesterer commented 2 months ago

I think that's about all there is, sadly (at least: until rustc gets a bit smarter about symbol generation for nested types). The same thing happens with complex Iterator chains, although of course chumsky's pervasive use of the combinator pattern makes it much more common here.
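
The Iterator comparison can be seen with the standard library alone. The sketch below is only an illustration: the function names `unboxed_chain` and `boxed_chain` are made up for this example, and `std::any::type_name` output is best-effort rather than the actual mangled symbol, though it shows the same nesting.

```rust
use std::any::type_name;

// Name of the static type of a value.
fn name_of<T>(_: &T) -> &'static str {
    type_name::<T>()
}

// Each adaptor wraps the type of the previous one, so the concrete
// type behind `impl Iterator` nests one level per call.
fn unboxed_chain() -> impl Iterator<Item = u32> {
    (0u32..10)
        .map(|x| x + 1)
        .filter(|x| x % 2 == 0)
        .map(|x| x * 3)
        .skip(1)
        .take(3)
}

// Boxing erases the nested adaptor types behind `dyn Iterator`.
fn boxed_chain() -> Box<dyn Iterator<Item = u32>> {
    Box::new(unboxed_chain())
}

fn main() {
    let unboxed = name_of(&unboxed_chain());
    let boxed = name_of(&boxed_chain());
    println!("unboxed ({} chars): {}", unboxed.len(), unboxed);
    println!("boxed   ({} chars): {}", boxed.len(), boxed);
}
```

Both functions yield the same items; only the name of the type the compiler has to spell out differs.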

eternal-flame-AD commented 2 months ago

Thanks for the help, and the project overall of course!