Large grammars break NPeg

Varriount commented 3 years ago

For very large grammars (such as SQL) NPeg's code generation mechanism will fail:

..\npeg\src\npeg\codegen.nim(340, 18) Error: case statement has too many cases for computed goto

(another problem here is that the error message doesn't indicate what part of the grammar is the actual cause)

zevv commented 3 years ago

Yeah, that's a tough one to diagnose because it is Nim that generates the error, not NPeg itself.

I recommend taking care of proper ordering of your rules, as stated in the manual: any rule that is declared before its use will be inlined, possible recursively. This might grow the generated code pretty quck, or even exponentially.

To minimize size, declare your grammar top down.

https://github.com/zevv/npeg#ordering-of-rules-in-a-grammar

Varriount commented 3 years ago

I'm not sure it's the order that is the problem - I suspect it might be the fact that most rules are declared like this:

some_statement <-
  variation * one * of * some * statement |
  variation * two * of * another * statement |
  ...

NPeg appears to split such top-level choices into separate states.

zevv commented 3 years ago

Could you share your (failing) grammar?

Varriount commented 3 years ago

Sure. The code is here: https://github.com/Varriount/squint

The sample.nim file is the main entrypoint. I've only been able to compile it by removing the computedGoto pragma, and using the following command for compilation:

nim c --run --maxLoopIterationsVM:100000000 -d:npegInlineMaxLen:5 .\sample.nim

Also, for some reason adding -d:npegTrace causes the resulting executable to immediately crash, with no output.

zevv commented 3 years ago

Dude. really. You made a complete SQL parser in NPeg? The agony!

zevv commented 3 years ago

Well, I'm actually impressed with NPeg to see it can handle the grammar and compilation does not take three hours. The bad news is that your grammar is indeed just too large and Nim chokes on the generated code - even when disabling inlining with -d:npegInlineMaxLen=0 your grammar still adds up to 14260 NPeg IR instructions, which is just too much for the computed goto.

The good news is that when I disable the explicit .computedGoto. pragma from the npeg codegen, it compiles just fine (although it takes some time, the generated C file is over 4KLOC).

I've prepared a branch that emits a normal case (without the .computedGoto.) for large grammers, let me know if this works or you.

Branch name: no-computed-goto-for-large-grammars

Varriount commented 3 years ago

The grammar is mostly translated from PostgreSQL's own grammar files - it did take some time to automate the translation and coax NPeg into performing the correct comparisons, but not as much as writing it all from scratch would have taken.

Regarding the rules, was my initial guess correct? Does NPeg split top-level choices into multiple states?

zevv commented 3 years ago

Right - you generated it, nice. I was already starting to question your sanity :)

About your ordered choice question: NPeg splits everything into separate states. Basically, the grammar gets translated to IR code (adding up to about 17k instructions for your grammar), where each IR instruction gets represented by a distinct state. So this is not specific to the ordered choice, the same happens for all other operators.

I'd be happy to merge the code for not emitting a computed goto for large grammars; the computed goto is only an /optimization/, it does not make the code more correct.

zevv commented 3 years ago

0.25.0 has been released, omitting computedGoto for grammars with more then 10k instructions.

zevv / npeg

Large grammars break NPeg #36