yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License

Heavy stack usage -- option to throw exception if stack usage exceeds X #172

Closed by maiple 3 years ago

maiple commented 3 years ago

The stack usage is quite heavy, and I have been experiencing stack overflows even when using a 400 MB fiber (i.e. a temporary stack). The overflow causes a segfault, which I can't catch. It would be helpful if there were a way to configure cpp-peglib to monitor its own stack usage and throw an exception if it exceeds a certain depth.

Example

Parsing this string with this grammar used 159529313 bytes (about 159 MB) of stack:

grammar:

    ROOM <- ~COMMENT* '<room>'i (~COMMENT* ROOMPROP)* ~COMMENT* '</room>'i
    ROOMPROP <- RP_OTHER
    RP_OTHER <- '<' < [a-zA-Z0-9]+ > '/>' / '<' $tag< [a-zA-Z0-9]+ > '>' (RP_OTHER* / TEXT) '</' $tag '>'
    TEXT <- ([^<>&] / ESCAPED_CHAR)+
    ESCAPED_CHAR <- '&' [^a-zA-Z0-9]+ ';'
    COMMENT <- '<!--' [^>]* '>'
    %whitespace  <-  [ \t\r\n]*

string:

<!--comment-->
<room>
  <caption></caption>
  <width>256</width>
  <height/>
</room>

yhirose commented 3 years ago

@maiple, I tried to reproduce the problem with peglint, but it seems your grammar has a problem.

# I copied the grammar to `a.peg` and the string to `a.xml`.
> peglint a.peg a.xml
a.xml:4:10: syntax error, unexpected '256', expecting <RP_OTHER>.

Anyway, could you take a look at the following example, which is in the README? (I added a NOTE: comment that explains how to properly release captures.)

peg::parser parser(R"(
  ROOT      <- CONTENT
  CONTENT   <- (ELEMENT / TEXT)*
  ELEMENT   <- $(STAG CONTENT ETAG) # NOTE: This introduces a scope and guarantees that all the captures will be released when we leave the scope.
  STAG      <- '<' $tag< TAG_NAME > '>'
  ETAG      <- '</' $tag '>'
  TAG_NAME  <- 'b' / 'u'
  TEXT      <- TEXT_DATA
  TEXT_DATA <- ![<] .
)");

One thing you could do is to wrap the body of the RP_OTHER rule in $(...), like this:

RP_OTHER <- $('<' < [a-zA-Z0-9]+ > '/>' / '<' $tag< [a-zA-Z0-9]+ > '>' (RP_OTHER* / TEXT) '</' $tag '>')

Hope it helps!

maiple commented 3 years ago

@yhirose I'm not sure why you are seeing that error. The grammar seems fine to me, and it works correctly on my machine.

Regardless, the problem in this issue is not the grammar per se, but the excessive stack space. Other grammars also suffer from heavy stack use. One solution would be to allow a configuration to cap stack usage.
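
To make the request concrete, here is a minimal sketch of the kind of cap I have in mind. It is not cpp-peglib code: `DepthGuard`, `DepthLimitExceeded`, and `NestingParser` are hypothetical names for a toy hand-written recursive-descent parser, and the limit counts recursion depth as a rough proxy for stack usage rather than measuring bytes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical exception type; not part of cpp-peglib.
struct DepthLimitExceeded : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// RAII guard: increments a depth counter on entry and decrements it on exit.
// Throws before incrementing if the limit has already been reached.
class DepthGuard {
public:
  DepthGuard(std::size_t &depth, std::size_t limit) : depth_(depth) {
    if (depth_ >= limit) {
      throw DepthLimitExceeded("recursion depth limit exceeded");
    }
    ++depth_;
  }
  ~DepthGuard() { --depth_; }
  DepthGuard(const DepthGuard &) = delete;
  DepthGuard &operator=(const DepthGuard &) = delete;

private:
  std::size_t &depth_;
};

// Toy parser for balanced parentheses: S <- ('(' S ')')*
class NestingParser {
public:
  explicit NestingParser(std::size_t max_depth) : max_depth_(max_depth) {}

  // Returns true if the whole input matches; throws DepthLimitExceeded
  // if the recursion would go deeper than max_depth.
  bool parse(const std::string &input) {
    depth_ = 0;
    std::size_t pos = 0;
    return parse_seq(input, pos) && pos == input.size();
  }

private:
  bool parse_seq(const std::string &s, std::size_t &pos) {
    DepthGuard guard(depth_, max_depth_); // one guard per recursive call
    while (pos < s.size() && s[pos] == '(') {
      ++pos;
      if (!parse_seq(s, pos)) return false;
      if (pos >= s.size() || s[pos] != ')') return false;
      ++pos;
    }
    return true;
  }

  std::size_t max_depth_;
  std::size_t depth_ = 0;
};
```

With something like this, a caller can catch `DepthLimitExceeded` and report an error instead of crashing with an uncatchable segfault. A depth count only approximates stack usage, since the bytes consumed per level depend on the frame sizes involved, so the limit would have to be chosen conservatively.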