xsawyerx / guacamole

Guacamole is a parser toolkit for Standard Perl. It provides fully static BNF-based parsing capability to a reasonable subset of Perl.
https://metacpan.org/pod/Guacamole

Syntax: Stop parsing at __END__ and __DATA__ #21

Closed xsawyerx closed 4 years ago

xsawyerx commented 4 years ago

It would be great if we could detect __END__ and __DATA__ and stop parsing then and there.

xsawyerx commented 4 years ago

I thought about this a lot. Here's where I'm at:

This means we have two options:

  1. We either have a rule that handles them within Marpa
    • If we see ^__DATA__$, we gobble everything until EOF or ^__END__$
    • If we see ^__END__$, we stop parsing entirely
  2. We remove these elements prior to parsing it in Marpa, basically an input cleanup phase
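A minimal sketch of option 2, assuming the cleanup phase only needs to cut the input at the first line consisting solely of `__END__` or `__DATA__`. It is written in Python purely for illustration; `strip_trailing_sections` is a hypothetical name and not part of Guacamole or Marpa.

```python
import re

# Hypothetical cleanup helper, not part of Guacamole.
# Matches a line that consists solely of __END__ or __DATA__.
_MARKER = re.compile(r"^__(?:END|DATA)__$", re.MULTILINE)


def strip_trailing_sections(source: str) -> str:
    """Return the source up to (not including) the first marker line."""
    match = _MARKER.search(source)
    if match is None:
        # No marker found: hand the input to the parser unchanged.
        return source
    return source[:match.start()]


if __name__ == "__main__":
    code = "print 1;\n__END__\nthis is not Perl\n"
    print(strip_trailing_sections(code), end="")
```

A regex pass like this cannot tell a real marker from an identical line inside a heredoc or a multi-line string, which is one reason the in-grammar option might still be worth considering.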

@gonzus I would especially appreciate your thoughts on this.

gonzus commented 4 years ago

This is a difficult problem for which I have never seen a comprehensive solution. It basically boils down to parsing (or at least recognizing) more than one syntax in the same file. The same thing happens (to a more painful extreme maybe) in HTML files that embed JS and CSS, and even some other server-side language such as PHP or Perl.

Last time I looked into this in detail, the standard tools (yacc / bison and lex / flex) were starting to add basic support for this, but I stopped looking and I am sad to say I have no idea what the current level of support is for this.

The approach you propose is reasonable. However, just as "discarding white space while lexing" can be problematic for some tools (for example, those that need the exact location of a comment), you may need more smarts when discarding anything between __DATA__ and __END__, or everything after __END__. So I would go with your proposal and see how it goes.

xsawyerx commented 4 years ago

Thank you for your thoughts. I prepared an MR that implements this approach. It does not seem to mess up the location of elements in the file, because it simply comments the POD out.
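To illustrate why commenting POD out keeps locations intact, here is a rough sketch in Python (for illustration only, not the code from the MR). It assumes a POD block starts at a line beginning with `=` followed by a letter and ends at a line beginning with `=cut`; `comment_out_pod` is a hypothetical name.

```python
import re

# Assumed POD delimiters: a block opens on a line like "=pod" or "=head1"
# and closes on a line starting with "=cut".
POD_START = re.compile(r"^=[A-Za-z]")
POD_END = re.compile(r"^=cut\b")


def comment_out_pod(source: str) -> str:
    """Prefix every POD line with '#', leaving the line count unchanged."""
    out = []
    in_pod = False
    for line in source.splitlines(keepends=True):
        if not in_pod and POD_START.match(line):
            in_pod = True
        if in_pod:
            out.append("#" + line)
            if POD_END.match(line):
                in_pod = False
        else:
            out.append(line)
    return "".join(out)
```

Because every line is kept and only prefixed, code after the POD block stays on the same line number. Columns shift by one character inside the commented block only.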

When it comes to __DATA__ and __END__:

For these reasons, I imagine they wouldn't need their own parser, just to be removed prior to parsing.