Syntax: Stop parsing at __END__ and __DATA__

xsawyerx commented 4 years ago

It would be great if we could detect __END__ and __DATA__ and stop parsing then and there.

xsawyerx commented 4 years ago

I thought about this a lot. Here's where I'm at:

We could recognize __END__ and __DATA__.
However, these are known as "this is where the parser must stop" markers.
This means that recognizing them and parsing their content are two different things.

This means we have two options:

- We have a rule that handles them within Marpa:
  - If we see ^__DATA__$, we gobble everything until EOF or ^__END__$
  - If we see ^__END__$, we stop parsing entirely
- We remove these elements prior to parsing in Marpa, basically an input cleanup phase
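To make the two rules above concrete, here is a minimal sketch in Python (chosen only for illustration; it is neither Marpa nor Guacamole code). The name `split_source` is hypothetical. It assumes a marker counts only when it sits alone on a line, and it does not account for heredocs or multi-line strings that happen to contain such a line.

```python
import re

# A marker is the token alone on its own line, as in the rules above.
MARKER = re.compile(r"^__(END|DATA)__$", re.MULTILINE)
END_ONLY = re.compile(r"^__END__$", re.MULTILINE)


def split_source(text):
    """Return (code, data) for a source string.

    code: everything before the first marker line (what the grammar would see).
    data: the text gobbled after __DATA__, or None if there is no such section.
    """
    match = MARKER.search(text)
    if match is None:
        return text, None              # no marker: the whole file is code
    code = text[:match.start()]
    if match.group(1) == "END":
        return code, None              # __END__: stop parsing entirely
    # __DATA__: gobble everything until EOF or a later __END__ line.
    rest = text[match.end():]
    if rest.startswith("\n"):
        rest = rest[1:]                # drop the newline ending the marker line
    end = END_ONLY.search(rest)
    data = rest if end is None else rest[:end.start()]
    return code, data
```

The same function would also serve the second option: a cleanup phase can call it and hand only `code` to the parser.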

@gonzus I would especially appreciate your thoughts on this.

gonzus commented 4 years ago

This is a difficult problem for which I have never seen a comprehensive solution. It basically boils down to parsing (or at least recognizing) more than one syntax in the same file. The same thing happens (to a more painful extreme maybe) in HTML files that embed JS and CSS, and even some other server-side language such as PHP or Perl.

Last time I looked into this in detail, the standard tools (yacc / bison and lex / flex) were starting to add basic support for this, but I stopped looking and I am sad to say I have no idea what the current level of support is for this.

The approach you propose is reasonable. Still, just as "discarding white space while lexing" can be problematic for tools that need, say, the exact location of a comment, you may need more smarts when discarding anything between __DATA__ and __END__, or everything after __END__. So I would go with your proposal and see how it goes.

xsawyerx commented 4 years ago

Thank you for your thoughts. I prepared an MR that implements this approach. It does not seem to mess up the location of elements in the file, because it simply comments POD out.

When it comes to __DATA__ and __END__:

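The text after __DATA__ can be read through the bareword DATA filehandle, but otherwise it's not read by anything.
The text after __END__ is never parsed as code by Perl. It stops reading the file entirely when it encounters it (in the top-level script the remainder is still reachable through main::DATA).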

For these reasons, I imagine they wouldn't need their own parser, just to be removed prior to parsing.
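As an illustration of such a cleanup phase, the sketch below comments the trailer out instead of deleting it, so line numbers in the rest of the file stay the same. It is written in Python for illustration only and is not the code from the MR. The name `comment_out_trailer` is hypothetical, and the assumption is again that a marker must be alone on its line.

```python
import re

# Matches a line that consists solely of __END__ or __DATA__.
MARKER_LINE = re.compile(r"^__(?:END|DATA)__$")


def comment_out_trailer(text):
    """Prefix the first marker line and every line after it with '# '.

    Lines before the marker are returned untouched, and the number of
    lines is unchanged, so reported line numbers remain valid.
    """
    out = []
    in_trailer = False
    for line in text.splitlines(keepends=True):
        if not in_trailer and MARKER_LINE.match(line.rstrip("\r\n")):
            in_trailer = True
        out.append("# " + line if in_trailer else line)
    return "".join(out)
```

Column offsets inside the commented lines shift by two characters, which should not matter if nothing after the marker is ever parsed.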

xsawyerx / guacamole

Syntax: Stop parsing at __END__ and __DATA__ #21
