zevv / npeg

PEGs for Nim, another take
MIT License
330 stars 22 forks source link

Memorization & Bryan Ford #44

Closed keiviv closed 2 years ago

keiviv commented 2 years ago

Huge thanks for nPEG — love it a lot, even its quirky syntax — as it gets highlighted, unlike common PEG. Love your other work as well.

PEGs creator — Bryan Ford — has a page which (besides other things) lists PEG implementations in various languages. I suggest to ask him to add Nim and its both PEG implementations. His email and twitter are at the index.

In my opinion it would also be a good opportunity to poke his brain on implementing some memorization for your amazing nPEG. I know that contacting people is not everyone's favorite thing, but here are some reasons for it:

You might be busy or have other plans, but if there be time to improve nPEG in terms of memorization — contacting the dude can do no harm.

In any case — thank you, and best wishes.

zevv commented 2 years ago

Hey, thanks for the note.

I tried to contact Ford a few years back: I dropped him a mail with a short description of NPeg and a pretty grammar graph, but unfortunately no response as of yet. Let's ping him, he might still pick that up: @bford :)

As for the other subjects, I have a confession to make: I never read up on or implemented a Packrat parser, so I have no idea on how it compares to the VM implementation used by NPeg and I'm afraid Ford will just see me for the fraud I am :) Note that NPeg is basically just an implementation of Ierusalimschy's LPEG in Lua, albeit with an nice extra step of compiling the interpreter VM into native Lua, so there is not very much original work in it.

I basically consider NPeg "done" for now, as I can not think of any more features I would like to add. There are a few rough edges that people occasionally hit, mostly having to do with the API and backtracking, but I have no ideas on how to actually improve these issues at this time.

keiviv commented 2 years ago

I see. Well, let's hope @bford adds your NPeg and std PEGs to his Santa list of PEG-related Projects — along with Nim itself.

I thought you are still looking for performance improvements — after reading your note at the end of the readme:

Memoization: I guess it would be possible to add (limited) memoization to improve performance, but no clue where to start yet.

Thanks for the clarification. Bryan is still a great source on packrat — probably the greatest — if you ever get interested in doing it fully natively. And his info page references everything available on it, just in case.

keiviv commented 2 years ago

Some info from the papers. LPEG:

Back in 1972, Aho & Ullman presented an algorithm to parse any Generalized Top-Down Parsing Language (GTDPL, PEG ancestor’s formalism) in linear time. The algorithm is based on the determinism of these grammars: given a non-terminal and a subject’s position, either the non-terminal matches n characters or it fails. The algorithm builds a table that, for each pair 〈nonterminal × input position〉, stores the respective result of the match. This table can be filled backwards, from the end of the subject to its beginning, in constant (for a given grammar) time for each entry.

A serious inefficiency of the above algorithm is that most values that it computes are never used; for each input position, there is generally only a small number of nonterminals that can be considered to match there. In 2002, Ford proposed Packrat, an adaptation of the original algorithm that uses lazy evaluation to avoid that inefficiency.

Even with this improvement, however, the space complexity of the algorithm is still linear on the subject size (with a somewhat big constant), even in the best case. As its author himself recognizes, this makes the algorithm not befitting for parsing “large amounts of relatively flat” data. However, unlike parsing tools, regular-expression tools aim exactly at large amounts of relatively flat data.

To avoid these difficulties, we did not use the Packrat algorithm for LPEG. To implement LPEG we created a virtual parsing machine, not unlike Knuth’s parsing machine, where each pattern is represented as a program for the machine. The program is somewhat similar to a recursive-descendent parser (with limited backtracking) for the pattern, but it uses an explicit stack instead of recursion.

And Packrat:

Packrat parser can consume many times more working storage than the size of the original input text. For this reason there are some application areas in which packrat parsing is probably not the best choice. For example, for parsing XML streams, which have a fairly simple syntax but often encode large amounts of relatively flat, machine-generated data, the power and flexibility of packrat parsing is not needed and its storage cost is not justified.

On the other hand, for parsing complex modern programming languages in which the source code is usually written by humans and the top priority is the power and expressive- ness of the language, the space cost of packrat parsing is probably reasonable.

So the VM choice for LPEG was deliberate.