zevv / npeg

PEGs for Nim, another take
MIT License
330 stars 22 forks source link

Backreferences #7

Closed Varriount closed 5 years ago

Varriount commented 5 years ago

Being able to reference backreferences in match patterns would be nice, for the cases where something like bash heredocs are desired.

zevv commented 5 years ago

I tried implementing this by reusing captures, but that doesn't fit will because of the nature of captures in NPeg, as they can be nested and are consumed by code blocks. I guess some explicit mechanism would be needed to mark a pattern like a capture, so i can be referenced at a later time.

Do you happen to know of another PEG implementation that solves this? I'd be interested in the notation used.

Varriount commented 5 years ago

zevv: It appears this Ruby PEG library has backreferences - https://github.com/sander6/chomsky

The notation used is a function-call like syntax - captures are done using cap(), whiile references to those captures are done using ref().

A heredco is represented as rule :heredoc { (r(/[A-Z]+/) >= cap(:delim)) & _.* & ref(:delim) }

zevv commented 5 years ago

Ok, I made an implementation of back refs, but I'm not really happy with the result because of the added complexity given the limited functionality it offers, IMHO.

If you want to check it out: https://github.com/zevv/npeg/tree/backref

Usage looks like this:

  let p = peg "doc":
    S <- *Space
    doc <- +word * "<<" * Ref("sep", sep) * S * >heredoc * Backref("sep") * S * +word
    word <- +Alpha * S
    sep <- +Alpha
    heredoc <- +(1 - Backref("sep"))

This will match the following subject:

This is a <<EOT here document
  with multiple lines EOT end

and result in the capture here document\n with multiple lines

Note that the usage is clumsy: The here-doc leader is <<, after which a separator is matched (+Alpha in this case) and stored under the name ref. Then the heredoc rule is matched, which matches a sequence of characters which explicitly do not match the stored ref. Then the backref itself is matched, which completes the here document.

This works, but is not ideal, as the heredoc rule has to be explicit in not matching the heredoc separator string.

I do not understand how the Ruby peg library handles this, because I do not see anything similar in the example:

rule :heredoc { (r(/[A-Z]+/) >= cap(:delim)) & _.* & ref(:delim) }`

Lua's LPEG also supports back references, and it seems that here the are also explicit in not matching the terminator:

equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1

so maybe my current implementation is good enough.

zevv commented 5 years ago

Also, see

https://github.com/zevv/npeg/tree/backref#backreferences

for documentation.

zevv commented 5 years ago

Also, I do not really like the Ref() and Backref() syntax. Any ideas are welcome.

Varriount commented 5 years ago

I mean, it's up to you to judge whether the complexity is ultimately worth it - I will admit that dynamic tokens are not something that come up in many languages.

If you are looking for optimization ideas, then theoretically a you can use an an array instead of a table for the backreferences - the names only matter during compilation, so they can be rewritten to inidex references.

I was actually originally planning, as a workaround for the lack of functionality, to have my parser emit the beginning token (the start of the heredoc), then handle that parsing manually. It would have been somewhat clumsy, but would have worked.

zevv commented 5 years ago

Quoting Varriount (2019-05-29 07:12:59)

I mean, it's up to you to judge whether the complexity is ultimately worth it - I will admit that dynamic tokens are not something that come up in many languages.

Well, in hindsight and with some refactoring the complexity ended up not being too bad. I never had a use case for backrefs myself, but I do see their potential use cases.

If you are looking for optimization ideas, then theoretically a you can use an an array instead of a table for the backreferences - the names only matter during compilation, so they can be rewritten to inidex references.

True, but I do like the naming option, refering to backrefs by number makes for a much less readable grammar.

Anyway, give it a go and let me know if this works for you; with some more testing I guess I'll merge it and put it in the next release. I think I might change the Ref() and Backref() names though, as their verbosity kind of stands out with the rest of the syntax.

-- :wq ^X^Cy^K^X^C^C^C^C

Varriount commented 5 years ago

I mean, since the names are only needed for readability, they could be transparently converted to index references at compile time. Then, at runtime, a sequence or array could be used to store/retrieve captures.

zevv commented 5 years ago

Quoting Varriount (2019-05-30 08:08:32)

I mean, since the names are only needed for readability, they could be transparently converted to index references at compile time. Then, at runtime, a sequence or array could be used to store/retrieve captures.

Well, too late, the release is out :)

I merged the branch and released 0.11.0 yesterday. The implementation is pretty much ok, and the notation has been shortened to R() only:

R("name", ...)

to make a reference and

R("")

to refer to it later in the grammar. It works-for-me for as far as I whipped up some test cases, so we'll have to see how it holds up in real life.

Thanks!

(closes #7) <- does that work in email, github?

-- :wq ^X^Cy^K^X^C^C^C^C

zevv commented 5 years ago

Closed by ee7122cd3bb778fb0e608f3dfdf9023acd34d1c0