Closed Varriount closed 5 years ago
I tried implementing this by reusing captures, but that doesn't fit will because of the nature of captures in NPeg, as they can be nested and are consumed by code blocks. I guess some explicit mechanism would be needed to mark a pattern like a capture, so i can be referenced at a later time.
Do you happen to know of another PEG implementation that solves this? I'd be interested in the notation used.
zevv: It appears this Ruby PEG library has backreferences - https://github.com/sander6/chomsky
The notation used is a function-call like syntax - captures are done using cap()
, whiile references to those captures are done using ref()
.
A heredco is represented as rule :heredoc { (r(/[A-Z]+/) >= cap(:delim)) & _.* & ref(:delim) }
Ok, I made an implementation of back refs, but I'm not really happy with the result because of the added complexity given the limited functionality it offers, IMHO.
If you want to check it out: https://github.com/zevv/npeg/tree/backref
Usage looks like this:
let p = peg "doc":
S <- *Space
doc <- +word * "<<" * Ref("sep", sep) * S * >heredoc * Backref("sep") * S * +word
word <- +Alpha * S
sep <- +Alpha
heredoc <- +(1 - Backref("sep"))
This will match the following subject:
This is a <<EOT here document
with multiple lines EOT end
and result in the capture here document\n with multiple lines
Note that the usage is clumsy: The here-doc leader is <<
, after which a separator
is matched (+Alpha in this case) and stored under the name ref
. Then the heredoc
rule is matched, which matches a sequence of characters which explicitly do not
match the stored ref. Then the backref itself is matched, which completes the
here document.
This works, but is not ideal, as the heredoc
rule has to be explicit in not matching
the heredoc separator string.
I do not understand how the Ruby peg library handles this, because I do not see anything similar in the example:
rule :heredoc { (r(/[A-Z]+/) >= cap(:delim)) & _.* & ref(:delim) }`
Lua's LPEG also supports back references, and it seems that here the are also explicit in not matching the terminator:
equals = lpeg.P"="^0
open = "[" * lpeg.Cg(equals, "init") * "[" * lpeg.P"\n"^-1
close = "]" * lpeg.C(equals) * "]"
closeeq = lpeg.Cmt(close * lpeg.Cb("init"), function (s, i, a, b) return a == b end)
string = open * lpeg.C((lpeg.P(1) - closeeq)^0) * close / 1
so maybe my current implementation is good enough.
Also, I do not really like the Ref()
and Backref()
syntax. Any ideas are welcome.
I mean, it's up to you to judge whether the complexity is ultimately worth it - I will admit that dynamic tokens are not something that come up in many languages.
If you are looking for optimization ideas, then theoretically a you can use an an array instead of a table for the backreferences - the names only matter during compilation, so they can be rewritten to inidex references.
I was actually originally planning, as a workaround for the lack of functionality, to have my parser emit the beginning token (the start of the heredoc), then handle that parsing manually. It would have been somewhat clumsy, but would have worked.
Quoting Varriount (2019-05-29 07:12:59)
I mean, it's up to you to judge whether the complexity is ultimately worth it - I will admit that dynamic tokens are not something that come up in many languages.
Well, in hindsight and with some refactoring the complexity ended up not being too bad. I never had a use case for backrefs myself, but I do see their potential use cases.
If you are looking for optimization ideas, then theoretically a you can use an an array instead of a table for the backreferences - the names only matter during compilation, so they can be rewritten to inidex references.
True, but I do like the naming option, refering to backrefs by number makes for a much less readable grammar.
Anyway, give it a go and let me know if this works for you; with some more testing I guess I'll merge it and put it in the next release. I think I might change the Ref() and Backref() names though, as their verbosity kind of stands out with the rest of the syntax.
-- :wq ^X^Cy^K^X^C^C^C^C
I mean, since the names are only needed for readability, they could be transparently converted to index references at compile time. Then, at runtime, a sequence or array could be used to store/retrieve captures.
Quoting Varriount (2019-05-30 08:08:32)
I mean, since the names are only needed for readability, they could be transparently converted to index references at compile time. Then, at runtime, a sequence or array could be used to store/retrieve captures.
Well, too late, the release is out :)
I merged the branch and released 0.11.0 yesterday. The implementation is pretty much ok, and the notation has been shortened to R() only:
R("name", ...)
to make a reference and
R("")
to refer to it later in the grammar. It works-for-me for as far as I whipped up some test cases, so we'll have to see how it holds up in real life.
Thanks!
(closes #7) <- does that work in email, github?
-- :wq ^X^Cy^K^X^C^C^C^C
Closed by ee7122cd3bb778fb0e608f3dfdf9023acd34d1c0
Being able to reference backreferences in match patterns would be nice, for the cases where something like bash heredocs are desired.