zevv / npeg

PEGs for Nim, another take
MIT License
330 stars 22 forks source link

AST building, another take #31

Closed liquidev closed 2 years ago

liquidev commented 3 years ago

Currently, AST building with NPeg is quite difficult and bug-prone. You have to maintain a stack of nodes and use a bunch of relatively inefficient workarounds to build a nice, inspectable, traversable abstract syntax tree.

My proposal is to allow the user to specify their own Node type, and add primitives for building abstract syntax trees. Here's my idea for the API:

import std/strutils

type
  NodeKind = enum
    nkNumber, nkIdent, nkCall

  Node = ref object
    case kind: NodeKind
    of nkNumber: number: float
    of nkIdent: ident: string
    of nkCall:
      fn: Node
      args: seq[Node]

let p = peg(expr, ast = Node):
  ident <- {'a'..'z', '_'} * *{'a'..'z', '_', '0'..'9'}:
    # perhaps we could assign to result but i don't know
    # whether NPeg reserves it or not
    build Node(kind: nkIdent, ident: $0)
  number <- +Digit:
    build Node(kind: nkNumber, number: parseFloat($0))
  # AST captures would be done with A(), referencing
  # a capture inside of the code block would be done with node(n)
  # NPeg could detect whether the capture would result in
  # multiple nodes and expose nodes(n) which returns a
  # seq[Node] instead of a single Node
  call <- A(ident) * '(' * A(expr * *(',' * expr)) * ')':
    build Node(kind: nkCall, fn: node(1), arg: nodes(2))
  expr <- call | ident | number

Usage of node(n) on a sequence capture would result in either an error or the first node being returned. Usage of nodes(n) on a singular capture would result in an error or a sequence with only that node being returned.

zevv commented 3 years ago

I do like the proposal and how clean the use is, just some random thoughts as they bubble up:

liquidev commented 3 years ago

I'll justify the second argument by saying that parsing structured grammars into AST is a very common use case when building scripting languages (which is a thing at least I do), and also parsing data formats (such as JSON). As for the first argument, I'd say that this:

call <- A(ident) * '(' * A(expr * *(',' * expr)) * ')':
  build Node(kind: nkCall, fn: node(1), arg: nodes(2))

is much more cleaner than the equivalent manual stack method, which is what I use in cflang, putting aside the fact that cflang's grammar for calls is a bit different than your usual fn(a, b, c):

  callArgs <- expr * *([tkComma] * expr)
  call <- (callOp * ?callArgs) ^ 1:
    var args: seq[Node]
    while not (p.top.kind == nkCall and not p.top.done):
      args.add p.pop
    reverse args
    p.top.sons.add(args)
    p.top.done = true

Compared to my proposed solution, you're inducing extra cognitive load on the user to try and understand where the stack elements come from, and what is done to them. In my version, it's immediately obvious that the AST captures are used.

zevv commented 3 years ago

I hear you, I do. This is how I do the above, though:

https://github.com/zevv/mathpipe/blob/master/mp.nim#L60

You'll have to agree that does not look too bad.

liquidev commented 3 years ago

I personally do not like the placeholder idea, it feels quite flaky and easy to mess up like I did.

Varriount commented 3 years ago

While I personally think the current functionality is fine as-is, I could see someone building an "extension" module that builds on top of NPeg to provide AST generation capabilities.

zevv commented 2 years ago

Closing as part of the spring cleaning, feel free to reopen if needed.