yhirose / cpp-peglib

A single file C++ header-only PEG (Parsing Expression Grammars) library
MIT License
916 stars 113 forks source link

Improving AST optimizer #230

Closed mingodad closed 2 years ago

mingodad commented 2 years ago

After been bitten by the AST optimizer several times and trying to understand it I think found one possible problem/improvement on it, giving a simple grammar in 2 styles. Style 1 explicit white space on token usage:

expr <- _ name _ (opcmp _ name _)* !.
~_ <- [ \t\n\r]*
opcmp <- ('==' / '~=' / '<=' / '>=' / '<' / '>' )
name <- [a-z]+

AST:

+ expr
  - name (e)
  - opcmp/0 (==)
  - name (c)

Style 2 white space embedded on tokens:

expr <- _ name (opcmp name)* !.
~_ <- [ \t\n\r]*
opcmp <- ('==' / '~=' / '<=' / '>=' / '<' / '>' ) _
name <- [a-z]+ _

AST:

+ expr
  + name
  + opcmp
  + name

It seems that the optimizer decide to show or not the node value based on number of child:

struct AstOptimizer {
  AstOptimizer(bool mode, const std::vector<std::string> &rules = {})
      : mode_(mode), rules_(rules) {}

  template <typename T>
  std::shared_ptr<T> optimize(std::shared_ptr<T> original,
                              std::shared_ptr<T> parent = nullptr) {
...
    if (opt && original->nodes.size() == 1) {  //////!!!!!!!<<<<< here it seems to decide ???????
      auto child = optimize(original->nodes[0], parent); /////!!!!!!<<<< here using the first child
      auto ast = std::make_shared<T>(*child, original->name.data(),
                                     original->choice_count, original->position,
                                     original->length, original->choice);
      for (auto node : ast->nodes) {
        node->parent = ast;
      }
      return ast;
    }
...
};

I think that it should use the number of non-ignorable child that when using white space embedded on tokens, see bellow pseudo code.

Pseudo code:

struct AstOptimizer {
  AstOptimizer(bool mode, const std::vector<std::string> &rules = {})
      : mode_(mode), rules_(rules) {}

  template <typename T>
  std::shared_ptr<T> optimize(std::shared_ptr<T> original,
                              std::shared_ptr<T> parent = nullptr) {
...
    size_t non_ignorable_size = 0;
    size_t non_ignorable_idx = 0;
    for (auto node : original->nodes) {
        if(!node.isHidden) {
           ++non_ignorable_size;
           non_ignorable_idx = this_idx;
        }
    }    
    if (opt && non_ignorable_size == 1) { ///!!!!!<<<<<< Use non_ignorable_size
      auto child = optimize(original->nodes[non_ignorable_idx], parent); ///!!!!!<<<<<< Use non_ignorable_idx
      auto ast = std::make_shared<T>(*child, original->name.data(),
                                     original->choice_count, original->position,
                                     original->length, original->choice);
      for (auto node : ast->nodes) {
        node->parent = ast;
      }
      return ast;
    }
...
};
mingodad commented 2 years ago

Same example on chpeg playground https://meimporta.eu/chpeg/: Style 1 explicit white space on token usage:

expr <- _ name _ (opcmp _ name _)* !.
_ {I} <- [ \t\n\r]*
opcmp <- ('==' / '~=' / '<=' / '>=' / '<' / '>' )
name <- [a-z]+

AST:

 OFFSET   LEN     ID NC FLG IDENT "DATA"
     0      6      0  3 R-- expr "e == c"
     0      1      3  0 ---   name "e"
     2      2      2  0 ---   opcmp "=="
     5      1      3  0 ---   name "c"

Style 2 white space embedded on tokens:

expr <- _ name (opcmp name)* !.
_ {I} <- [ \t\n\r]*
opcmp <- ('==' / '~=' / '<=' / '>=' / '<' / '>' ) _
name <- [a-z]+ _

AST:

 OFFSET   LEN     ID NC FLG IDENT "DATA"
     0      6      0  3 R-- expr "e == c"
     0      2      3  0 ---   name "e "
     2      3      2  0 ---   opcmp "== "
     5      1      3  0 ---   name "c"
yhirose commented 2 years ago

In the 2nd example, the token operator <...> should be used if you want the same result as the 1st one.

expr  <- _ name (opcmp name)* !.
opcmp <- < '==' / '~=' / '<=' / '>=' / '<' / '>' > _
name  <- < [a-z]+ > _
~_    <- [ \t\n\r]*