Using a custom VM for parsing

gvanrossum commented 4 years ago

This is a thought I've had in my head for quite a while now. This is not a fully worked-out design, but I think I can do it. The big question is, will it actually be faster? There's overhead for opcode decoding; but there will be significantly less code, and more data and code locality.

My VM design has a stack of Frames representing node invocations, corresponding to the C call stack in the current version (unlike the Python VM, which still uses C stack frames). Each Frame has a reference to a Rule, and some current state including the mark, the cut flag, the current alternative, the next item in that alternative, and list of values to pass into the action. All state that changes during execution of the parser is contained in some Frame, or in the Parser construct (which is unchanged from the current design). A new Frame initialized to all zeros except for the Rule pointer and the initial mark represents the initial state.

Rules are static read-only data structures containing a few flags (mostly related to looping constructs) and an array of Alternatives.

Alternatives are static read-only data structures containing an array of Items and an Action.

Items are actually arrays of opcodes and arguments, or something like that. Most operations either expect a token (by token type) or a rule (by rule ID), but there are a few special opcodes to help with lookaheads, loops, cut and the like (I haven't fully researched the set of special opcodes we'll need, but I think it'll be a handful). The final opcode of each Alternative is always ACTION, which causes the Alternative's Action to be executed, passing it the collected values.

Actions are represented as functions (or perhaps a single big C function with a giant switch).

The main execution switch is something like this (I hope you can imagine the data structures from the description above and the code below):

top:
  opcode = frame->Rule.alts[frame->alt].items[frame->item++];
  switch (opcode & 0xff) {
  case STRING: v = string_token(p); break;
  // A few others...
  case CUT: frame->cut = 1; goto top;
  case TOKEN: v = expect_token(p, opcode>>8); break;
  case RULE:
    frame = push_new_frame_on_stack(p, rules[opcode>>8]);
    goto top;
  case ACTION:  // Always the last opcode in an alternative
    v = call_action(p, opcode>>8, frame->vals, frame->val);
    frame = pop_frame_from_stack(p);
    goto below;
  }

below:
  if (v) {  // This could be inlined in the cases above
    frame->vals[frame->val++] = v;
    goto top;
  }

next:
  if (frame->cut) goto fail;
  frame->alt++;
  frame->item = 0;
  p->mark = frame->mark;
  if (!frame->Rule.alts[frame->alt]) goto fail;
  goto top;

fail:
  frame = pop_frame_from_stack(p);
  goto next;

I have though of a few other opcodes needed related to loops, lookaheads and handling the start rule and overall success or failure; I'll provide more detail later.

gvanrossum commented 4 years ago

Status update: See the pegenvm branch in we-like-parsing. The sketch above doesn't really do it justice, and contains several subtle bugs (for example, it doesn't manipulate mark on failure). Anyway, I changed a few things about the design. But I've got it parsing. ~Actions haven't been implemented yet; when it succeeds parsing the input, it will crash trying to convert a dummy value (void *)1 to an AST.~

~Note: there's no code generation yet; first I want to get the little VM working.~

gvanrossum commented 4 years ago

Status:

[x] Cut
[x] Actions
[x] Optional tokens
[x] Optional rules
[x] Line numbers and column offsets
[x] Loop
[x] Loop, at-least-once
[x] Loop, delimited b.a+
[x] Lookaheads
[x] Memoization
[x] Performance testing on xxl.py
[x] Left-recursion
[x] Selective memoization
[ ] Unit testing
[ ] Infrastructure
[x] Code generation (basic)
[ ] Code generation (as a new subcommand, vm)
[ ] Lots of cleanup

gvanrossum commented 4 years ago

Okay, the first timing results are in! For our infamous xxl.py file:

old parser: 4.47 seconds
new parser: 3.22 seconds
vm parser: 1.84 seconds

Now, the vm parser currently uses a much simpler grammar, because I haven't got parser generating working yet. But still, this gives me hope, because there's also still some perf tuning to be done.

gvanrossum commented 4 years ago

Some more systematic timings (note: CPython compiled --with-pydebug):

old parser: 4.49
new parser: 3.33
vm parser: 2.00

gvanrossum commented 4 years ago

Even more exciting numbers without pydebug!

old parser: 1.26
new parser: 1.19
vm parser: 0.65

pablogsal commented 4 years ago

I am starting to do some benchmarks with xxl.py and this are some initial results using Linux perf tool:

Screenshot from 2020-06-03 09-45-17

Interestingly, checking that if a particular rule is memorized is one of the most expensive parts due to the linked list traversal if I understand correctly.

Seems that the switch itself and the branch for the negative lookahead are the most expensive parts of the VM code:

Screenshot from 2020-06-03 09-46-54

pablogsal commented 4 years ago

Hummmm, it seems that the slowdown is due to the extra calls to _PyPegen_is_memoized. Parsing xxl.py the new parser without the VM calls _PyPegen_is_memoized 40373060 times and the VM parser does it 66264588 times. This is 64% more!! And looking at the previous data, _PyPegen_is_memoized is one of the bottlenecks of both parsers!

gvanrossum commented 4 years ago

The reason is that the VM doesn't yet have selective memoization!

gvanrossum commented 4 years ago

Hm, I implemented selective memoization, and I now measure 318095942 calls. But that's still 8 times your 40373060. What's wrong here?

pablogsal commented 4 years ago

Hm, I implemented selective memoization, and I now measure 318095942 calls. But that's still 8 times your 40373060. What's wrong here?

Can you confirm that with the new non-vm parser you also get 40373060 calls?

gvanrossum commented 4 years ago

Not even sure how to do that. :-(

pablogsal commented 4 years ago

Not even sure how to do that. :-(

I added a new attribute to the parser p->number_of_calls, initialize it to 0, add 1 every time we call _PyPegen_is_memoized and print it after the parsing has finished in _PyPegen_run_parser_from_string after calling _PyPegen_vmparser.

For the non-vm parser, add also a print in _PyPegen_run_parser_from_string after calling _PyPegen_run_parser.

gvanrossum commented 4 years ago

Is that code checked in? I was using PyPegen_get_memo_statistics() but it returns a table per rule ID.

gvanrossum commented 4 years ago

I'm sorry, I am running out of time -- I need to pack for my trip.

pablogsal commented 4 years ago

Is that code checked in?

No, but you can use this patch:

diff --git a/Parser/pegen/pegen.c b/Parser/pegen/pegen.c
index 1b4afcec60..715bd58b08 100644
--- a/Parser/pegen/pegen.c
+++ b/Parser/pegen/pegen.c
@@ -681,6 +681,7 @@ _PyPegen_get_memo_statistics()
 int  // bool
 _PyPegen_is_memoized(Parser *p, int type, void *pres)
 {
+    p->calls++;
     if (p->mark == p->fill) {
         if (_PyPegen_fill_token(p) < 0) {
             p->error_indicator = 1;
@@ -1055,6 +1056,7 @@ _PyPegen_Parser_New(struct tok_state *tok, int start_rule, int flags,
     p->mark = 0;
     p->fill = 0;
     p->size = 1;
+    p->calls = 0;

     p->errcode = errcode;
     p->arena = arena;
@@ -1196,9 +1198,11 @@ _PyPegen_run_parser_from_string(const char *str, int start_rule, PyObject *filen

     if (parser_flags & PyPARSE_VMPARSER) {
         result = _PyPegen_vmparser(p);
+        printf("VM parser number of calls: %i\n", p->calls);
     }
     else {
         result = _PyPegen_run_parser(p);
+        printf("Normal parser number of calls: %i\n", p->calls);
     }
     _PyPegen_Parser_Free(p);

diff --git a/Parser/pegen/pegen.h b/Parser/pegen/pegen.h
index 2f9ee7116d..8e3336350c 100644
--- a/Parser/pegen/pegen.h
+++ b/Parser/pegen/pegen.h
@@ -74,6 +74,7 @@ typedef struct {
     growable_comment_array type_ignore_comments;
     Token *known_err_token;
     int level;
+    int calls;
 } Parser;

diff --git a/my_test.py b/my_test.py
new file mode 100644
index 0000000000..904392f526
--- /dev/null
+++ b/my_test.py
@@ -0,0 +1,7 @@
+import _peg_parser
+
+with open('./Tools/peg_generator/xxl.py') as datafile:
+    data = datafile.read()
+
+_peg_parser.parse_string(data, vmparser=True)
+_peg_parser.parse_string(data)

and then do:

❯ ./python.exe my_test.py
VM parser number of calls: 66264588
Normal parser number of calls: 40373060

I'm sorry, I am running out of time -- I need to pack for my trip.

No problem. Have a great trip!

gvanrossum commented 4 years ago

I get

VM parser number of calls: 48948561
Normal parser number of calls: 40373060

Better than before but still more than the "new" parser.

gvanrossum commented 4 years ago

Moreover when I count insert_memo calls instead of is_memoized calls, I get

VM parser number of calls:       12045225
Normal parser number of calls:    8141218

Perhaps the VM parser has more rules?

gvanrossum commented 4 years ago

Certainly the parsers don't have exactly the same set of synthesized rules (groups, loop etc.).

For example the VM implements b.a+ differently -- it is effectively (a b)* a which means the final iteration gets halfway into (a b), fails to find b, backtracks, and then re-parses to get the single a. Hopefully a is memoized.

I can't rule out bugs in the memoization code either. Or possibly the list of memoized rules is differently...?

gvanrossum commented 4 years ago

Hm, one concern: OP_RETURN_LEFT_REC calls is_memoized to recover the saved position, while the non-vm PEG parser has this in a local variable. This accounts for roughly the difference is is_memoized calls.

gvanrossum commented 4 years ago

Okay, with this adjustment I measure the same number of calls to is_memoized (one more, presumably for the root).

However I still see 50% more insert_memo calls. Running out of time for the investigation though. I'll see if I can push the changes I have so far.

gvanrossum commented 4 years ago

In an offline moment I realized the cause is that pop_frame() calls insert_memo() for memoizing rules, and that includes leftrec rules.

I think I have a way to do leftrec without extra or modified opcodes, instead having a separate flag per rule indicating left-recursiveness. It would mean that push_frame() and pop_frame() and and OP_RETURN and 1-2 other places would have to check this flag, but presumably this is still a lot cheaper than calling insert_memo() redundantly.

I do worry whether the project is doomed if the memo operations dominate the runtime this much.

gvanrossum commented 4 years ago

Alas, after making it use the full grammar, it's actually slower:

old parser: 1.33
new parser: 1.22
vm parser: 2.05

I'm going to give up on this idea.

we-like-parsers / cpython

Using a custom VM for parsing #128