wincent / masochist

⛓ Website infrastructure for over-engineers

Optimize static lexer #207

Closed wincent closed 1 year ago

wincent commented 1 year ago

Various ideas for speeding up the statically generated lexer.

Use if instead of switch

Older browser engines seemed to struggle to optimize large switch statements. Experimenting locally (editing the case statements to use else if instead) shows that this may no longer be the case.

Done: in 68e72e8c4.
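
For reference, a minimal sketch of the shape of that change (simplified; not the actual generated code, whose branch bodies are much larger):

// Before: dispatch on the current state via one large switch.
switch (state) {
  case 0:
    // ...transitions out of state 0...
    break;
  case 1:
    // ...transitions out of state 1...
    break;
  // ...another ~50 cases...
}

// After: the same dispatch expressed as a chain of comparisons.
if (state === 0) {
  // ...transitions out of state 0...
} else if (state === 1) {
  // ...transitions out of state 1...
} // ...another ~50 branches...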

Minimize property accesses

Avoid repeated reading and writing of this properties. Again, experimenting locally suggests there is little to gain here.

Done: in 5b9c79ad9 for a 2.2% speed-up (although there is still more that could be done).
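
As a sketch of the kind of change meant here, hoisting the hot fields into locals for the duration of the scan (the input, index, and state names are loosely based on the generated code quoted later in this issue; the surrounding loop is simplified):

// Before (roughly): every step reads and writes properties on `this`.
//
//   while (this.index < this.input.length) {
//     const ch = this.input.charCodeAt(this.index);
//     // ...branch on this.state, assign this.state, this.index++...
//   }

// After: keep the hot values in locals and write them back only when
// leaving the loop or returning a token.
const input = this.input;
let state = this.state;
let index = this.index;
while (index < input.length) {
  const ch = input.charCodeAt(index);
  // ...branch on `state` and `ch`, update `state`, advance `index`...
  index++;
}
this.state = state;
this.index = index;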

Decompose into many small functions

These might optimize sooner than one large function. Untested.

Update (2023-08-06): Did a small experiment on this.

Extracting a tiny method like this:

ignore() {
  this.tokenStart = this.index;
  this.state = START;
}

makes things quite a bit slower (756ms vs 710ms), so it's not clear what the benefit might be, given that I expect the performance characteristics to change quite a bit from method to method.

Done: in cafdc4fd0 for an 8.1% speed-up.

Order conditionals based on state machine and grammar characteristics

The full lexer has over 50 states at the time of writing, but to illustrate, consider this simple machine:

Simple state machine diagram

The example is too trivial to actually be interesting, but we'll go through the exercise anyway. If the state machine compiles down to a big conditional of the form:

if (state === 0) {
  // ...
} else if (state === 1) {
  // ...
} else if (state === 2) {
  // ...
} else if (state === 3) {
  // ...
} else if (state === 4) {
  // ...
} else if (state === 5) {
  // ...
} else {
  // ...
}

then on entering the state machine we will take the first branch on recognizing the first character, and each subsequent time through the conditional we will have to try one more branch until we get to the state we actually care about visiting.

That is, to scan an input like \u1234, we'll take the 1st branch, then the 2nd, then 3rd, then 4th, then 5th.

In an ideal world, we would always put the "most likely" branches near the start of the conditional. Now, in this trivial machine, there is no way we can really do that; moving any state closer to the front just ends up penalizing some other state. But in the real machine, there are states that the machine will typically spend more time in (states related to NAME tokens or whitespace are obvious candidates), and these could be put at the front.

Some can probably be identified by heuristics (like the number of immediate inbound edges, or transitive "distance" from the initial state) or by explicit annotations in the lexer definition (ie. hints added by a human). The latter could possibly be derived from an analysis of GraphQL language samples, so as to identify the most common tokens.
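
A sketch of the inbound-edges heuristic as it might run at code-generation time (the flat {from, to} transition list here is hypothetical; masochist's actual intermediate representation may differ):

// Emit branches for heavily-targeted states first, so hot states are
// reached after fewer comparisons in the generated conditional.
function orderStatesByInboundEdges(states, transitions) {
  const inbound = new Map(states.map((state) => [state, 0]));
  for (const {to} of transitions) {
    inbound.set(to, (inbound.get(to) ?? 0) + 1);
  }
  return [...states].sort((a, b) => inbound.get(b) - inbound.get(a));
}

Under that ordering, states with many edges leading back into them (the NAME and whitespace loops, for example) would float toward the front of the chain.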

Inline or "flatten" states

There are lots of states in the lexer where, once you are in them, there are only a few transitions you can take, and anything else will cause the input to be rejected.

To draw on our previous example, once you've seen \u, you have to see 4 hexadecimal digits; anything else is an error. At the time of writing, this is expressed in the generated lexer as follows (in pseudo-code):

// There are a couple of states that can reach state 37, but both
// of them involve seeing a `\u` sequence, then... we have to
// recognize 4 hexadecimal digits:
if (state === 37) {
  if (ch in [A-Fa-f0-9]) {
    goto state 45
  } else {
    reject
  }
} else if (state === 45) {
  if (ch in [A-Fa-f0-9]) {
    goto state 49
  } else {
    reject
  }
} else if (state === 49) {
  if (ch in [A-Fa-f0-9]) {
    goto state 53
  } else {
    reject
  }
} else if (state === 53) {
  if (ch in [A-Fa-f0-9]) {
    goto state 26
  } else {
    reject
  }
}

In other words, instead of multiple trips through the main conditional, visiting all the branches, this could be (again, in pseudo-code):

if (state === 37) {
  if (ch in [A-Fa-f0-9]) {
    // Inline state 45
    index++
    if (ch in [A-Fa-f0-9]) {
      // Inline state 49
      index++
      if (ch in [A-Fa-f0-9]) {
        // Inline state 53
        index++
        if (ch in [A-Fa-f0-9]) {
          goto state 26
          continue
        }
      }
    }
  }
  reject
}

If the inlineable states are simple enough we could potentially flatten the nesting there like so:

if (state === 37) {
  if (
    ch in [A-Fa-f0-9] &&
    input.at(index + 1) in [A-Fa-f0-9] &&
    input.at(index + 2) in [A-Fa-f0-9] &&
    input.at(index + 3) in [A-Fa-f0-9]
  ) {
    goto state 26
  } else {
    reject
  }
}

As a first pass, we can identify places in the code where we can do this by looking for states of this shape:

if (ch === expected)
  goto state X
else
  reject

and can inline any state "X" which isn't the target of another transition. Inlining might even be ok for states which are reachable from other transitions (ie. we could tolerate some duplication, if it doesn't blow up exponentially). Would need to test this carefully because I do suspect that function size is an important determinant of optimizability for the engine.

For some prior art, I added some inlining in f400e2c166, but I think we can do a lot more.
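
Along the same lines as the ordering heuristic above, a first-pass candidate check could be computed from the transition list (again assuming a hypothetical flat {from, to} representation): a state qualifies when it is the target of exactly one transition, so inlining it duplicates nothing.

// Return the set of states with exactly one inbound edge; these can be
// folded into the single branch that reaches them.
function inlineCandidates(transitions) {
  const inbound = new Map();
  for (const {to} of transitions) {
    inbound.set(to, (inbound.get(to) ?? 0) + 1);
  }
  const candidates = new Set();
  for (const [state, count] of inbound) {
    if (count === 1) {
      candidates.add(state);
    }
  }
  return candidates;
}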

Use WebAssembly

Just a thought.

wincent commented 1 year ago

Loop inlining

Related to the idea of inlining, the main comment rule is ripe for inlining — as soon as it sees a # it goes into a loop in state 6:

            } else if (state === 6) {
                if (ch === 0x09 || ch >= 0x20 && ch <= 0xffff) {
                    /* Empty. */ ;
                }
                else {
                    // IGNORED token.
                    this.tokenStart = index;
                    this.state = 0;
                    continue;
                }
            } else if (state === 11) {

Same for the whitespace state (although this is only state 1, so the number of branches you have to visit to get to it is not so egregious):

            } else if (state === 1) {
                if (ch === 0x09 || ch === 0x20) {
                    /* Empty. */ ;
                }
                else {
                    // IGNORED token.
                    this.tokenStart = index;
                    this.state = 0;
                    continue;
                }
            } else if (state === 2) {

One of the NAME accept states, 18, is also ripe for inlining:


Again we see that wherever I am doing /* Empty. */ I am just scanning forward. I have 10 of these places in the current lexer[^reordered]:

            if (state === 18) {
                if (ch >= 0x30 && ch <= 0x39 || ch >= 0x41 && ch <= 0x5a || ch === 0x5f || ch >= 0x61 && ch <= 0x7a) {
                    /* Empty. */ ;
                }
                else {
                    const token = new Token_1.default('NAME', this.tokenStart, index, input);
                    this.tokenStart = index;
                    this.state = 0;
                    return token;
                }
            } else if (state === 0) {

[^reordered]: Note that the reason this is the first state (an if and not an else if) is because I manually moved it in the built output to do some experimentation around reordering more likely states to the front.

Done: in 66084b475 for a 6% speed-up (an initial version, at least).
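
For illustration, the loop-inlined shape of the comment rule could look something like this (a sketch of the idea only, reusing the index and input locals from the excerpts above; not necessarily what 66084b475 actually emits):

} else if (state === 6) {
    // Consume the rest of the comment body in a tight inner loop instead
    // of taking a full trip through the state dispatch per character.
    while (index < input.length) {
        const next = input.charCodeAt(index);
        if (next === 0x09 || (next >= 0x20 && next <= 0xffff)) {
            index++;
        } else {
            break;
        }
    }
    // IGNORED token.
    this.tokenStart = index;
    this.state = 0;
    continue;
}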

wincent commented 1 year ago

Going to close this as the static lexer now beats the reference lexer. Can always do more later, if I want to scratch the itch.