Code size reduction for advance method

headius commented 1 year ago

See #871 for longer discussion about the size of the advance method and the problems it causes for JIT-enabled runtimes.

This PR is a start at "outlining" (as opposed to inlining) repetitive code blocks from the advance method by moving them into utility methods. Each commit details the change made plus the reduction in the F1 advance instruction count on JRuby 9.4.

My process for outlining, for anyone that wishes to reproduce it:

Use the flay gem to analyze the code, looking for duplicate sections of code.
Example the findings, ignoring tiny items (simple ivar assignments for example) and blocks of code that have state transitions or multiple "out" variables.
Outline as much of the remaining duplicate code blocks as possible.

At the time of this PR creation, it reduces the F1 advance size from 19235 to 15573. It may improve performance on JRuby slightly, but the method is still too large to fully JIT.

This PR also reduces the size of the F0 advance method from 11691 to 10235—still far too large to JIT but a good reduction.

headius commented 1 year ago

There's lots of repetition of these escape calls, but I'm unsure whether there's enough to warrant outlining them. For example:

@escape = encode_escape(tok(p - 2, p).to_i(16))

...appears 9 times. Reducing it to a single call would drop maybe a half-dozen instructions on JRuby, so outlining just this only reduces the instruction count by a few dozen. In aggregate, however, outlining similar sequences may amount to hundreds of instructions.

iliabylich commented 1 year ago

Sorry to say but it makes code cryptic. Also there's no way to "call" any Ragel DSL from extracted methods without inlining them again into actions, so I can't promise that it will not get re-written back in a month.

iliabylich commented 1 year ago

Actually, some parts look good enough to be merged as is (extracting constants, turning the whole action into a single method), but all methods that take (or even return) p look too complex, not because it's complex on its own, but because of how much context should be tracked to understand and change anything, and now there's more.

Does Ragel support ivars for p? Let me check it quickly first. If it doesn't hurt performance most of you changes will be OK, other locals can also be turned into ivars hopefully.

iliabylich commented 1 year ago

It's variable p @p;, with a some local changes I was able to run a full test suite, benchmarks (parsing test_parser.rb):

Before:

0.26340699987486005
0.24227400007657707
0.2389680000487715
0.25204400019720197
0.24021800002083182
0.2268710001371801
0.22673400002531707
0.2313620001077652
0.23082199995405972
0.23081399989314377
0.22557200002484024
0.2313079999294132
0.22624099999666214
0.2364000000525266
0.2290109999012202

After:

0.23141700006090105
0.22681099991314113
0.24080900009721518
0.22510300017893314
0.23191100009717047
0.2234749998897314
0.23476299992762506
0.2259360000025481
0.22546899993903935
0.23276899987831712
0.22822699998505414
0.22566900006495416
0.22536400007084012
0.2339080001693219
0.22188700013794005

Looks like there's no difference, so I'll send a PR in a minute. With this change it'll be possible to convert most (if not all) actions into methods and call them with %{ action } instead of %action (i.e. instead of going to action for the source code we'll go to the method definition), which sounds good enough for me.

iliabylich commented 1 year ago

I've merged #900 , so p is now @p everywhere, and taking and passing it back is not necessary anymore. Could you rebase you PR please? I understand that my change affected your branch a lot, so if you have no time I could do prepare a PR myself.

Also current_literal is taken from literal everywhere which is a method, so it doesn't have to be passed as well.

headius commented 1 year ago

The change you made makes this PR impossible to rebase, unfortunately. Nearly every commit breaks.

Could you undo that and reapply it on top of mine?

FWIW, @p will have more overhead on all implementations than just passing p so we might reduce code complexity slightly but probably end up slower.

If there's somewhere we could live chat about this it would be a great help. I have just pushed the last change I intend to make using this "flay" methodology, but there are other less-valuable changes possible.

iliabylich commented 1 year ago

On Ragel DSL that prevents us from turning everything into methods, I wonder if there's a performant way to maybe somehow "return" information about needed fnext/fbreak/fgoto from the action as a value and "dispatch" it maybe from a lambda that has access to lvars that are mutated by this DSL. Pseudo-code:

NO_ACTION = 0
FBREAK = 1
FNEXT_EXPR_BEGIN = 2
FNEXT_EXPR_VALUE = 4
# ...

def on_pattern_x
  # do something ...
  return FBREAK | FNEXT_EXPR_BEGIN
end

dispatch = ->(action) {
  if action & FBREAK
    fbreak;
  end

  if action & FNEXT_EXPR_BEGIN
    fnext expr_begin;
  end

  # ...
}

'pattern_x'  => { dispatch(on_pattern_x) }

I know, it's terrible

headius commented 1 year ago

I've done as much as I can do with this methodology. I got about halfway down the "flay" report before the "mass" got too low for it to be worth further refactoring.

Overall, this represents a reduction in F1 size from 19235 instructions to 13849, a reduction of nearly 5400 instructions. It also reduces the F0 lexer from 11691 to 9872, less 1819 instructions.

It cannot be merged now because p became @p. If given a diff based on this PR I could apply it, or #900 could be reverted and remade atop my PR.

iliabylich commented 1 year ago

FWIW, @p will have more overhead on all implementations than just passing p so we might reduce code complexity slightly but probably end up slower.

From my naive POV ivar/lvar access sounds like an array/hash lookup, so it should be roughly the same in terms of performance. Or is it because locals can be stored in registers on some platforms after JITing? I see no difference on MRI and as I understand on 3.2 with object shapes ivars will be even faster.

If there's somewhere we could live chat about this it would be a great help.

Sure, feel free to drop me an email with a link to any public chat and we can talk. Or is there a public JRuby/TruffleRuby chat that I could join?

headius commented 1 year ago

Pseudo-code

Yeah with the work in this PR a large part of the remaining duplicated code is entangled with state changes, so it's hard to go further without a way to avoid that. I'm not sure moving more of these variables into instance variables is the right approach, and allocating a carrier object would reduce code size but mean another allocation for every call to advance.

I would like to get this PR merged before we look at other more creative ways to reduce the lexer size.

Also current_literal is taken from literal everywhere

Aha, good catch! We can apply that on top of this PR once it merges again.

The p change is good too, but pretty please can we do it after this PR? I have 20 commits I would have to rebase.

iliabylich commented 1 year ago

The p change is good too, but pretty please can we do it after this PR? I have 20 commits I would have to rebase.

Done, there are no conflicts now.

allocating a carrier object would reduce code size but mean another allocation for every call to advance

I thought there are no allocations, it's a pure bit arithmetic that IIRC doesn't allocate any Ruby objects (because small numbers are "embedded" in VALUE in MRI)

headius commented 1 year ago

ivar/lvar access sounds like an array/hash lookup, so it should be roughly the same

Local variable access will indeed be the fastest in nearly all situations, since it does not have any additional lookup to do (the index of the variable in the stack, or the register in JIT, are emitted directly into the code).

Instance variables have improved over time, but even with all the latest changes in CRuby it will still require at least a couple memory indirections, plus some cache validation checks.

Array access should be slightly faster than the best instance variable optimizations on CRuby, but probably not by much.

Hash access is very slow in comparison to all of these.

Your benchmark of p vs @p may indicate that it's not enough overhead to be a problem, but I know @p will always be slower than p even on the best JITs because it will at least be a memory access (rather than a register) unless the object is completely inlined too.

is there a public JRuby/TruffleRuby chat

JRuby's chat is on Matrix: https://matrix.to/#/#jruby:matrix.org

I'm active there today along with @enebo who has also been looking at this work.

headius commented 1 year ago

it's a pure bit arithmetic that IIRC doesn't allocate any Ruby objects

Hmm... perhaps we can figure out a way to pack all the state variables into a single 64-bit Integer? That would be a single return value, immediate for most values on CRuby, and we could update multiple states at once.

This is well outside my understanding of Ragel's state machine, though.

Done, there are no conflicts now.

This PR is ready to merge any time! I ran rake test for each change along the way, so it should be pretty clean.

Your p change will require some methods to be tweaked.

headius commented 1 year ago

Thank you for the naming suggestions. I just did a best guess on what to call some of these. I will get these methods renamed after lunch.

headius commented 1 year ago

Updated method names based on review comments. All set!

iliabylich commented 1 year ago

@headius Thanks!

eregon commented 1 year ago

I missed this until now (and subscribed to the repo to see the next changes), it looks great :)

Array access should be slightly faster than the best instance variable optimizations on CRuby, but probably not by much.

A small note: Array access is typically slower because it needs to check bounds (no need with a Shape), normalize the index if negative and it has one more indirection.

Indeed local vars are still much better than ivars, it's a register access or less (e.g. the JIT can see where the value goes and not even need a register) vs an actual read/write into memory + the Shape check + other checks if not using Shapes. OTOH if there is any block in the method and that's not completely escape analyzed (unlikely here), that will mean local variables become memory accesses because those blocks can observe them, so then they become very similar to ivars with Shapes, but still potentially with more information about what can access it when (e.g., avoids redundant "shape checks" for ivars if no block in between).

whitequark / parser

Code size reduction for advance method #899