* lexer.rl: extract strings lexing to lexer-strings.rl

iliabylich commented 1 year ago

WIP, but there's something to share.

In this PR I've extracted everything related to strings to sub-lexer (lexer-strings.rl). Code in this branch doesn't handle all cases yet, but it is able to lex most string literals and it is able to parse test_parser.rb. As I understand it doesn't change anything:

$ truffleruby --engine.TraceCompilation --experimental-options -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'

7.519174168002792
5.833842421066947
[engine] opt done     Parser::Lexer#advance                                       |AST 10982|Time 5694(2469+3225)ms|Tier 2|Inlined   0Y 119N|IR 34651/68777|CodeSize 279227|Addr 0x128100000|Src lexer-F0.rb:8555
[engine] opt deopt    Parser::Lexer#advance                                       |AST 10982|Calls/Thres   42409/    3|CallsAndLoop/Thres  765766/ 1000|Src lexer-F0.rb:8555
2.351237821043469
4.065150417969562
[engine] opt done     Parser::Lexer#advance                                       |AST 10982|Time 4284(1653+2631)ms|Tier 2|Inlined   0Y 119N|IR 34661/69206|CodeSize 270899|Addr 0x12aad1000|Src lexer-F0.rb:8555
1.04369637707714
0.9216261199908331
0.9226390749681741
1.279196677962318
0.925871066050604
0.8468390660127625
0.8662178260274231
0.811310556018725
1.0800499020842835
0.7605728390626609

@eregon @headius Please correct me if I'm reading it wrong, but CodeSize is roughly the same. I changed most rules to have beginning_of_literal_pattern c_any with delegation to sub-lexer, but looks like it's not a bottleneck.

eregon commented 1 year ago

This looks interesting, I'll have a look this week

eregon commented 1 year ago

I replied in https://github.com/whitequark/parser/issues/871#issuecomment-1385570115 to keep all my measurements together (I hope you don't mind). In short, this is great and makes every metric on TruffleRuby significantly better.

Please correct me if I'm reading it wrong, but CodeSize is roughly the same. I changed most rules to have beginning_of_literal_pattern c_any with delegation to sub-lexer, but looks like it's not a bottleneck.

The CodeSize compared to what? To https://github.com/whitequark/parser/issues/871#issuecomment-1362218586 maybe? We can't really compare if we use different TruffleRuby versions (I use a local build of TruffleRuby master) and runtime config (I use --jvm, I think you are using --native above). In my measurements the CodeSize is significantly smaller in this PR. But the CodeSize is not the metric that matters the most. The metric that matters the most for TruffleRuby is the GraalGraphSize. That's not shown by TraceCompilation, only as [engine] opt failed ... Graph too big to safely compile ... if it's above the limit, and [engine] opt done if it's within the limit. This PR brings the GraalGraphSize from 180k to 153k, so this is awesome.

iliabylich commented 1 year ago

Performance on MRI, master branch:

0.22344800014980137
0.22889799997210503
0.22801800002343953
0.22188099985942245
0.215658999979496
0.23656500014476478
0.22172700008377433
0.2242219999898225
0.21810900000855327
0.2218509998638183
0.22798700002022088
0.22605099994689226

Performance on MRI, patch:

0.23099999991245568
0.2553759999573231
0.23117999988608062
0.2359160000924021
0.22565700020641088
0.24311299994587898
0.23387099988758564
0.2327790001872927
0.2283179999794811
0.23300900007598102
0.2435709999408573
0.23459000000730157
0.23156400001607835

The difference comes from introducing indirection (things that used to be ivars/locals are a part of a different sub-lexer now), but I think it's acceptable, I wanted to do this refactoring for a long time.

Performance on TruffleRuby, master branch:

8.871330896159634
6.567537926137447
[engine] opt done     Parser::Lexer#advance                                       |AST 14159|Time 7165(2129+5035)ms|Tier 2|Inlined   0Y 175N|IR 41506/87802|CodeSize 356400|Addr 0x11d000000|Src lexer-F0.rb:11507
[engine] opt deopt    Parser::Lexer#advance                                       |AST 14159|Calls/Thres   47382/    3|CallsAndLoop/Thres  931543/ 1000|Src lexer-F0.rb:11507
3.492685919860378
4.992397190071642
[engine] opt done     Parser::Lexer#advance                                       |AST 14159|Time 6075(1958+4117)ms|Tier 2|Inlined   0Y 175N|IR 41516/88231|CodeSize 350344|Addr 0x12044b000|Src lexer-F0.rb:11507
1.803598809055984
0.9968494030181319

Performance on TruffleRuby, patch:

8.562273496063426
7.1901286309584975
[engine] opt done     Parser::Lexer#advance                                       |AST 10937|Time 5058(2158+2901)ms|Tier 2|Inlined   0Y 113N|IR 34209/66939|CodeSize 271855|Addr 0x12adea000|Src lexer-F0.rb:8543
[engine] opt deopt    Parser::Lexer#advance                                       |AST 10937|Calls/Thres   45337/    3|CallsAndLoop/Thres  817912/ 1000|Src lexer-F0.rb:8543
2.910490944981575
4.281942526111379
[engine] opt done     Parser::Lexer#advance                                       |AST 10937|Time 4525(1771+2754)ms|Tier 2|Inlined   0Y 113N|IR 34219/67368|CodeSize 265671|Addr 0x1299f9000|Src lexer-F0.rb:8543
1.076639185892418
0.9211365671362728
0.933931423118338

CodeSize went down from 356400 to 271855 (23%), compilation time improved from 7165ms to 5058ms (29%), so it's definitely worth it. Merging.

I'll take a look at performance (for the first time :D) once I'm done with size optimisations.

iliabylich commented 1 year ago

@eregon @headius I suppose now it's ok to "inline" methods back to Ragel actions in extracted sub-lexer, right? Sub-lexer is quite small, but this magic with taking/returning p makes it less readable, so I'd like to revert those changes.

eregon commented 1 year ago

Quite small is still 4810-3108 = 1702 lines for LexerStrings#advance Ragel is pretty "efficient" at generating huge amounts of code: it turns a 938 lines .rl file into 5203 lines :/ Tweaking the command-line to show both #advance methods:

truffleruby 23.0.0-dev-f3db7ba6, like ruby 3.1.3, GraalVM CE JVM [x86_64-linux]
$ truffleruby --jvm --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=153000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E '#advance|^[0-9]'     
[engine] opt done   id=5978  LexerStrings#advance                               |Tier 1|Time  1047( 610+437 )ms|AST 4413|Inlined   0Y  57N|IR   6565/ 17508|CodeSize   88918|Addr 0x7f46b3884060|Timestamp 1660571591456|Src lexer-strings.rb:3108
[engine] opt deopt  id=5978  LexerStrings#advance                               |                                                                                                               |Timestamp 1660802465675|Src lexer-strings.rb:3108
[engine] opt done   id=5978  LexerStrings#advance                               |Tier 2|Time  1491( 594+897 )ms|AST 4413|Inlined  28Y  63N|IR   8210/ 19793|CodeSize  108376|Addr 0x7f46b3b31ea0|Timestamp 1662199674045|Src lexer-strings.rb:3108
[engine] opt deopt  id=5978  LexerStrings#advance                               |                                                                                                               |Timestamp 1662213063821|Src lexer-strings.rb:3108

[engine] opt done   id=5978  LexerStrings#advance                               |Tier 1|Time   870( 444+426 )ms|AST 4413|Inlined   0Y  63N|IR   6959/ 18826|CodeSize   93653|Addr 0x7f46b3b898c0|Timestamp 1663086958447|Src lexer-strings.rb:3108
[engine] opt done   id=5978  LexerStrings#advance                               |Tier 2|Time  1447( 580+867 )ms|AST 4413|Inlined  27Y  64N|IR   8092/ 19772|CodeSize  102448|Addr 0x7f46b3c928e0|Timestamp 1664634103981|Src lexer-strings.rb:3108

[engine] opt done   id=5939  Lexer#advance                                      |Tier 1|Time  4399(1526+2873)ms|AST 10580|Inlined   0Y 108N|IR  14318/ 43010|CodeSize  240189|Addr 0x7f46b405ce40|Timestamp 1668163715899|Src lexer-F0.rb:8555
[engine] opt done   id=5939  Lexer#advance                                      |Tier 2|Time  4104(1440+2665)ms|AST 10580|Inlined   0Y 108N|IR  12251/ 36971|CodeSize  207576|Addr 0x7f46b469e720|Timestamp 1672271968370|Src lexer-F0.rb:8555
0.06205050200014739
0.06224363499995889
0.06184562899989032
0.06213130400010414
0.06224232599993229

To compare numbers more easily:

LexerStrings#advance |Tier 1|Time   870( 444+426 )ms|AST  4413|Inlined   0Y  63N|IR   6959/ 18826|CodeSize   93653|Src lexer-strings.rb:3108
Lexer#advance        |Tier 1|Time  4399(1526+2873)ms|AST 10580|Inlined   0Y 108N|IR  14318/ 43010|CodeSize  240189|Src lexer-F0.rb:8555

LexerStrings#advance |Tier 2|Time  1447( 580+867 )ms|AST  4413|Inlined  27Y  64N|IR   8092/ 19772|CodeSize  102448|Src lexer-strings.rb:3108
Lexer#advance        |Tier 2|Time  4104(1440+2665)ms|AST 10580|Inlined   0Y 108N|IR  12251/ 36971|CodeSize  207576|Src lexer-F0.rb:8555

We should look at Tier 1 numbers here, because in Tier 2 LexerStrings#advance inlines 27 methods and so of course is bigger due to that. So LexerStrings is about half the size in number of AST nodes (maybe the most intuitive notion of size), about half in number of compiler nodes (IR), more than half in number of calls (Inlined), a bit less than half in CodeSize. LexerStrings is much faster to compile, which is great (these measurements are on JVM CE no libgraal so compile time is not precise as it includes warmup of Graal itself). A CodeSize of 94 KB in Tier 1 is a huge method (that's a huge amount of assembly code for a single method). So from a JIT point of view, LexerStrings#advance is already huge, but at the same time much better than Lexer#advance. From a JIT point of view, smaller methods are always better (until they become reasonable size let's say ~10 lines then it's less clear) because it compiles faster (compilation is not strictly linear with amount of code, e.g., the register allocator) and because the JIT can then choose to inline what makes more sense and optimize that part better.

If the method becomes bigger, the JIT has less room, is of course forced to include everything in the bigger method but then might not be able to inline the most important part(s) of the method, and the slow paths included might prevent some optimizations (e.g. might give up on loop peeling or duplication because the graph is so big already). We can see here TruffleRuby (Graal) chose to inline 27 out of 64 calls, so 1) it's likely too much code to inline all of them and/or 2) some calls are less important and it's better for performance to not inline those.

It would be useful if @headius can check whether LexerStrings#advance compiles on JRuby. I suspect LexerStrings#advance might be too big for JRuby as it is.

So I would suggest to keep it as it is. For TruffleRuby it's worse for warmup and possibly worse for peak performance to make it bigger, but it should still compile. You are the maintainer though, so you are the one to decide. I would suggest to keep it as separate methods unless it proves too much a hassle in practice.

eregon commented 1 year ago

That said, from a quick (and maybe wrong) regexp search (\bp\n), it seems there are only 2 or few methods which return p, I only found extend_string_eol_heredoc_intertwined and e_heredoc_nl. Both of these are small, I think it's fine and wont' change things significantly to inline those. But maybe I missed some other methods returning p?

eregon commented 1 year ago

With a slightly better search with \bp =:

p = e_heredoc_nl(p)
p = current_literal.heredoc_e - 1
p = extend_string_eol_heredoc_intertwined(p)
p = extend_interp_var(current_literal)

heredoc_e seems an attribute reader, so not relevant. extend_interp_var is not trivial and called about 8 times in the .rb so that would blow up the size quite a bit. extend_string_eol_heredoc_intertwined seems called 9 times in the .rb, so still a sizable increase given the code would be "forced inlined" 9 times. e_heredoc_nl doesn't seem called at all?

iliabylich commented 1 year ago

e_heredoc_nl doesn't seem called at all?

You are still on the old version of the code, e_heredoc_nl doesn't exist anymore.

eregon commented 1 year ago

I used this PR's commit, 5b16cd0a8a50e1df7092cf6e62c1c1602cdcf9fa for the above.

headius commented 1 year ago

The new changes look great! With no flags to JRuby, everything compiles except the two advance methods.

Parser::Lexer#advance is now under 10k JRuby IR instructions, but forcing it to compile to JVM bytecode still produces too large of a method (JVM bytecode method size limit is 64k).
Parser::LexerStrings#advance is under 3k IR instructions, and can be forced to compile and run as JVM bytecode (we are considering bumping up our 1000 instruction limit).
The best performance, however, still comes from leaving the advance methods interpreted and allowing everything else to compile. Even when forcing the LexerStrings advance to JVM bytecode, the JVM decides it is still too big to native compile.

I will play with some flags and see if I can get more of this to compile in JRuby with recent changes.

whitequark / parser

* lexer.rl: extract strings lexing to lexer-strings.rl #905