ykjit / yk

yk packages
https://ykjit.github.io/yk/
Other
29 stars 7 forks source link

A basic register allocator #1303

Closed ltratt closed 2 months ago

ltratt commented 2 months ago

Previously we used a spill allocator that looked a bit like a register allocator, but wasn't. In essence, we were hiding huge amounts of implicit dependencies, at multiple levels: the x86 level (what registers are we clobbering etc.); the meta-tracing level (what are we passing around over trace jumps etc).

This commit introduces a "proper" register allocator. It is a rather basic linear scan register allocator that spills a lot more than it needs to, but it still does some useful register allocation. One way to see this is through benchmarking: big_loop.lua (counting down from 1000000000 to 0) is about 40% faster after this change (from 25.7s -> 15.3s).

If I take a random chunk of should-be-similar code (modulo the indexes, which have changed for boring reasons) the spill allocator produces code like:

; %16: i32 = load %2
mov r12, [rbp-0x18]
mov r12d, [r12]
mov [rbp-0x64], r12d
; %17: i32 = add %16, 4294967295i32
mov r12d, [rbp-0x64]
mov r13, 0xFFFFFFFFFFFFFFFF
add r12d, r13d
mov [rbp-0x68], r12d

and the new register allocator produces:

; %12: i32 = load %2
mov r14, [rbp-0x08]
mov r14d, [r14]
; %13: i32 = add %12, 4294967295i32
mov r11d, 0xFFFFFFFF
add r14d, r11d

This is obviously still not anywhere optimal... but it's better!

More importantly this commit unearths most (though probably not quite all) of the implicit dependencies: it will make moving to something more sophisticated much easier. There is still a lot of improvement we could do here. I guess there's at least another 40-50% of performance left through just the register allocator itself. And probably the same sort of amount again is possible by changing other parts of the system that currently require (unnecessary) spills. I've also added several new tests that uncovered existing bugs, or bugs that were there but hidden by the spill allocator.

That's the good stuff. Now the bad. This commit is huge, but it's almost certainly not complete: I'm sure this commit will introduce new bugs. It's also messy, but if I keep polishing it until it's perfect, we'll be waiting for weeks. For example, there are actually two copy-and-pasted register allocators (one for general purpose x64 registers, one for floating point registers), because I realised a bit too late that I might be able to unify them. The good news is that if/when this goes in-tree, nudging it in a better direction bit-by-bit shouldn't be too difficult. We still use the WR0 hack in a few places, because removing some of them is quite a bit of work. The new allocator explicitly avoids using R12 because of this! It'll be good to remove that hack one day. Finally you'll notice that there are functions with the name hack in: you can probably guess why.

Let's start with the basics. This commit (at least temporarily) gets rid of the idea that we know how to do a platform independent register allocator: it's x64 only for the time being. [I suspect we'll be able to generalise this in the near future, but one thing at a time.] The "main" interface from the code generator to the register allocator is get_gp_regs and RegConstraint. For example:

match binop {
  BinOp::Add => {
    let size = lhs.byte_size(self.m);
    let [lhs_reg, rhs_reg] = self.ra.get_gp_regs(
      &mut self.asm,
      iidx,
      [RegConstraint::InputOutput(lhs), RegConstraint::Input(rhs)],
    );
    match size {
      1 => dynasm!(self.asm; add Rb(lhs_reg.code()), Rb(rhs_reg.code())),
      ...
    }

This says "return two x64 registers: lhs_reg will take a value as input and later contain an output value (clobbering the input value, which will be spilled if necessary); rhs_reg will take a value as input (and mustn't clobber it)". Those registers can then be used with dynasmrt as one expects.

The allocator keeps track of which registers have which trace instruction's values in and of where it has spilled an instruction's value: it guarantees to spill an instruction to at most one place on the stack. You can clobber values explicitly if needed: there is support for all the x64 horrors that we currently need (e.g. MUL's overwriting of two registers).

Here are some of the things I know need improving:

  1. Deopt can currently only read values from the stack so every live value must be spilled before a guard. Often this happens naturally, but we also have to spill values pointlessly. Teaching deopt about values in registers will avoid this.

  2. The way we handle trace loop JMPs works, but is a hack: we store the state of registers at traceloop_start and restore them at the backedge. A more generalised approach would be good.

  3. We could make some of the register allocator stuff platform independent.

  4. The register allocator is incredibly stupid and misses many opportunities not to spill things. It's really only a simplified linear scan allocator. Probably we should move to a reverse linear scan register allocator in the near future, now that we've uncovered most of the necessary dependencies.

Not to mention all the code messiness etc.

ptersilie commented 2 months ago

Deopt can currently only read values from the stack so every live value must be spilled before a guard.

I don't think there's a way around this. If the register allocator doesn't spill them, then deopt would have to do it manually (e.g. pushing them to the stack just before calling __yk_deopt). I think it's only going to have to spill caller-saved registers (9 in total) anyway, which shouldn't be too bad. When you say "before a guard", I hope you mean you only spill when the guard fails, and not before checking the guard, right?

vext01 commented 2 months ago

You probably know this, but in your binop example you put the operands into registers to perform the computation. Some (but not all) operations can take operands right out of memory, so register loads can be avoided. Similarly, some encodings allow constant operands, so there'd be no need to load them into a register either.

For example, x86_64 add: image

imm operands are "immediates" (constants) and m variants are memory operands.

The problem with covering all bases is that the codegen explodes with dealing with all the combinations of operand kinds. I guess we should see what other code generators to to avoid that.

Of course we can consider this future work, just wanted to highlight it, because it's a problem I don't currently know how to solve.

ltratt commented 2 months ago

@ptersilie Right now we spill before the guard. You're right that we want to spill after a guard: we could to that generically (i.e. spill all registers, with the advantage of less code generated) or specifically (per guard failure block). I don't have a view on whether generic / specific is better.

@vext01 Yes, as I said the example is not optimal for precisely the reason you point out. This PR is already way too big and taking too long and I don't think you'd appreciate it getting bigger or taking longer :p

vext01 commented 2 months ago

Yes, as I said the example is not optimal for precisely the reason you point out. This PR is already way too big and taking too long and I don't think you'd appreciate it getting bigger or taking longer :p

Understood. I wasn't sure if that was what you were referring to.

vext01 commented 2 months ago

OK, I started trying to review this and gave up about a third of the way in. It's just too big to review meaningfully.

From what I've seen, the design is different to how I expected it to be, but I'm happy to roll with this. It's likely to evolve over time anyway.

ltratt commented 2 months ago

Yeah, sorry, these mungo PRs are impossible to review :/ If it's any consolation, I don't know how to make it any smaller: this seems to be pretty much the minimum possible to break all the implicit dependencies. Future PRs should be normal sized!

ltratt commented 2 months ago

@vext01 Please wait for @ptersilie, plus I will also need to squash.

ptersilie commented 2 months ago

Okay, as far as I can tell this looks reasonable, especially given the constraints you mention in the description. We can tweak it slowly once it's merged. I will have some additional requirements for my remove-traceinputs-work, but we can discuss those offline, and they will be easier done in a follow up PR anyway.

ltratt commented 2 months ago

OK to squash then?

ptersilie commented 2 months ago

Please squash.

ltratt commented 2 months ago

Squashed.