wsmoses / Tapir-LLVM

Tapir extension to LLVM for optimizing Parallel Programs

[InstCombine] May make task closures large by sinking instructions from parent to child tasks #80

Open VictorYing opened 6 years ago

VictorYing commented 6 years ago

InstructionCombining sinks any instruction it can if that instruction is only used in a single successor of its current basic block: https://github.com/wsmoses/Tapir-LLVM/blob/2031611b65d67c1986af06c5f98b3c1d5e140d51/lib/Transforms/InstCombine/InstructionCombining.cpp#L2946-L2955
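Paraphrasing those lines (a simplified sketch, not the exact code at the link; the helper name is mine), the condition is roughly:

#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

// Simplified paraphrase: an instruction whose only user lives in a successor
// block with no other predecessors gets sunk into that block. (Side-effect and
// memory checks happen separately, inside TryToSinkInstruction itself.)
// Nothing here distinguishes a detach edge from an ordinary branch.
static bool wouldSinkIntoSuccessor(Instruction *I, BasicBlock *&Dest) {
  if (!I->hasOneUse())
    return false;
  BasicBlock *BB = I->getParent();
  BasicBlock *UserBB = cast<Instruction>(*I->user_begin())->getParent();
  if (UserBB == BB || UserBB->getUniquePredecessor() != BB)
    return false;
  if (!is_contained(successors(BB), UserBB))
    return false;
  Dest = UserBB;
  return true;
}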

This presents significant problems for me when compiling Swarm tasks. Consider a task that starts out like

foo = a + b * c + d * e + f * g + ....;
spawn {
  some_var = foo;
}

and assume all those variables a, b, c, d, etc. are local values held in registers, so in the Tapir SSA form there are no memory operations here other than the store inside the task to some_var, which is the only variable that is not promoted to a register. This task will have a very tiny closure, containing just the value foo computed in the parent. This is very good for targets such as Swarm that efficiently support tasks with very small numbers of arguments.

But InstCombine sees that all the arithmetic instructions are used only inside the detached block and sinks the whole computation there, producing a task like

spawn {
  some_var = a + b * c + d * e + f * g + ....;
}

Now this is a task with many inputs that must be captured from the parent task and passed to the child so that the child can perform the arithmetic. This significantly increases the task overhead.
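To make the effect concrete, here is purely illustrative C++ (hypothetical names; the real outlined task functions are generated by Tapir lowering and the Swarm backend, not written by hand) showing how the closure grows:

static long some_var;

// Before sinking: the parent computes foo, and the closure carries one word.
struct ClosureBefore { long foo; };
static void child_before(const ClosureBefore *c) { some_var = c->foo; }

// After sinking: every operand of the arithmetic becomes a captured live-in.
struct ClosureAfter { long a, b, c, d, e, f, g; };
static void child_after(const ClosureAfter *c) {
  some_var = c->a + c->b * c->c + c->d * c->e + c->f * c->g;
}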

On the other hand, there are certainly cases where sinking instructions from a parent task to a child task is a good idea that shortens the critical path. So this seems like a problem where you'd want better heuristics for deciding which instructions to sink (which, on the Swarm side of things, we've started to think about: https://github.mit.edu/swarm/Swarm-IR/blob/77c721b9f4bb279331daa4fd9ba12868fec75af3/lib/Transforms/Utils/SwarmUtils.cpp#L565-L586)
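As a rough sketch of the kind of heuristic I mean (an assumption-laden simplification, not the SwarmUtils code linked above), one could sink only when the move does not grow the set of values live-in to the detached task:

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

// Sinking I into the detached block removes one live-in (I's result) but adds
// one live-in per operand of I still defined outside the task. Constants are
// ignored; function arguments would need similar accounting.
static bool sinkingShrinksClosure(const Instruction *I,
                                  const BasicBlock *DetachedBlock) {
  unsigned AddedLiveIns = 0;
  for (const Use &U : I->operands())
    if (const auto *OpI = dyn_cast<Instruction>(U.get()))
      if (OpI->getParent() != DetachedBlock)
        ++AddedLiveIns;
  return AddedLiveIns <= 1; // I itself currently contributes exactly one.
}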

This is probably a much less important optimization issue for Tapir's compilation of Cilk tasks, which are usually at least somewhat coarser grained. My Swarm tasks, by contrast, tend to be very fine grained: I frequently have tasks that legitimately store to a single location, so moving extra instructions around really affects task overheads, which in turn can easily cause large changes in work efficiency. Nonetheless, I don't think this problem is fundamentally Swarm-specific. You can see my patch where I simply killed off instruction sinking through detaches here: https://github.mit.edu/swarm/Swarm-IR/commit/809027c0477795f1fe26f0b159426a8648321151
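The gist of that change is roughly the following (a simplified sketch assuming Tapir's DetachInst and its getDetached() accessor, not the literal diff):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

// Refuse to sink when the destination block is entered through the detached
// edge of a DetachInst, i.e. when the move would cross a task boundary.
static bool sinkWouldCrossDetach(const BasicBlock *From, const BasicBlock *To) {
  if (const auto *DI = dyn_cast<DetachInst>(From->getTerminator()))
    return DI->getDetached() == To;
  return false;
}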

neboat commented 6 years ago

Thanks for pointing this out.

I don't think Tapir should disable instruction sinking across detaches in general. Without taking task-invocation overhead into account, this form of instruction sinking on its own is a win in terms of parallelism: it does not increase the asymptotic work, and it can only reduce the critical-path length (span). In addition, instruction sinking often facilitates other optimizations, such as other instruction combining or simplification, which can reduce the work of the computation.

Task-invocation overhead does mess with this analysis. Although instruction sinking alone shouldn't affect asymptotic work (and can only reduce the span), more instructions might be needed in practice to pass the function arguments. Hence some form of improved cost analysis would be desirable, as you mentioned. The fact that instruction sinking can enable other compiler optimizations seems to complicate that cost analysis, though.

It's probably worth considering a cost analysis based on the target parallel-runtime library, which could be encoded in TargetLibraryInfo. For runtimes such as Swarm, instruction sinking might be harder to justify in terms of cost. For other runtimes it might not matter as much. In fact, for some runtimes it may be an optimization to serialize tasks as small as those in your example, due to the task-invocation overhead.
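As a purely hypothetical sketch of what such a target-specific cost query could look like (nothing below exists in TargetLibraryInfo today; the names are made up):

// The target runtime reports how expensive an extra task live-in is, and the
// pass weighs that against the span benefit of sinking.
struct ParallelRuntimeCosts {
  unsigned CostPerTaskLiveIn;      // high for Swarm-style fine-grained tasks
  unsigned TaskInvocationOverhead; // could also gate serializing tiny tasks
};

static bool worthSinking(unsigned AddedLiveIns, unsigned InstsRemovedFromSpan,
                         const ParallelRuntimeCosts &RT) {
  return InstsRemovedFromSpan > AddedLiveIns * RT.CostPerTaskLiveIn;
}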

VictorYing commented 6 years ago

I agree it's an interesting trade-off. One thing to consider is that we're not talking about sinking memory operations or anything with side effects: TryToSinkInstruction already refuses to move those across task boundaries. We're therefore mainly talking about sinking fairly cheap arithmetic instructions or things like address computations. At least in Swarm, and I imagine in other parallel systems as well, computation is cheap, memory accesses and communication/synchronization are expensive, and saving on task communication bandwidth tends to be a win.
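In other words, the candidate set is already restricted to something like this (a simplified assumption on my part, not the exact checks in TryToSinkInstruction):

#include "llvm/IR/Instruction.h"

// Only side-effect-free, non-memory, non-terminator instructions, i.e. cheap
// arithmetic and address computation, are candidates for sinking at all.
static bool isSinkCandidate(const llvm::Instruction &I) {
  return !I.mayHaveSideEffects() && !I.mayReadFromMemory() && !I.isTerminator();
}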