[Sprint 2] Loop2Sum Pass - Vectorize Loop Block (Incomplete, Part1)

germanium32 commented 1 year ago

Overview

Implemented Loop2Sum, where sum of loop operations are optimized to chunks of 7. Since the sum operation gets 8 summands as its operands, one of the operand must be used to add the cumulative sum. Hence, the loops can be truncated into units of 7. The sum operation optimizes 7 additions with the cost of two additions, hence is an optimizable aspect.

Simple example:

for(int i=0; i<n; ++i) {
  result += A[i] * B[n-i];
}

can be transformed into

for(int i=0; i<n/7; ++i) { //Vectorized Part
  c[0] = A[i * 7 + 0] * B[n - (i * 7 + 0)]
  c[1] = A[i * 7 + 1] * B[n - (i * 7 + 1)]
  ...
  c[6] = A[i * 7 + 6] * B[n - (i * 7 + 6)]
  result = sum (result, c[0], c[1], ... , c[6])
}
for(int i=7*(n/7); i<n; ++i) { //Scalar Part
  result += A[i] * B[n-i];
}

Side Note: It works similar to loop unrolling, hence it also reduces 6 branch costs + loop condition comparison costs. It is comparably minor, though.

Logic

The pass is very similar to automated vectorization, where loops can be vectorized, or splitted into chunks, which is used for parallelization. We basically do the same thing, a slight difference on controlling the Induction Variable and the Accumulator Variable. The Induction Variable is the variable that controls the loop. A common example is the i variable in for(int i=0; i<n; ++i). The Accumulator Variable is the variable that stores the cumulative sum: the sum variable in for(~) sum+=a[i].

Since LLVM IR must satisfy SSA form, simply repeating all instructions 7 times doesn't give a valid result. Hence, for each step, we must link the older instruction and newer instruction where each variable uses each other. Also, in most cases, each instructions are dependent on the Induction Variable, so the Induction Variable must be handled meticulously.

After I converted several C files into LL files, I have found a common compilation style. First, a loop is constructed as the cond part, body part, inc part, and the exit part. (Each part can be consisted of a single block, or multiple blocks) Also, most of the inc part is merged into the body part if we apply SimplifyCFG.

Hence, I assumed a simple form of loop structure, the Two-Block Loop. The Two-Block Loop consists of cond part and body part, and each part consists of a single block. Therefore, the loop is a 2-block structure. Such structure as no other branches(if/else statements) inside each block, so we can traverse over only two blocks, which extremely simplifies the implementation. In fact, I denoted the cond block, the Header, and the body block, Latch, which is conventional in loop analysis.

Also, assume that the sum operation is vectorizable. That is, 7 instructions can be compressed in a single block, without any dependencies.

From Here, Further Implementation Needed After vectorizing the Latch, we must connect the basic blocks depending on the trip count. If 7 trips are impossible, the LLVM IR must fall into a Scalar part, which is the original loop Latch. In order to simplify counting the trip count, we assume that the Induction Variable increments by 1 each step, and the end condition is of the form i<n.

However, completing the implementation was merely possible, since the implementation up to this point already was complex; exceeding the line diff count bounds. (+370 or so)

Implementation

In opt.cpp: add FPM.addPass(createFunctionToLoopPassAdaptor(loop2sum::Loop2SumPass()) under the SimplifyCFG pass, which guarantees the Two-Block structure. createFunctionToLoopPassAdaptor was used to apply loop analysis over FPM. I think there are updates in the main repo, which adds loop passes.

In loop2sum.cpp:

Get the Header and Latch block, and check whether the loop is in Two-Block loop structure.
Find the Induction/Accumulator Variable searching over phi nodes, checking variable dependencies, and add operations.
Check whether the loop has simple increments, that is, in the form of for(int i=a; i<b; ++i)
Check if the Latch block contains harmful instructions that may strangle data dependencies.
Make a Vectorized Block that connects 7 consecutive loops in one block.
Adjust variables that are dependent to each other, especially the Induction Variable. Variables in each loop step has a suffix, .vec(i) where i is the vectorization index.
Remove all add operations that uses the Accumulator Variable, and insert a sum operation at the end.
Must be implemented further, e.g. on Sprint 3 or wrap up

Unit tests (Will be added as checkfile/loop2sum/test#.ll)

There are no unit tests currently, but I have checked that the loops were transformed into vectorized loops well, having no contradiction in dependencies or SSA.

germanium32 commented 1 year ago

Additional implementation points:

I think I can add NEQ to the comparison condition, next to SLT, ULT

Need extra block between Header and Scalar loop block. In total,

       Header
       ()   \
VectorizedLoop  Header2
              ()   \
       ScalarLoop  Exit

more..

germanium32 commented 1 year ago

Found some problems, the program removes all instructions that use the Accumulator Variable, which can be unwanted, as in such case:

for(int i=0; i<n; ++i) {
  sum += i * i;
  sum2 += sum;
}

I will alter the code to detect loops that only use the Accumulator Variable once.

germanium32 commented 1 year ago

Multiple uses of U variable found: LPMUpdate &U and Uses &U. Change the latter variable names.

willi19 commented 1 year ago

What if n%7 !=0? I can't find any logic.

germanium32 commented 1 year ago

What if n%7 !=0? I can't find any logic.

Connecting blocks and controlling branches will be implemented further on. Up to this sprint, I only implemented the vectorization block part.

willi19 commented 1 year ago

https://llvm.org/doxygen/classllvm_1_1Loop.html

isCanonical function can check the loop is form of ++i. How about using this?

willi19 / swpp202301-compiler-team6