[Sprint 3] Loop2Sum Pass - Connect Blocks and Adjust Conditions (Part 2)

Overview (For remind, identical to the last one)

Implemented Loop2Sum, where sum of loop operations are optimized to chunks of 7. Since the sum operation gets 8 summands as its operands, one of the operand must be used to add the cumulative sum. Hence, the loops can be truncated into units of 7. The sum operation optimizes 7 additions with the cost of two additions, hence is an optimizable aspect.

Simple example:

for(int i=0; i<n; ++i) {
  result += A[i] * B[n-i];
}

can be transformed into

for(int i=0; i<n/7; ++i) { //Vectorized Part
  c[0] = A[i * 7 + 0] * B[n - (i * 7 + 0)]
  c[1] = A[i * 7 + 1] * B[n - (i * 7 + 1)]
  ...
  c[6] = A[i * 7 + 6] * B[n - (i * 7 + 6)]
  result = sum (result, c[0], c[1], ... , c[6])
}
for(int i=7*(n/7); i<n; ++i) { //Scalar Part
  result += A[i] * B[n-i];
}

Side Note: It works similar to loop unrolling, hence it also reduces 6 branch costs + loop condition comparison costs. It is comparably minor, though.

Logic

Continued from Part 1 From Part 1, I obtained a vectorized block that has 7 consecutive copies of the loop block. Then, I connected the basic blocks in the following form. More blocks and its terminology should be defined.

       Header
       ()   \
Vectorized  SubHeader
              ()   \
       Scalarized  EndBlock

The loop is breaked down into two consecutive loops: the vectorized loop and the leftover parts(we name it, scalarized part). Since the scalarized loop is also a loop structure, we consider it as a new loop, which as a Header/Latch structure. The corresponding Header block is named: SubHeader which acts as an exit block to the original Header. The original exit block is connected to the SubHeader false branch, and named EndBlock.

Generate each blocks, connect them, control PHI nodes, and adjust data dependencies to satisfy SSA structure.

Then I changed the loop condition: the vectorized loop runs for for(i=a; i<b-7;) where i is incremented 7 times in the vectorized loop. The scalarized loop runs for for(; i<b; ++i), where the i value is maintained from the vectorized loop. The scalarized loop runs at most 6 times.

Implementation

In loop2sum.cpp: (continued from Part 1)

Get the EndBlock, and generate SubHeader by cloning the instructions of Header.
Adjust PHINodes in SubHeader, to get from Header and Scalarized.
Adjust PHINode in EndBlock, which is guaranteed to income in the form of PHI because of LCSSA structure generated by SimplifyCFG.
Adjust PHINodes in Header, to get from PreHeader and Vectorized.
Connect each blocks by branches.
Modify all variables in Scalarized to satisfy SSA.
Change branch condition in Header to i<b-7.

Unit tests (Will be added as checkfile/loop2sum/test#.ll)

Since the PassInfoMixIn LoopPassManager got stuck after modifying the loop, I could not further move on. Instead, just as Part 1, I added the outputs of loop2sum passing sum?CFG.ll as sum?CFGoutput2.txt where ?=1,4,5.

?=5 cannot be optimized because of harmful instructions. The other cases showed a nice result of vectorizing and connecting blocks.

willi19 / swpp202301-compiler-team6