Closed germanium32 closed 1 year ago
Additional implementation points:
Header
() \
VectorizedLoop Header2
() \
ScalarLoop Exit
Found a problem: the program removes all instructions that use the Accumulator Variable, which is unwanted in a case such as:
for(int i=0; i<n; ++i) {
sum += i * i;
sum2 += sum;
}
I will alter the code to detect loops that only use the Accumulator Variable once.
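A minimal sketch of such a check, using a source-level toy model rather than LLVM IR. The `Stmt` struct and both helper names are hypothetical and exist only for this illustration; in the real pass the same question would be asked about the users of the accumulator's IR values.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy model of one loop-body statement: the variable it writes and the variables it reads.
struct Stmt {
    std::string target;
    std::vector<std::string> reads;
};

// Count the statements in the loop body that read `name`.
inline int count_readers(const std::vector<Stmt>& body, const std::string& name) {
    int readers = 0;
    for (const Stmt& stmt : body) {
        if (std::find(stmt.reads.begin(), stmt.reads.end(), name) != stmt.reads.end()) {
            ++readers;
        }
    }
    return readers;
}

// The loop is a candidate only if the accumulator is read exactly once,
// namely by its own update (`sum += ...`).
inline bool accumulator_used_once(const std::vector<Stmt>& body, const std::string& acc) {
    return count_readers(body, acc) == 1;
}
```

With this model, a body containing only `sum += i * i` passes, while adding `sum2 += sum` makes `sum` a two-reader variable and the loop is rejected.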
Multiple uses of the variable name U found: LPMUpdate &U and Uses &U. Change the latter variable name.
What if n%7 !=0? I can't find any logic.
Connecting blocks and controlling branches will be implemented later. In this sprint, I only implemented the vectorization block part.
https://llvm.org/doxygen/classllvm_1_1Loop.html
The isCanonical function can check whether the loop has the ++i form. How about using it?
Overview
Implemented Loop2Sum, which optimizes the additions of an accumulating loop in chunks of 7. Since the sum operation takes 8 summands as its operands, one of the operands must carry the cumulative sum. Hence, the loop iterations are grouped into units of 7. The sum operation performs 7 additions at the cost of two additions, so the transformation is profitable.
Simple example:
can be transformed into
Side Note: It works similarly to loop unrolling, so it also saves the cost of 6 branches plus the loop condition comparisons. That saving is comparatively minor, though.
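As a rough illustration of the chunking described in the Overview, the sketch below models the transformation in plain C++ rather than LLVM IR. `sum8` is a hypothetical stand-in for the 8-operand sum operation, and `body` is an arbitrary per-iteration summand; both names are assumptions made for this example. One operand carries the running total, so each call consumes 7 iterations, and leftover iterations fall back to the original scalar loop.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the 8-operand sum operation.
inline std::int64_t sum8(const std::array<std::int64_t, 8>& operands) {
    std::int64_t total = 0;
    for (std::int64_t x : operands) total += x;
    return total;
}

// Example per-iteration summand (an assumption for this sketch).
inline std::int64_t body(std::int64_t i) { return i * i; }

// Original form: one addition per iteration.
inline std::int64_t scalar_loop(std::int64_t n) {
    std::int64_t sum = 0;
    for (std::int64_t i = 0; i < n; ++i) sum += body(i);
    return sum;
}

// Transformed form: 7 iterations per sum8 call; the first operand is the running sum.
inline std::int64_t chunked_loop(std::int64_t n) {
    std::int64_t sum = 0;
    std::int64_t i = 0;
    for (; i + 7 <= n; i += 7) {
        const std::array<std::int64_t, 8> operands = {
            sum,         body(i),     body(i + 1), body(i + 2),
            body(i + 3), body(i + 4), body(i + 5), body(i + 6)};
        sum = sum8(operands);
    }
    for (; i < n; ++i) sum += body(i);  // scalar remainder when n is not a multiple of 7
    return sum;
}
```

Both functions should return the same value for any non-negative `n`; only the number of add-like operations and branches differs.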
Logic
The pass is very similar to automatic vectorization, where loops are vectorized, or split into chunks, for parallelization. We do basically the same thing, with a slight difference in how the Induction Variable and the Accumulator Variable are handled. The Induction Variable is the variable that controls the loop. A common example is the i variable in for(int i=0; i<n; ++i). The Accumulator Variable is the variable that stores the cumulative sum: the sum variable in for(~) sum+=a[i].
Since LLVM IR must satisfy SSA form, simply repeating all instructions 7 times doesn't give a valid result. Hence, for each step, we must link the older instruction and the newer instruction so that every copy refers to the correct values. Also, in most cases, each instruction depends on the Induction Variable, so the Induction Variable must be handled meticulously.
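The renaming and linking can be sketched with a toy IR in plain C++. This is not the pass's code: `Inst`, `unroll`, and the `.k` naming scheme are hypothetical, and the value map only mimics the idea of mapping each original value to its name in the current copy. Copy k gets fresh definition names, reads the induction variable as a separate value meant to hold iv + k, and takes the previous copy's accumulator result as its accumulator input.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy SSA instruction: the value it defines, its opcode, and its operand names.
struct Inst {
    std::string def;
    std::string op;
    std::vector<std::string> operands;
};

// Clone `body` `copies` times while keeping every definition name unique.
// `iv` is the induction variable, `acc` the accumulator input of the body,
// and `acc_next` the value the body computes as the updated accumulator.
inline std::vector<Inst> unroll(const std::vector<Inst>& body, const std::string& iv,
                                const std::string& acc, const std::string& acc_next,
                                int copies) {
    std::vector<Inst> out;
    std::map<std::string, std::string> vmap;  // original name -> name in current copy
    for (int k = 0; k < copies; ++k) {
        const std::string suffix = "." + std::to_string(k);
        if (k > 0) {
            // Materialize iv + k as its own value for this copy.
            Inst step{iv + suffix, "add", {iv, std::to_string(k)}};
            out.push_back(step);
            vmap[iv] = step.def;
            // Link: this copy accumulates onto the previous copy's result.
            vmap[acc] = vmap[acc_next];
        }
        for (const Inst& inst : body) {
            Inst clone{inst.def + suffix, inst.op, {}};
            for (const std::string& operand : inst.operands) {
                auto it = vmap.find(operand);
                // Unmapped names are loop-invariant or loop inputs and stay as-is.
                clone.operands.push_back(it == vmap.end() ? operand : it->second);
            }
            vmap[inst.def] = clone.def;
            out.push_back(clone);
        }
    }
    return out;
}
```

For a body of `sq = mul i, i` and `sum.next = add sum, sq`, two copies yield `sum.next.1 = add sum.next.0, sq.1`, where `sq.1` is computed from `i.1`.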
After converting several C files into LL files, I found a common compilation pattern. First, a loop is constructed from the cond part, the body part, the inc part, and the exit part. (Each part may consist of a single block or of multiple blocks.) Also, in most cases the inc part is merged into the body part if we apply SimplifyCFG.
Hence, I assumed a simple form of loop structure, the Two-Block Loop. The Two-Block Loop consists of a cond part and a body part, and each part consists of a single block. Therefore, the loop is a 2-block structure. Such a structure has no other branches (if/else statements) inside each block, so we only need to traverse two blocks, which greatly simplifies the implementation. Following the convention of loop analysis, I call the cond block the Header and the body block the Latch.
Also, we assume that the sum operation is vectorizable. That is, the 7 instructions can be compressed into a single block, without any dependencies among them.
From Here, Further Implementation Needed
After vectorizing the Latch, we must connect the basic blocks depending on the trip count. If 7 trips are impossible, control must fall into a Scalar part, which is the original loop Latch. To simplify computing the trip count, we assume that the Induction Variable increments by 1 each step, and that the end condition is of the form i<n.
However, completing the implementation was hardly possible, since the implementation up to this point was already complex, exceeding the line diff count bounds (+370 or so).
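Under the stated assumptions (step of 1, end condition i<n), the split between the vectorized block and the scalar fallback can be sketched as follows. This is a plain C++ model, not the unimplemented branch logic itself: `TripSplit`, `split_trip_count`, and `walk_blocks` are hypothetical names, and the guards `i + 7 <= n` and `i < n` are one possible choice for the Header and Header2 conditions.

```cpp
#include <cstdint>

struct TripSplit {
    std::int64_t vector_trips;  // executions of the 7-wide vectorized block
    std::int64_t scalar_trips;  // leftover executions of the original Latch
};

// Closed form for `for (i = start; i < n; ++i)`.
// Assumes n - start does not overflow.
inline TripSplit split_trip_count(std::int64_t start, std::int64_t n) {
    const std::int64_t total = n > start ? n - start : 0;
    return {total / 7, total % 7};
}

// The same split obtained by walking the blocks:
// Header -> VectorizedLoop, then Header2 -> ScalarLoop, then Exit.
// Assumes i + 7 does not overflow.
inline TripSplit walk_blocks(std::int64_t start, std::int64_t n) {
    TripSplit seen{0, 0};
    std::int64_t i = start;
    while (i + 7 <= n) {  // Header: a full chunk of 7 is still available
        ++seen.vector_trips;
        i += 7;
    }
    while (i < n) {       // Header2: fewer than 7 iterations remain
        ++seen.scalar_trips;
        i += 1;
    }
    return seen;          // Exit
}
```

For example, a loop from 0 to 23 runs the vectorized block 3 times and the scalar block twice, which also covers the case where n is not a multiple of 7.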
Implementation
In opt.cpp: add FPM.addPass(createFunctionToLoopPassAdaptor(loop2sum::Loop2SumPass())) under the SimplifyCFG pass, which guarantees the Two-Block structure. createFunctionToLoopPassAdaptor was used to apply loop analysis over FPM. I think there are updates in the main repo that add loop passes.
In loop2sum.cpp:
for(int i=a; i<b; ++i).
vec(i), where i is the vectorization index.
sum operation at the end.
Unit tests (Will be added as checkfile/loop2sum/test#.ll)
There are currently no unit tests, but I have checked that the loops were transformed into vectorized loops correctly, with no contradictions in dependencies or SSA.