
[Sprint 3, Wrap up] Analysis of assembly file for low-optimized test cases #33

Open sharaelong opened 1 year ago

sharaelong commented 1 year ago

I think we need to analyze the assembly output from the current state of our compiler, to figure out where the performance bottlenecks are and whether any improvements are left (we are lacking 'brilliant' ideas right now...). For this analysis I applied the passes in the following order:

    // Add loop-level opt passes below
    FPM.addPass(llvm::SimplifyCFGPass());
    FPM.addPass(llvm::createFunctionToLoopPassAdaptor(loop2sum::Loop2SumPass()));
    FPM.addPass(llvm::SimplifyCFGPass());
    FPM.addPass(llvm::createFunctionToLoopPassAdaptor(std::move(LPM)));
    // Add function-level opt passes below
    FPM.addPass(llvm::PromotePass());
    FPM.addPass(llvm::TailCallElimPass());
    FPM.addPass(arithmeticpass::ArithmeticPass());
    FPM.addPass(branchpredict::BranchPredictPass());
    FPM.addPass(load2aload::Load2AloadPass());
    FPM.addPass(llvm::GVNPass());

    CGPM.addPass(llvm::createCGSCCToFunctionPassAdaptor(std::move(FPM)));
    // Add CGSCC-level opt passes below

    MPM.addPass(llvm::createModuleToPostOrderCGSCCPassAdaptor(std::move(CGPM)));
    // Add module-level opt passes below
    MPM.addPass(heap2stack::Heap2StackPass());
    MPM.addPass(llvm::VerifierPass());
  1. gcd: This function showed a performance regression with the sprint2-finished compiler. We knew that simplifycfg contributes negatively by about 6%. However, due to the simple structure of gcd(x, y), the tailcallelim pass compensates for about 5% of that. But I think there is a lot more room to optimize it. gcd is a mathematically symmetric function, yet the given code always checks every case such as x < y and x >= y. AFAIK, the ideal implementation of gcd looks like this:

    int gcd(int x, int y) {
        if (x == 0) return y;
        return gcd(y % x, x);
    }

    First I tried to find an existing LLVM pass that analyzes this kind of thing... but I failed. @goranmoomin maybe knows something... So what can we do here? For reference, my guess at the shape of the code we are actually compiling is sketched below.
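    A hypothetical reconstruction of the symmetric version (the actual test source may differ; gcd_symmetric is just my name for it):

    // Both recursive calls are in tail position, so tailcallelim can turn
    // them into a loop, but the x < y / x >= y comparison still runs on
    // every iteration; the modulo form above needs no comparison at all.
    int gcd_symmetric(int x, int y) {
        if (x == 0) return y;
        if (y == 0) return x;
        if (x < y) return gcd_symmetric(x, y % x);
        return gcd_symmetric(x % y, y);
    }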

  2. floyd: I'm going to make two comments here:

    • A function inlining pass would do something more powerful than we might expect. We can see that the advance() function does something very simple; it is 'pure' for a given input. However, it seems that even the GVN pass failed to take advantage of this. L22 and L23 compiled like this:
      start floyd 0:
      .entry:
      r1 = load 8 204800
      r2 = call advance r1
      r1 = load 8 204800
      r1 = call advance r1
      r1 = call advance r1
      br .while.cond

    • Also, I have no idea why load 8 204800 appears so many times. I expect it would be removed after function inlining, which would let LLVM's existing passes do better analysis. malloc_upto_8() is one of the bottlenecks too! A sketch of wiring the inliner into our pipeline follows.
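      If inlining is the next thing to try, here is a minimal sketch on top of the pipeline above, assuming LLVM's new-PM InlinerPass from llvm/Transforms/IPO/Inliner.h (not benchmarked yet):

      #include "llvm/Transforms/IPO/Inliner.h"

      // Run the standard inliner at the CGSCC level, before the
      // function-level passes, so that GVN afterwards can see through
      // advance() and merge the repeated `load 8 204800`.
      CGPM.addPass(llvm::InlinerPass());
      CGPM.addPass(llvm::createCGSCCToFunctionPassAdaptor(std::move(FPM)));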

  3. bitcount3: First, I don't know why these two consecutive mul & sdiv instructions exist:
    .if.end:
    r1 = urem r3 16 32
    r1 = mul r1 4294967296 64
    r1 = sdiv r1 4294967296 64
    r1 = mul r1 4 64
    r1 = add r1 204800 64

    It corresponds to L17 and L19 of the C source code: nibble = num & 0xf and the arr[nibble] access.

Also, I have thought about the possibility of using a constant array of length 16 here. nibble is constrained to the range [0, 16), so we might replace the 16-branch switch statement with an access into a constant array for arr[nibble], since we know every value at compile time. A sketch of the idea is below.
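A minimal sketch of the lookup-table idea, assuming bitcount3 counts the set bits of each 4-bit nibble (hypothetical names; the actual test source may differ):

    // Popcount of each nibble value 0..15, fully known at compile time.
    static const int kNibbleBits[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                        1, 2, 2, 3, 2, 3, 3, 4};

    int bitcount3(unsigned num) {
        int count = 0;
        while (num != 0) {
            count += kNibbleBits[num & 0xf]; // replaces a 16-way switch
            num >>= 4;
        }
        return count;
    }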

  4. friend: First, I also saw the same thing as above:
    r1 = mul arg1 4294967296 64
    r1 = sdiv r1 4294967296 64
    r1 = mul r1 8 64

    Maybe it's because there are no copy instructions, so I suspect this is some kind of trick. However, the following instruction r1 = mul r1 8 64 should allow us to remove the mul & sdiv operations!

Second, I want to ask whether this is the intended behavior of the load2aload pass. I found this assembly output for calculating max():

r6 = mul r6 4 64
r8 = add r2 r6 64
r6 = aload 4 r8
r6 = icmp sgt r7 r6 32
br r6 .cond.true58 .cond.false62

The aload instruction is placed too close to the next use of the r6 register, so there is little latency to hide. Moreover, this kind of situation happens frequently:

r4 = mul r5 4294967296 64
r4 = sdiv r4 4294967296 64
r4 = mul r4 4 64
r6 = add r2 r4 64
r4 = load 4 r6
r5 = aload 4 r1
r4 = icmp sgt r4 r5 32

Third, I think the GVN pass operates well on this case (it uses logic similar to a segment tree, which showed an impressive performance gain on the 'wall' test cases), so I wonder why this test case shows only a 1.2% performance improvement.

  5. binary_tree: ...what?
    r1 = mul r4 1 64
    r2 = mul r2 1 64
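    These multiplications by 1 are plain no-ops. If they already exist at the IR level, a trivial fold would remove them; here is a sketch using LLVM's PatternMatch (InstCombine normally catches this on IR, so if our backend introduces them during lowering, the same check belongs in the emitter instead):

    #include "llvm/IR/Function.h"
    #include "llvm/IR/PatternMatch.h"

    // Fold `mul X, 1` into X and drop the instruction.
    static bool foldMulByOne(llvm::Function &F) {
      using namespace llvm::PatternMatch;
      bool Changed = false;
      for (auto &BB : F) {
        for (auto It = BB.begin(); It != BB.end();) {
          llvm::Instruction &I = *It++;
          llvm::Value *X;
          if (match(&I, m_Mul(m_Value(X), m_One()))) {
            I.replaceAllUsesWith(X);
            I.eraseFromParent();
            Changed = true;
          }
        }
      }
      return Changed;
    }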
willi19 commented 1 year ago
  1. I believe 204800 is the start of the heap.
  2. It's weird; I thought GVN could take care of those cases.

1 or 3 seems valid.

sharaelong commented 1 year ago

> I believe 204800 is the start of the heap.

More specifically, my point was that calling load 8 204800 multiple times is redundant, since the graph is read-only after being read from the input. So I believe function inlining will let other passes optimize it away.

willi19 commented 1 year ago

Well, it seems advance takes a pointer as an argument, so I think assuming memory is unchanged across calls that receive the same pointer is somewhat dangerous, since the callee might write through it.

sharaelong commented 1 year ago

Another analysis: I looked at the array access pattern in the bitcount3 case.

r1 = mul r3 4294967296 64
r1 = sdiv r1 4294967296 64
r1 = mul r1 4 64
r1 = add arg2 r1 64
r4 = load 4 r1

This kind of IR pattern seems to be the compilation of the following:

%idxprom = sext i32 %i.0 to i64
%arrayidx = getelementptr inbounds i32, i32* %confidence, i64 %idxprom
%1 = load i32, i32* %arrayidx, align 4

Then I expect that we could construct a new IR sequence without using getelementptr, maybe? Also, is the sext really needed? Are there any constraints I don't know about? Specifically, I mean:

r1 = mul r3 4 64
r1 = add arg2 r1 64
r4 = load 4 r1

Is this invalid?
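My current reading of the pattern (an assumption about the target's mul/sdiv semantics, not verified against the ISA spec): the mul/sdiv pair by 4294967296 = 2^32 implements exactly the sext above, so the shorter 3-instruction sequence would be valid only when r3 is known to hold a non-negative value with zeroed upper bits (e.g. a loop counter). A small C++ check of the arithmetic:

    #include <cassert>
    #include <cstdint>

    // Models `mul r 4294967296 64` followed by `sdiv r 4294967296 64`,
    // assuming a wrapping 64-bit mul and truncating signed division on a
    // two's-complement machine: the pair sign-extends the low 32 bits of r.
    int64_t sext_low32(uint64_t r) {
        int64_t hi = static_cast<int64_t>(r << 32); // wrapping mul by 2^32
        return hi / (int64_t(1) << 32);             // exact division, so truncation is harmless
    }

    int main() {
        assert(sext_low32(0x00000000FFFFFFFFull) == -1); // a negative i32 stays negative
        assert(sext_low32(0xDEADBEEF00000005ull) == 5);  // upper garbage is discarded
        assert(sext_low32(7) == 7); // for a non-negative index the pair is a no-op,
                                    // which is when dropping it would be safe
    }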