sharaelong opened 1 year ago
1 or 3 seems valid.
3. I believe 204800 is the start of the heap. More specifically, my point was that repeatedly calling `load 8 204800` is redundant, since `graph` is read-only after it is read from the input. So I believe function inlining, with help from other passes, will optimize it away.
Well, it seems `advance` takes a pointer as an argument, so I think passing the same pointer to different functions is somewhat dangerous.
Another analysis: I looked at the array access pattern in the bitcount3 case.
```
r1 = mul r3 4294967296 64
r1 = sdiv r1 4294967296 64
r1 = mul r1 4 64
r1 = add arg2 r1 64
r4 = load 4 r1
```
This IR pattern seems to be the compilation of the following:
```llvm
%idxprom = sext i32 %i.0 to i64
%arrayidx = getelementptr inbounds i32, i32* %confidence, i64 %idxprom
%1 = load i32, i32* %arrayidx, align 4
```
Then I expect we could actually construct a new IR sequence without using `getelementptr`, maybe? Also, is the `sext` really needed? Are there any constraints I don't know about? Specifically, I mean:
```
r1 = mul r3 4 64
r1 = add arg2 r1 64
r4 = load 4 r1
```
Is this invalid?
I think we need to analyze the assembly output from the current state of our compiler, to figure out what the performance bottlenecks are and whether there are any improvements to be had (we are lacking 'brilliant' ideas now...). I applied passes in the following order for this analysis:
**gcd**: This function shows a performance regression with the sprint2-finished compiler. We knew that `simplifycfg` contributes negatively by about 6%. However, due to the simple structure of `gcd(x, y)`, the `tailcallelim` pass compensates by about 5%. But I think there is much more room to optimize it: `gcd` is a mathematically symmetric function, yet the given code always checks every case, such as `x < y` and `x >= y`. AFAIK, the ideal implementation of gcd looks like this: First I tried to find an existing LLVM pass that performs this kind of analysis... but I failed. @goranmoomin may know something... So what can we do about this?
**floyd**: I'm going to make three comments here:
First, the `advance()` function does something very simple: it is 'pure' for a given input. However, it seems that even the GVN pass failed to take advantage of this. L22 and L23 compiled like this: Also, I have no idea why `load 8 204800` is called so many times. I expect it will be removed after function inlining, which allows LLVM's existing passes to do better analysis. `malloc_upto_8()` is one of the bottlenecks too! It corresponds to L17 and L19 of the C source code:
`nibble = num & 0xf` and the `arr[nibble]` access.
Also, I have thought about the possibility of using a constant array of length 16 here. `nibble` is constrained to values in [0, 16), so we might keep the `arr[nibble]` array access rather than a 16-branch switch statement, because we know every value at compile time. Maybe there are no copy instructions, so I suspect this is some kind of trick. However, the following instruction
```
r1 = mul r1 8 64
```
allows us to remove the mul & sdiv operations! Second, I want to ask whether this is the intended behavior of the load2aload pass. I looked at the assembly output for calculating `max()`: the `aload` instruction is too close to the next use of the `%r6` register, and this kind of situation happens frequently. Third, I think the GVN pass operates well on this case (since it uses logic similar to a segment tree, which showed an impressive performance gain in the 'wall' test cases), so I wonder why this test case shows only a 1.2% performance improvement.