xiaoyin0208 / lz4

Automatically exported from code.google.com/p/lz4
0 stars 0 forks source link

Enhancement: Remove some postincrement #50

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
With Visual Studio 2012, changing:
            *op++ = *ref++;
            *op++ = *ref++;
            *op++ = *ref++;
            *op++ = *ref++;
into:
            op[0] = ref[0];
            op[1] = ref[1];
            op[2] = ref[2];
            op[3] = ref[3];
            op += 4, ref += 4;

gives a 1.5% speed boost in my benchmark.

Original issue reported on code.google.com by strig...@gmail.com on 5 Dec 2012 at 7:21

GoogleCodeExporter commented 8 years ago
Thanks, that's great find. I'll look into it.
You seem to find new optimisations with a relative ease...

Original comment by yann.col...@gmail.com on 6 Dec 2012 at 12:56

GoogleCodeExporter commented 8 years ago
Enhancement request.

Original comment by yann.col...@gmail.com on 6 Dec 2012 at 1:34

GoogleCodeExporter commented 8 years ago
mmmh, this one is less successfull.
No gain with VS2008.
No loss either. It's really equivalent.

I will test more configurations to see if it gets better or worse on different 
compilers / OS.

Original comment by yann.col...@gmail.com on 7 Dec 2012 at 9:46

GoogleCodeExporter commented 8 years ago
VS 2010 : gain of about 0.5%.
This is very small, almost within margin error.
Still, no loss either.

Original comment by yann.col...@gmail.com on 7 Dec 2012 at 10:09

GoogleCodeExporter commented 8 years ago
OK, i checked the Assembler output,
and both codes are translated the same on VS2010.
They are in fact identical.

Your version is closer to the assembler result :
  000eb 0f b6 10     movzx   edx, BYTE PTR [eax]
  000ee 88 11        mov     BYTE PTR [ecx], dl

  000f0 0f b6 50 01  movzx   edx, BYTE PTR [eax+1]
  000f4 88 51 01     mov     BYTE PTR [ecx+1], dl

  000f7 0f b6 50 02  movzx   edx, BYTE PTR [eax+2]
  000fb 88 51 02     mov     BYTE PTR [ecx+2], dl

  000fe 0f b6 50 03  movzx   edx, BYTE PTR [eax+3]
  00102 88 51 03     mov     BYTE PTR [ecx+3], dl

This picked my curiosity, and i tried to reduce the number of instructions,
by merging the final addition into the table.

Hence, instead of doing :
            op[3] = ref[3];
            op += 4, ref += 4; 
            ref -= dec[op-ref];

the code would become :
            op[3] = ref[3];
            ref += inc32[op-ref]; 
            op += 4;

I checked the assembler output, and indeed, the second version is much shorter :
Version 1 :
  000fe 0f b6 50 03  movzx   edx, BYTE PTR [eax+3]
  00102 88 51 03     mov     BYTE PTR [ecx+3], dl
  00105 83 c1 04     add     ecx, 4
  00108 83 c0 04     add     eax, 4
  0010b 8b d1        mov     edx, ecx
  0010d 2b d0        sub     edx, eax
  0010f 2b 44 95 dc  sub     eax, DWORD PTR _dec$[ebp+edx*4]
  00113 eb 0a        jmp     SHORT $LN12@LZ4_uncomp@2

Version 2 :
  00101 0f b6 58 03  movzx   ebx, BYTE PTR [eax+3]
  00105 03 44 bd dc  add     eax, DWORD PTR _inc32$[ebp+edi*4]
  00109 88 59 03     mov     BYTE PTR [ecx+3], bl
  0010c 83 c1 04     add     ecx, 4
  0010f eb 0a        jmp     SHORT $LN12@LZ4_uncomp@2

In spite of the much reduced instruction sequence, version 2 is in fact slower 
than v1, by a sizable amount (~3%).

Well, sometimes, this is almost mystery...

Original comment by yann.col...@gmail.com on 8 Dec 2012 at 12:11

GoogleCodeExporter commented 8 years ago
I use VS2012, it is not the same for me. With postincrement:
            *op++ = *ref++;
004021FA  mov         al,byte ptr [edx]  
004021FC  mov         byte ptr [ecx],al  
004021FE  lea         eax,[ecx+1]  
            *op++ = *ref++;
00402201  movzx       ecx,byte ptr [edx+1]  
00402205  mov         byte ptr [eax],cl  
            *op++ = *ref++;
00402207  movzx       ecx,byte ptr [edx+2]  
0040220B  mov         byte ptr [eax+1],cl  
            *op++ = *ref++;
0040220E  movzx       ecx,byte ptr [edx+3]  
00402212  mov         byte ptr [eax+2],cl  
00402215  add         eax,3  
00402218  add         edx,4  

Without postincrement:
            op[0] = ref[0];
004021FA  movzx       eax,byte ptr [edx]  
004021FD  mov         byte ptr [ecx],al  
            op[1] = ref[1];
004021FF  movzx       eax,byte ptr [edx+1]  
00402203  mov         byte ptr [ecx+1],al  
            op[2] = ref[2];
00402206  movzx       eax,byte ptr [edx+2]  
0040220A  mov         byte ptr [ecx+2],al  
            op[3] = ref[3];
0040220D  movzx       eax,byte ptr [edx+3]  
            op[3] = ref[3];
00402211  mov         byte ptr [ecx+3],al  
            op += 4, ref += 4;
00402214  lea         eax,[ecx+4]  
00402217  add         edx,4  

Without postincrement has one less instruction.

Original comment by strig...@gmail.com on 8 Dec 2012 at 12:14

GoogleCodeExporter commented 8 years ago
I agree it's hard to tell a difference in speed.

Original comment by strig...@gmail.com on 8 Dec 2012 at 12:20

GoogleCodeExporter commented 8 years ago
Well, since your version seems at least equivalent, if not sometimes better, 
there's no problem in using it within LZ4.

I'm still puzzled as to why the tested "second version" is actually slower, 
despite of it being shorter.

I also note that the reduced number of instructions partly comes because v1 
needs to recalculate "op-ref" (within eax) :
  0010b 8b d1        mov     edx, ecx
  0010d 2b d0        sub     edx, eax
  0010f 2b 44 95 dc  sub     eax, DWORD PTR _dec$[ebp+edx*4]

While for v2, the result seems "ready to use" (within edi):
  00105 03 44 bd dc  add     eax, DWORD PTR _inc32$[ebp+edi*4]

Original comment by yann.col...@gmail.com on 8 Dec 2012 at 12:28

GoogleCodeExporter commented 8 years ago
Enhancement integrated into r85.

Original comment by yann.col...@gmail.com on 11 Dec 2012 at 1:41