Optimizing the holes compared with Intel SSE in scale module for ARM

xcl010 / libyuv

Automatically exported from code.google.com/p/libyuv

BSD 3-Clause "New" or "Revised" License

Optimizing the holes compared with Intel SSE in scale module for ARM #406

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

Issue 319 (64 bit ARMv8 support for libyuv) notes that the scale module on ARM 
has gaps compared with the Intel SSE implementation.

The following functions need to be implemented with ARM NEON:
ScaleRowDown2Linear
ScaleAddRows
ScaleFilterCols
ScaleColsUp2
ScaleARGBRowDown2Linear
ScaleARGBCols
ScaleARGBColsUp2
ScaleARGBFilterCols

Original issue reported on code.google.com by yang.zh...@arm.com on 25 Feb 2015 at 9:17

GoogleCodeExporter commented 9 years ago

I have completed three functions for ARM32/64 as follows:
ScaleRowDown2Linear
ScaleAddRows
ScaleARGBRowDown2Linear

For the other 5 functions:
3 functions (ScaleFilterCols, ScaleARGBCols and ScaleARGBFilterCols) are not 
well suited to NEON SIMD.
2 functions (ScaleColsUp2 and ScaleARGBColsUp2) are not covered by the test cases.

Is it necessary to implement these 5 functions with ARM NEON?

Original comment by yang.zh...@arm.com on 25 Feb 2015 at 9:23

GoogleCodeExporter commented 9 years ago

Original comment by yang.zh...@arm.com on 25 Feb 2015 at 9:25

GoogleCodeExporter commented 9 years ago

ScaleFilterCols and ScaleARGBFilterCols give about 3x performance for general 
purpose bilinear filtering on x86.

FYI, on Intel there appears to be a rounding issue. I'll first attempt to 
repro it with a unittest, and then tweak how the filtering works.

Original comment by fbarch...@google.com on 25 Feb 2015 at 8:38

GoogleCodeExporter commented 9 years ago

After checking the SSE version of ScaleFilterCols, it looks like two sets of 
data are processed per loop iteration. For NEON, unrolling the loop by two 
isn't very efficient.

I will try unrolling by a larger factor, such as 8, based on the dx variable.

Original comment by yang.zh...@arm.com on 26 Feb 2015 at 8:05
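
For readers unfamiliar with the column filter, here is a minimal scalar sketch of what a function like ScaleFilterCols computes. It is not the libyuv source: the name FilterColsSketch is hypothetical, and it assumes x and dx are 16.16 fixed-point source positions and that src has one readable pixel past the last position sampled.

```c
#include <stdint.h>

/* Hypothetical scalar sketch, not the libyuv implementation.
 * Assumption: x and dx are 16.16 fixed point (integer part selects the
 * source pixel, low 16 bits are the blend fraction). */
static uint8_t BlendSketch(int a, int b, int f) {
  /* Weighted average of neighbors a and b; f is in [0, 65535]. */
  return (uint8_t)((a * (65536 - f) + b * f) >> 16);
}

void FilterColsSketch(uint8_t* dst, const uint8_t* src,
                      int dst_width, int x, int dx) {
  for (int j = 0; j < dst_width; ++j) {
    int xi = x >> 16;    /* integer source column */
    int f = x & 0xffff;  /* fractional position between xi and xi + 1 */
    dst[j] = BlendSketch(src[xi], src[xi + 1], f);
    x += dx;             /* step to the next source position */
  }
}
```

Because each output pixel reads from a position that depends on the running value of x, the loads are not contiguous, which is what makes a straightforward SIMD version awkward.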

GoogleCodeExporter commented 9 years ago

Conceptually, filter columns attempts to work like filter rows, but it first 
rearranges the data so that adjacent pixels are put into different registers.

Original comment by fbarch...@google.com on 26 Feb 2015 at 6:40
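
The rearrangement described above can be sketched in plain C, with small arrays standing in for SIMD registers. This is an illustration under assumptions, not the SSE or NEON code: FilterCols8Sketch and kLanes are hypothetical names, the lane count of 8 is arbitrary, and x and dx are assumed to be 16.16 fixed point.

```c
#include <stdint.h>

enum { kLanes = 8 };  /* hypothetical lane count, standing in for a register */

/* Hypothetical sketch: produce kLanes output pixels in two phases. */
void FilterCols8Sketch(uint8_t* dst, const uint8_t* src, int x, int dx) {
  uint8_t left[kLanes];   /* "register" holding each left neighbor */
  uint8_t right[kLanes];  /* "register" holding each right neighbor */
  uint16_t frac[kLanes];  /* per-lane blend fraction */

  /* Phase 1: gather. Adjacent source pixels go into different arrays. */
  for (int i = 0; i < kLanes; ++i) {
    int xi = x >> 16;
    left[i] = src[xi];
    right[i] = src[xi + 1];
    frac[i] = (uint16_t)(x & 0xffff);
    x += dx;
  }

  /* Phase 2: blend lane by lane, the same shape as filtering two rows. */
  for (int i = 0; i < kLanes; ++i) {
    int f = frac[i];
    dst[i] = (uint8_t)((left[i] * (65536 - f) + right[i] * f) >> 16);
  }
}
```

Once the neighbors sit in separate arrays, the second loop has no data-dependent addressing, so it is the part that maps naturally onto vector instructions.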

GoogleCodeExporter commented 9 years ago

See bug 407 for a new function: ARGBToRGB565Dither()
Before shifting 8 bit RGB values down to 5 or 6 bits, add values from the 
dither matrix:

// Ordered 4x4 dither for 888 to 565.  Values from 0 to 7.
static const uint8 kDither565_4x4[16] = {
  0, 4, 1, 5,
  6, 2, 7, 3,
  1, 5, 0, 4,
  7, 3, 6, 2,
};

Do you have time to adapt the ARGBToRGB565 to add dither support?

Original comment by fbarch...@google.com on 10 Mar 2015 at 10:58
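
As a rough illustration of the dither step described above, the following sketch adds the matrix value for a pixel's position before truncating to 5, 6 and 5 bits. The helper names Clamp255Sketch and DitherPixelTo565Sketch are hypothetical, and three details are assumptions rather than facts from this thread: the sum is saturated at 255, the same dither value is applied to all three channels, and the matrix is indexed by (x & 3, y & 3).

```c
#include <stdint.h>

/* Ordered 4x4 dither for 888 to 565, values from 0 to 7 (from the comment). */
static const uint8_t kDither565_4x4[16] = {
  0, 4, 1, 5,
  6, 2, 7, 3,
  1, 5, 0, 4,
  7, 3, 6, 2,
};

/* Assumption: saturate so that adding the dither value cannot wrap. */
static int Clamp255Sketch(int v) { return v > 255 ? 255 : v; }

/* Hypothetical helper: convert one pixel at position (x, y) to RGB565. */
uint16_t DitherPixelTo565Sketch(uint8_t b, uint8_t g, uint8_t r,
                                int x, int y) {
  int d = kDither565_4x4[(y & 3) * 4 + (x & 3)];
  int b5 = Clamp255Sketch(b + d) >> 3;  /* 8 bits down to 5 */
  int g6 = Clamp255Sketch(g + d) >> 2;  /* 8 bits down to 6 */
  int r5 = Clamp255Sketch(r + d) >> 3;  /* 8 bits down to 5 */
  return (uint16_t)(b5 | (g6 << 5) | (r5 << 11));
}
```

A row function would call this per pixel with a fixed y, so only one row of four matrix values is needed per call.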

GoogleCodeExporter commented 9 years ago

Do you mean whether I can add NEON support for ARGBToRGB565Dither?

Currently, I'm working on
1. ScaleFilterCols
About 2x improvement. The patch is ready, pending review of the ScaleAddRows 
patch.

2. ScaleARGBCols
About 1.1x improvement. The patch is ready.

3. ScaleARGBFilterCols
In progress.

Once I complete these patches, I think I can handle this function.

Original comment by yang.zh...@arm.com on 11 Mar 2015 at 6:11

GoogleCodeExporter commented 9 years ago

I have completed the ARGBToRGB565Dither patch for ARM32/64 NEON.
Please check:
https://webrtc-codereview.appspot.com/49409004/

Original comment by yang.zh...@arm.com on 16 Mar 2015 at 6:25

GoogleCodeExporter commented 9 years ago

ScaleRowDown2Linear
ScaleAddRows
ScaleFilterCols
ScaleARGBRowDown2Linear
ScaleARGBCols
ScaleARGBFilterCols

These functions are merged into master.

reference:
https://code.google.com/p/libyuv/issues/detail?id=319

Original comment by yang.zh...@arm.com on 9 Apr 2015 at 6:31

Changed state: Fixed