Too many cpu usage for loop restoration filters.

xuguangxin commented 5 years ago

Just finish a profile on rav1e. I use following command: "rav1e 1080p.y4m -o test.ivf -l 10".

Seems cpu usage for deblocking is very high. sgrproj_stripe_filter used 24.92% cpu on my desktop. And the entire function is not SIMD optimized. Consider the current development phase, is it worth using some SIMD to speed up it?

thanks

lu-zero commented 5 years ago

From a cursory look it seems that there are few tweaks that can be done on the inner functions, surely it looks like a target worth of optimization.

ycho commented 5 years ago

FYI, sgrproj_stripe_filter() is one of two LR (Loop Restoration) filters and not from deblocking filter. Another LR filter is a Wiener filter.

tdaede commented 5 years ago

Yeah, it's pretty bad. There are two things we can do to massively improve this: 1) Take SIMD from dav1d. This will improve it by a factor of 4-8x. @xiphmont is already working on this. 2) Parallelize it using the same infrastructure used for tiling. This doesn't reduce CPU usage, but will increase the FPS a lot (especially with tiling) as new frames can't be started until the previous frame has the loop filters applied.

xuguangxin commented 5 years ago

FYI, sgrproj_stripe_filter() is one of LR (Loop Restoration) filter and not from deblocking filter. Another one is Wiener filter.

Thanks for the correction.

From a cursory look it seems that there are few tweaks that can be done on the inner functions, surely it looks like a target worth of optimization.

Yes, this function is the hotest function. https://github.com/xiph/rav1e/blob/e09612002b3247c2c53c81b9aa94b1ee6d61c989/src/lrf.rs#L185-L199 It will do a square sum on 3x3 block on a pixel (x,y). It's totally 9 multiples. If we can save the last 2 rows at somewhere. It will save 6 multiple for next pixel (x, y+1). I did not check details, but I think the same trick can use in the column too.

Yeah, it's pretty bad. There are two things we can do to massively improve this:

Take SIMD from dav1d. This will improve it by a factor of 4-8x. @xiphmont is already working on this.

Great! Waiting for the good news.

Parallelize it using the same infrastructure used for tiling. This doesn't reduce CPU usage, but will increase the FPS a lot (especially with tiling) as new frames can't be started until the previous frame has the loop filters applied.

Parallelize is a good area to improve. I did some comparison on Xeon E5 before. Seems rav1e's cpu usage is very low. This mean, we did not use too much thread in default mode.

codec	fps	cores (cpu usage/100)	fps / cores
rav1e	0.216	1.3	0.1661538462
x265-veryslow	2.4	8.02	0.2992518703
x265-medium	17.74	9.37	1.893276414
x265-ultrafast	110	24.69	4.455245038
x264-veryslow	23	18	1.277777778
x264-medium	76	12	6.333333333
x264-ultrafast	486	10	48.6

thanks

xiphmont commented 5 years ago

The loop restoration is, indeed, dog slow. The solving algorithm is simplistic, brute-force, and probably needlessly wasteful.
I don't want to spend much time optimizing the implementation when the implementation is likely to be replaced (that is, the solution I want is not just to do the math faster, but to do far less math to begin with).

ycho commented 5 years ago

The loop restoration is, indeed, dog slow. The solving algorithm is simplistic, brute-force, and probably needlessly wasteful. I don't want to spend much time optimizing the implementation when the implementation is likely to be replaced (that is, the solution I want is not just to do the math faster, but to do far less math to begin with).

Yes, I agree on @xiphmont 's view. Given research level code, we'd want to save efforts on writing speed optimizing code but rather invest that time on better algorithms for longer term uses.

ycho commented 5 years ago

It will do a square sum on 3x3 block on a pixel (x,y). It's totally 9 multiples. If we can save the last 2 rows at somewhere. It will save 6 multiple for next pixel (x, y+1). I did not check details, but I think the same trick can use in the column too.

In fact, box filters (i.e. 3x3 or 5x5 sum and square sum) are obtained by using the technique called 'Integral Image'. That enables not only reusing of intermediate row sum and square sums but also column wise as well.

xuguangxin commented 5 years ago

@xiphmont, @ycho . thanks for explaining. I want to spend some time to do local optimization. Except for the loop restoration, any area worth to look at?

thanks

ycho commented 5 years ago

Integral imaging: https://en.wikipedia.org/wiki/Summed-area_table#/media/File:Summed_area_table.png

Where to use integral imaging for box filtering in AV1 Loop Restoration filter : Sec 2.2, around eq (2), https://people.xiph.org/~yushin/tmp__/A%20switchable%20loop-restoration%20with%20side-information%20framework%20for%20the%20emerging%20AV1%20video%20codec%202017%20ICIP.pdf

xuguangxin commented 5 years ago

Hi @ycho , Thanks for the information. Since @xiphmont already working on loop restoration optimization. Any other areas I can help with? I am still ramp up the AV1, so it's better if the optimization can localize in one or two files or functions.

thanks

xiph / rav1e

Too many cpu usage for loop restoration filters. #1287