Open xuguangxin opened 5 years ago
From a cursory look, it seems there are a few tweaks that can be made to the inner functions; it certainly looks like a target worth optimizing.
FYI, sgrproj_stripe_filter() belongs to one of the two LR (Loop Restoration) filters, not to the deblocking filter. The other LR filter is the Wiener filter.
Yeah, it's pretty bad. There are two things we can do to massively improve this: 1) Take SIMD from dav1d. This will improve it by a factor of 4-8x. @xiphmont is already working on this. 2) Parallelize it using the same infrastructure used for tiling. This doesn't reduce CPU usage, but will increase the FPS a lot (especially with tiling) as new frames can't be started until the previous frame has the loop filters applied.
> FYI, sgrproj_stripe_filter() is one of LR (Loop Restoration) filter and not from deblocking filter. Another one is Wiener filter.

Thanks for the correction.
> From a cursory look it seems that there are few tweaks that can be done on the inner functions, surely it looks like a target worth of optimization.

Yes, this is the hottest function. https://github.com/xiph/rav1e/blob/e09612002b3247c2c53c81b9aa94b1ee6d61c989/src/lrf.rs#L185-L199 It computes a sum of squares over the 3x3 block around pixel (x, y), which takes 9 multiplications in total. If we save the last 2 rows somewhere, that saves 6 multiplications for the next pixel (x, y+1). I did not check the details, but I think the same trick can be used for the columns too.
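A rough sketch of the row-reuse idea described above, not the code in `lrf.rs`. The function names, the `Vec<Vec<u32>>` image layout, and the restriction to interior pixels are all assumptions made for illustration. Walking down one column, each step squares only the three pixels of the newly entered row and reuses the two row sums it already has.

```rust
/// Reference version: 9 multiplications per pixel.
/// Assumes (x, y) is an interior pixel so the 3x3 block fits in the image.
fn sq_sum_3x3_naive(img: &[Vec<u32>], x: usize, y: usize) -> u32 {
    let mut sum = 0;
    for row in &img[y - 1..=y + 1] {
        for &p in &row[x - 1..=x + 1] {
            sum += p * p;
        }
    }
    sum
}

/// Sums of squares of the 3x3 blocks centred on (x, 1), (x, 2), ...,
/// (x, h - 2), walking down column `x` and keeping the last two row sums.
/// Assumes 1 <= x <= width - 2.
fn sq_sums_3x3_column(img: &[Vec<u32>], x: usize) -> Vec<u32> {
    // Squares of the three horizontal neighbours in row `y`: 3 multiplications.
    let row_sq = |y: usize| -> u32 { img[y][x - 1..=x + 1].iter().map(|&p| p * p).sum() };

    let h = img.len();
    let mut out = Vec::with_capacity(h.saturating_sub(2));
    if h < 3 {
        return out;
    }
    let mut prev2 = row_sq(0);
    let mut prev1 = row_sq(1);
    for y in 2..h {
        let cur = row_sq(y); // the only new multiplications for this step
        out.push(prev2 + prev1 + cur); // block centred on (x, y - 1)
        prev2 = prev1;
        prev1 = cur;
    }
    out
}
```

For example, `sq_sums_3x3_column(&img, 1)` returns one value per interior row of column 1, each matching `sq_sum_3x3_naive` for the same pixel at a cost of 3 multiplications per step instead of 9.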
> Yeah, it's pretty bad. There are two things we can do to massively improve this:
> - Take SIMD from dav1d. This will improve it by a factor of 4-8x. @xiphmont is already working on this.

Great! Waiting for the good news.
> - Parallelize it using the same infrastructure used for tiling. This doesn't reduce CPU usage, but will increase the FPS a lot (especially with tiling) as new frames can't be started until the previous frame has the loop filters applied.

Parallelization is a good area to improve. I did some comparisons on a Xeon E5 before, and rav1e's CPU usage seems very low. This means we do not use many threads in the default mode.
| codec | fps | cores (CPU usage / 100) | fps / cores |
|---|---|---|---|
| rav1e | 0.216 | 1.3 | 0.1661538462 |
| x265-veryslow | 2.4 | 8.02 | 0.2992518703 |
| x265-medium | 17.74 | 9.37 | 1.893276414 |
| x265-ultrafast | 110 | 24.69 | 4.455245038 |
| x264-veryslow | 23 | 18 | 1.277777778 |
| x264-medium | 76 | 12 | 6.333333333 |
| x264-ultrafast | 486 | 10 | 48.6 |
thanks
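To make the parallelization idea concrete, here is a minimal sketch that filters horizontal stripes of a plane on separate threads. It is not rav1e's tiling infrastructure: `filter_stripes_parallel`, the flat `u16` plane layout, and the `filter_stripe` callback are hypothetical. It also assumes each stripe can be filtered independently, whereas a real loop-restoration filter also needs rows from neighbouring stripes, and a real encoder would likely use a thread pool rather than one thread per stripe.

```rust
use std::thread;

/// Applies `filter_stripe` to every horizontal stripe of `plane` in parallel.
/// `plane` is a row-major buffer with `width` pixels per row; each stripe
/// holds up to `stripe_rows` rows. Assumes `width * stripe_rows > 0`.
fn filter_stripes_parallel<F>(
    plane: &mut [u16],
    width: usize,
    stripe_rows: usize,
    filter_stripe: F,
) where
    F: Fn(&mut [u16], usize) + Sync,
{
    // Scoped threads may borrow `plane` and `filter_stripe` because the
    // scope joins every thread before it returns.
    thread::scope(|s| {
        // `chunks_mut` hands out disjoint mutable slices, one per stripe.
        for stripe in plane.chunks_mut(width * stripe_rows) {
            let f = &filter_stripe;
            s.spawn(move || f(stripe, width));
        }
    });
}
```

As noted in the thread, this kind of change would not reduce total CPU time; it only shortens the wall-clock time before the next frame can start.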
The loop restoration is, indeed, dog slow. The solving algorithm is simplistic, brute-force, and probably needlessly wasteful.
I don't want to spend much time optimizing the implementation when the implementation is likely to be replaced (that is, the solution I want is not just to do the math faster, but to do far less math to begin with).
> The loop restoration is, indeed, dog slow. The solving algorithm is simplistic, brute-force, and probably needlessly wasteful. I don't want to spend much time optimizing the implementation when the implementation is likely to be replaced (that is, the solution I want is not just to do the math faster, but to do far less math to begin with).

Yes, I agree with @xiphmont's view. Given that this is research-level code, we would rather not spend effort on writing speed-optimized code and instead invest that time in better algorithms for the longer term.
> It will do a square sum on 3x3 block on a pixel (x,y). It's totally 9 multiples. If we can save the last 2 rows at somewhere. It will save 6 multiple for next pixel (x, y+1). I did not check details, but I think the same trick can use in the column too.

In fact, box filters (i.e. 3x3 or 5x5 sums and sums of squares) are obtained using a technique called 'Integral Image'. It allows intermediate sums and sums of squares to be reused not only across rows but across columns as well.
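As an illustration of the integral image (summed-area table) technique, here is a small generic sketch. It is not the rav1e or libaom implementation; the function names and the `Vec<Vec<u64>>` layout are assumptions. Once the table is built, any box sum costs four lookups regardless of box size. For the sum of squares, build a second table from the squared pixel values.

```rust
/// Summed-area table with a zero border: `sat[y][x]` holds the sum of all
/// pixels in rows `0..y` and columns `0..x`, so the table is (h + 1) x (w + 1).
fn summed_area_table(img: &[Vec<u64>]) -> Vec<Vec<u64>> {
    let h = img.len();
    let w = if h > 0 { img[0].len() } else { 0 };
    let mut sat = vec![vec![0u64; w + 1]; h + 1];
    for y in 0..h {
        let mut row_sum = 0u64; // running sum of the current row
        for x in 0..w {
            row_sum += img[y][x];
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum;
        }
    }
    sat
}

/// Sum over rows `y0..y1` and columns `x0..x1` (end-exclusive): four lookups.
fn box_sum(sat: &[Vec<u64>], x0: usize, y0: usize, x1: usize, y1: usize) -> u64 {
    sat[y1][x1] + sat[y0][x0] - sat[y0][x1] - sat[y1][x0]
}

/// Squares every pixel, so a second table gives box sums of squares.
fn squared(img: &[Vec<u64>]) -> Vec<Vec<u64>> {
    img.iter()
        .map(|row| row.iter().map(|&p| p * p).collect())
        .collect()
}
```

With `sat = summed_area_table(&img)` and `sat_sq = summed_area_table(&squared(&img))`, the 3x3 sum and sum of squares centred on (x, y) are `box_sum(&sat, x - 1, y - 1, x + 2, y + 2)` and the same call on `sat_sq`.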
@xiphmont, @ycho, thanks for explaining. I want to spend some time on local optimizations. Apart from loop restoration, is there any other area worth looking at?
thanks
Integral imaging: https://en.wikipedia.org/wiki/Summed-area_table#/media/File:Summed_area_table.png
Where integral imaging is used for box filtering in the AV1 Loop Restoration filter: Sec. 2.2, around eq. (2), https://people.xiph.org/~yushin/tmp__/A%20switchable%20loop-restoration%20with%20side-information%20framework%20for%20the%20emerging%20AV1%20video%20codec%202017%20ICIP.pdf
Hi @ycho, thanks for the information. Since @xiphmont is already working on loop restoration optimization, are there any other areas I can help with? I am still ramping up on AV1, so it would be better if the optimization could be localized to one or two files or functions.
thanks
I just finished profiling rav1e. I used the following command: "rav1e 1080p.y4m -o test.ivf -l 10".
CPU usage for deblocking seems very high: sgrproj_stripe_filter used 24.92% CPU on my desktop, and the entire function is not SIMD optimized. Considering the current development phase, is it worth using some SIMD to speed it up?
thanks