xiph / flac

Free Lossless Audio Codec
https://xiph.org/flac/
GNU Free Documentation License v1.3

Enable better vectorization for generic convolution #692

Open heshpdx opened 5 months ago

heshpdx commented 5 months ago

flac is being considered for the next version of the SPEC CPU benchmark suite, currently dubbed CPUv8. As such, we are interested in the generic (non-intrinsic) code paths that can run on all current and future architectures. And as part of the intense scrutiny that benchmarks undergo, we saw a performance improvement opportunity in the generic code path in libFLAC/lpc.c and would like to share it with the community.

In the loop that computes the residual from the qlp coefficients, breaking the single dependence chain into two parallel sub-chains allows vector instructions to be interleaved during program execution. This provides a 2-4% performance uplift, as measured on modern ARM systems, due to increased instruction-level parallelism. Our two benchmark workloads encode podcasts using -7ep --replay-gain and -8p. We also tried breaking the chain into four sub-chains, but saw no additional gains.
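The shape of the transformation can be sketched as follows. This is a minimal illustration, not the actual libFLAC/lpc.c code: the function names, the pointer convention (`data` pointing just past the history samples), and the accumulator widths are assumptions for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* One-sum baseline: every multiply-accumulate extends a single
 * dependence chain, so the additions must retire serially. */
static int32_t residual_one_sum(const int32_t *data, const int32_t *qlp,
                                int order, int shift, int32_t sample)
{
    int64_t sum = 0;
    for (int j = 0; j < order; j++)
        sum += (int64_t)qlp[j] * data[-j - 1];
    return sample - (int32_t)(sum >> shift);
}

/* Two-sum variant: even and odd taps accumulate into independent
 * sub-chains, which the compiler and CPU can interleave. The final
 * result is bit-identical because integer addition is associative. */
static int32_t residual_two_sum(const int32_t *data, const int32_t *qlp,
                                int order, int shift, int32_t sample)
{
    int64_t sum0 = 0, sum1 = 0;
    int j;
    for (j = 0; j + 1 < order; j += 2) {
        sum0 += (int64_t)qlp[j]     * data[-j - 1];
        sum1 += (int64_t)qlp[j + 1] * data[-j - 2];
    }
    if (j < order) /* odd order: fold in the leftover tap */
        sum0 += (int64_t)qlp[j] * data[-j - 1];
    return sample - (int32_t)((sum0 + sum1) >> shift);
}
```

Because the two accumulators carry no dependence on each other, a core with enough execution units can issue both multiply-accumulate streams in parallel, which is where the uplift comes from.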

ktmf01 commented 5 months ago

Thanks for providing this information and sharing this optimization. I do have some questions though.

You mention this change is an improvement for "modern ARM systems". Did you do testing on other systems and architectures too? Can you provide some numbers? I guess the intrinsics parts are stripped for the benchmark, so performance of this bit of code is important on x86-64 too?

Also, can you explain how you tested this? By default FLAC is configured to compile with the GCC options -fassociative-math -fno-signed-zeros -fno-trapping-math -freciprocal-math which should already make the compiler break the dependency chains you mention. Did you test compiles with these options on or off?

heshpdx commented 5 months ago

Sure thing. Since we are running within the SPEC CPU harness, all files are built with the same compiler switches (that is one of the requirements for the benchmark). We used gcc-14.0.1-9c7cf5d and our baseline switches are -g -O3 -flto=32 -funroll-loops -ffast-math --param max-completely-peeled-insns=600.

Here are the measured runtimes in seconds. one-sum is the original baseline, two-sum is with the new code. You are right, intrinsics are stripped so we are comparing apples to apples. It does help x86 as well.

| machine / vectors | one-sum (s) | two-sum (s) | delta |
|---|---|---|---|
| AMD Genoa / AVX-128 | 97.8 | 96.0 | 1.8% |
| AmpereOne / NEON-128 | 141 | 134 | 5.2% |
| Ampere Altra / NEON-128 | 252 | 247 | 2.0% |

After seeing your comment above, we rebuilt with the FLAC default compiler options added, and the results were the same. Those switches don't benefit us here: -fassociative-math could help but it only applies to floating-point operations and this function uses integer type. We believe that operation reassociation is on by default for integer types, at least in gcc.
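The reason no flag is needed for the integer case can be shown in a few lines: regrouping an integer sum cannot change its value, whereas for floating point it can, which is why GCC gates that regrouping behind -fassociative-math. A minimal illustration (hypothetical function names, not from libFLAC):

```c
#include <stdint.h>
#include <stddef.h>

/* Single dependence chain: (((v[0] + v[1]) + v[2]) + ...) */
int64_t sum_one_chain(const int32_t *v, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two independent chains, combined once at the end. Integer addition
 * is associative, so the result is bit-identical to the single chain;
 * a compiler is therefore free to make this regrouping itself. */
int64_t sum_two_chains(const int32_t *v, size_t n)
{
    int64_t a = 0, b = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < n) /* odd length: fold in the last element */
        a += v[i];
    return a + b;
}
```

With floats, the same regrouping changes rounding, so the compiler must leave the chain alone unless told otherwise.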

ktmf01 commented 5 months ago

> After seeing your comment above, we rebuilt with the FLAC default compiler options added, and the results were the same.

Just to be clear, you added those options to the options you mentioned above?

The thing is, I've tried GCC's LTO in the past, and without exception it produced slower builds. So perhaps this optimization only works with LTO, because I am seeing no gains at all. My best guess is that without LTO, these functions are very well optimized, and with LTO, they are less so (because it results in quite a large binary), but your change drives the compiler to optimize a little more.

heshpdx commented 5 months ago

No, we replaced our flags with your flags. I confirmed that the gains are not coming from LTO by removing -flto and rerunning; the asm produced is isolated to one file. Which machine are you testing on? My 7 year old x86 desktop did not show the gains I listed above.

ktmf01 commented 5 months ago

I've tested with an Intel Xeon E-2224G and an Intel Core i5 7200U. Results for the latter are graphed here.

No difference in speed, at the cost of a bigger binary. Compiled with GCC 13.2.0.

Just to be sure, are you using 16-bit, 24-bit input or something else entirely?

heshpdx commented 5 months ago

That's a cool graph! Can you explain the axes? The title says "CPU-time vs compression", so I thought that means "Y-axis vs X-axis", but the axes are labeled differently (Compression is on Y-axis, and CPU-time is not plotted but CPU-usage is?). Is this from one run or twelve runs? If you share your input and cmdlines, I can work to produce the same data on my side and send it over.

It's good that you reproduced my Intel results :-) The desktop machine I mentioned above an i7-8809G. The reason behind the lack of improvement is that the cores in older x86 machines don't have as many execution units as the newer ones. Hence they cannot take advantage of the increased instruction level parallelism that we are exposing with the two-sum patch.

My inputs are 32-bit WAV files. If it helps, I have placed them on a dropbox here. I have full rights to the audio content, I produced these myself, and I am giving them away freely for any use.

heshpdx commented 4 months ago

Hi @ktmf01! I took your changes from !700 and !702, applied them on top of the changes here, and observed a further performance gain on my dev machine (Ampere Altra). I ran a couple of times to make sure. My -8Vp cmdline showed +1.7% and the -7Vep cmdline showed a perhaps negligible +0.1%. We're going to take all of these patches and put them in the CPUv8 development mainline to exercise them over a wide variety of compilers and systems. Thanks.

heshpdx commented 3 months ago

I want to share that the SPEC CPU development committee has been testing with this edit for a month, without any issues across a variety of hardware and compilers (AMD AOCC, Intel ICC, NVHPC, IBM OpenXL, MSVC, LLVM, GCC); all builds are stable and verify with the same results. And we see performance gains on some platforms. What are the next steps on your side? Anything else I can help with?

ktmf01 commented 3 months ago

To be honest, I'm not yet convinced. I think that a gain of 1.7% with non-standard compiler options, non-standard compression settings and somewhat uncommon hardware is not something worth integrating.

Also, I found a system where this patch results in slower execution. I tested on a BCM2835 (Raspberry Pi Zero W) with options -8ep and standard compiler settings, and I got a 1.2% decrease in execution speed.