Improve calculation of when to use wide residual computation

ktmf01 commented 2 months ago

This change should make 24-bit encoding faster, because the limit_residual variant of residual computation is used less often.
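To illustrate the kind of decision involved, here is a minimal sketch in C. It is not the code from this patch: the names `residual_variant`, `signed_bits` and `choose_variant` are hypothetical, and the selection rule is an assumption. The rule bounds the worst-case prediction and residual from the actual quantized coefficients, then picks the cheapest routine whose integer widths cannot overflow. It also assumes coefficients that fit in 16 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical labels for the three kinds of residual routine:
 * plain 32-bit, wide (64-bit accumulator), and limit_residual. */
typedef enum { RESIDUAL_32BIT, RESIDUAL_WIDE, RESIDUAL_LIMIT } residual_variant;

/* Bits needed to hold v as a two's-complement signed integer. */
static unsigned signed_bits(int64_t v)
{
    uint64_t mag = v < 0 ? (uint64_t)(-(v + 1)) : (uint64_t)v;
    unsigned bits = 1; /* sign bit */
    while (mag != 0) {
        bits++;
        mag >>= 1;
    }
    return bits;
}

/* Assumed selection rule (not taken from the patch): bound the prediction
 * and the residual using the actual coefficients. */
static residual_variant choose_variant(unsigned bps, const int32_t *qlp_coeff,
                                       unsigned order, unsigned shift)
{
    int64_t abs_sum = 0, max_sample, max_pred, max_residual;
    unsigned i;

    assert(bps >= 1 && bps <= 32 && order <= 32 && shift < 32);

    for (i = 0; i < order; i++)
        abs_sum += llabs((long long)qlp_coeff[i]);
    /* Assumption: 16-bit coefficients, so the product below fits in 64 bits. */
    assert(abs_sum <= ((int64_t)1 << 20));

    max_sample = (int64_t)1 << (bps - 1); /* largest sample magnitude */
    max_pred = abs_sum * max_sample;      /* worst case before the shift */
    /* Worst-case residual: sample and shifted prediction of opposite sign;
     * the shifted prediction is rounded up to stay conservative. */
    max_residual = max_sample
                 + ((max_pred + (((int64_t)1 << shift) - 1)) >> shift);

    if (signed_bits(max_residual) > 32)
        return RESIDUAL_LIMIT; /* residual itself may not fit in 32 bits */
    if (signed_bits(max_pred) > 32)
        return RESIDUAL_WIDE;  /* only the accumulator needs 64 bits */
    return RESIDUAL_32BIT;
}
```

Under this assumed rule, a 24-bit subframe with coefficients `{16000, -12000, 8000}` and a shift of 13 overflows a 32-bit accumulator but still yields a residual that fits in 32 bits, so the wide routine suffices and the slower limit_residual routine is not needed.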

ktmf01 commented 2 months ago

Results for an Intel Xeon E-2224G. The 16-bit input is this set of samples concatenated into a single file. The 24-bit input was created with sox by upsampling to 96000 Hz at 24 bits; the upsampling process 'fills' the 8 extra bits.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./mulbits -5 -j2 -c ../Rarewares-16bit.flac` | 1.210 ± 0.018 | 1.189 | 1.242 | 1.00 |
| `./current -5 -j2 -c ../Rarewares-16bit.flac` | 1.214 ± 0.025 | 1.189 | 1.253 | 1.00 ± 0.03 |
| `./mulbits -8 -j2 -c ../Rarewares-16bit.flac` | 1.988 ± 0.055 | 1.933 | 2.089 | 1.64 ± 0.05 |
| `./current -8 -j2 -c ../Rarewares-16bit.flac` | 2.008 ± 0.047 | 1.948 | 2.068 | 1.66 ± 0.05 |
| `./mulbits -8p -j2 -c ../Rarewares-16bit.flac` | 6.319 ± 0.198 | 6.111 | 6.786 | 5.22 ± 0.18 |
| `./current -8p -j2 -c ../Rarewares-16bit.flac` | 6.306 ± 0.199 | 6.171 | 6.851 | 5.21 ± 0.18 |
| `./mulbits -5 -j2 -c ../Rarewares-24bit.flac` | 2.638 ± 0.062 | 2.567 | 2.772 | 2.18 ± 0.06 |
| `./current -5 -j2 -c ../Rarewares-24bit.flac` | 2.917 ± 0.069 | 2.828 | 3.040 | 2.41 ± 0.07 |
| `./mulbits -8 -j2 -c ../Rarewares-24bit.flac` | 10.796 ± 0.199 | 10.622 | 11.180 | 8.92 ± 0.21 |
| `./current -8 -j2 -c ../Rarewares-24bit.flac` | 10.860 ± 0.135 | 10.728 | 11.171 | 8.98 ± 0.18 |
| `./mulbits -8p -j2 -c ../Rarewares-24bit.flac` | 76.075 ± 1.140 | 75.110 | 78.038 | 62.88 ± 1.34 |
| `./current -8p -j2 -c ../Rarewares-24bit.flac` | 83.079 ± 1.171 | 81.800 | 85.769 | 68.67 ± 1.43 |

The largest difference is for preset 5 with 24-bit input: 11% faster (2.638 s versus 2.917 s). For 8p the difference is also large: 9% faster.

I'm somewhat baffled that the difference is so large for preset 5, and I am not sure why that is. A few weeks ago I was under the impression that the limit residual functions were seldom used, and only at higher presets (because of the higher orders). So I reran this benchmark with preset -5p, and indeed the difference is huge:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./mulbits -5p -j2 -c ../Rarewares-24bit.flac` | 5.941 ± 0.060 | 5.883 | 6.050 | 1.00 |
| `./current -5p -j2 -c ../Rarewares-24bit.flac` | 8.780 ± 0.036 | 8.737 | 8.836 | 1.48 ± 0.02 |

ktmf01 commented 2 months ago

Results of running gprof against the -5p preset mentioned above

Without this patch (running over the input thrice)

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 71.02     36.51    36.51  3122034     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
 [...]
  0.56     49.75     0.29   218268     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
[...]
  0.21     50.87     0.11    25962     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2

With this patch

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 42.65     14.50    14.50  1274172     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
 11.15     18.29     3.79  1376313     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
[...]
  3.03     27.57     1.03   715779     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2

So, without the patch, the wide_intrin_avx2 variant accounts for only about 1% of the residual computation calls listed (25962 of 3366264), but with the patch it accounts for about 41% (1376313 of 3366264), more than the limit_residual variant. The number of calls to the non-wide variant more than triples.

xiph / flac

Improve calculation of when to use wide residual computation #700