xiph / flac

Free Lossless Audio Codec
https://xiph.org/flac/
GNU Free Documentation License v1.3

Improve calculation of when to use wide residual computation #700

Closed ktmf01 closed 2 months ago

ktmf01 commented 2 months ago

This change should make 24-bit encoding faster, because the limit_residual variant of residual computation is used less often.
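To illustrate why the choice of residual routine depends on the input bit depth, here is a minimal sketch of a worst-case bound. It is not the code from this patch: the names `residual_path`, `bits_for_magnitude` and `choose_residual_path` are hypothetical, and the actual libFLAC check may be computed differently. The sketch assumes `1 <= bps <= 32` and coefficient magnitudes small enough that the 64-bit bound itself cannot overflow.

```c
#include <stdint.h>

/* Hypothetical names: none of these identifiers come from libFLAC. */
enum residual_path { PATH_32BIT, PATH_WIDE, PATH_LIMIT_RESIDUAL };

/* Bits needed to store any value in [-max_abs, max_abs] as a signed
 * integer. Slightly conservative: -2^k is treated like +2^k. */
unsigned bits_for_magnitude(uint64_t max_abs)
{
    unsigned bits = 1; /* sign bit */
    while (max_abs != 0) {
        bits++;
        max_abs >>= 1;
    }
    return bits;
}

/* Pick a residual routine from a worst-case bound on the prediction.
 * Assumes 1 <= bps <= 32 and that the bound fits in 64 bits. */
enum residual_path choose_residual_path(unsigned bps,
                                        const int32_t *qlp_coeff,
                                        unsigned order,
                                        unsigned shift)
{
    uint64_t abs_coeff_sum = 0;
    for (unsigned i = 0; i < order; i++) {
        int64_t c = qlp_coeff[i];
        abs_coeff_sum += (uint64_t)(c < 0 ? -c : c);
    }

    /* Largest sample magnitude for a signed bps-bit sample. */
    uint64_t max_sample = (uint64_t)1 << (bps - 1);
    /* Largest magnitude of sum(coeff[i] * sample[i]) before the shift. */
    uint64_t max_sum = abs_coeff_sum * max_sample;
    /* Round up: an arithmetic right shift of a negative sum rounds away from zero. */
    uint64_t max_prediction = (max_sum + (((uint64_t)1 << shift) - 1)) >> shift;
    /* residual = sample - prediction, so the magnitudes add. */
    uint64_t max_residual = max_sample + max_prediction;

    if (bits_for_magnitude(max_residual) > 32)
        return PATH_LIMIT_RESIDUAL; /* the residual itself may not fit in 32 bits */
    if (bits_for_magnitude(max_sum) > 32)
        return PATH_WIDE;           /* only the accumulator needs 64 bits */
    return PATH_32BIT;
}
```

With coefficients `{8192, -8192}` and a shift of 12, this sketch picks the 32-bit path for 16-bit samples but the wide path for 24-bit samples. The tighter the bound, the less often the slowest path is selected.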

ktmf01 commented 2 months ago

Results for an Intel Xeon E-2224G. The 16-bit input is this set of samples concatenated into a single file. The 24-bit input is created with sox, upsampling to 96000 Hz in 24-bit. The upsampling process 'fills' the 8 extra bits.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./mulbits -5 -j2 -c ../Rarewares-16bit.flac` | 1.210 ± 0.018 | 1.189 | 1.242 | 1.00 |
| `./current -5 -j2 -c ../Rarewares-16bit.flac` | 1.214 ± 0.025 | 1.189 | 1.253 | 1.00 ± 0.03 |
| `./mulbits -8 -j2 -c ../Rarewares-16bit.flac` | 1.988 ± 0.055 | 1.933 | 2.089 | 1.64 ± 0.05 |
| `./current -8 -j2 -c ../Rarewares-16bit.flac` | 2.008 ± 0.047 | 1.948 | 2.068 | 1.66 ± 0.05 |
| `./mulbits -8p -j2 -c ../Rarewares-16bit.flac` | 6.319 ± 0.198 | 6.111 | 6.786 | 5.22 ± 0.18 |
| `./current -8p -j2 -c ../Rarewares-16bit.flac` | 6.306 ± 0.199 | 6.171 | 6.851 | 5.21 ± 0.18 |
| `./mulbits -5 -j2 -c ../Rarewares-24bit.flac` | 2.638 ± 0.062 | 2.567 | 2.772 | 2.18 ± 0.06 |
| `./current -5 -j2 -c ../Rarewares-24bit.flac` | 2.917 ± 0.069 | 2.828 | 3.040 | 2.41 ± 0.07 |
| `./mulbits -8 -j2 -c ../Rarewares-24bit.flac` | 10.796 ± 0.199 | 10.622 | 11.180 | 8.92 ± 0.21 |
| `./current -8 -j2 -c ../Rarewares-24bit.flac` | 10.860 ± 0.135 | 10.728 | 11.171 | 8.98 ± 0.18 |
| `./mulbits -8p -j2 -c ../Rarewares-24bit.flac` | 76.075 ± 1.140 | 75.110 | 78.038 | 62.88 ± 1.34 |
| `./current -8p -j2 -c ../Rarewares-24bit.flac` | 83.079 ± 1.171 | 81.800 | 85.769 | 68.67 ± 1.43 |

The largest difference is for preset 5 with 24-bit input: 11% faster (2.638 s versus 2.917 s). For 8p the difference is also large: 9% faster.

I'm somewhat baffled that the difference is so large for preset 5, and I am not sure why that is. A few weeks ago I was under the impression that the limit_residual functions were seldom used, and only at higher presets (because of the higher orders). So I reran this benchmark with preset -5p, and indeed the difference is huge:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./mulbits -5p -j2 -c ../Rarewares-24bit.flac` | 5.941 ± 0.060 | 5.883 | 6.050 | 1.00 |
| `./current -5p -j2 -c ../Rarewares-24bit.flac` | 8.780 ± 0.036 | 8.737 | 8.836 | 1.48 ± 0.02 |

ktmf01 commented 2 months ago

Results of running gprof against the -5p preset mentioned above (24-bit input):

Without this patch (running over the input thrice)

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 71.02     36.51    36.51  3122034     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
[...]
  0.56     49.75     0.29   218268     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
[...]
  0.21     50.87     0.11    25962     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2

With this patch

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 42.65     14.50    14.50  1274172     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
 11.15     18.29     3.79  1376313     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
[...]
  3.03     27.57     1.03   715779     0.00     0.00  FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2

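So, without the patch, the wide_intrin_avx2 variant accounts for only about 1% of the residual computation calls listed here (25962 of 3366264). With the patch it accounts for about 41% (1376313 of 3366264) and is called more often than limit_residual (1274172 calls). The number of calls to the non-wide variant more than triples (218268 to 715779).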
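The call shares can be rechecked from the `calls` column of the two gprof listings. The helper below is a hypothetical illustration (`call_share_percent` is not part of FLAC or gprof). It assumes that the three functions shown are the only residual computation routines in the profile, which the elided `[...]` rows do not let us confirm.

```c
/* Hypothetical helper: share of `calls` within `total`, in percent. */
double call_share_percent(unsigned long calls, unsigned long total)
{
    if (total == 0)
        return 0.0; /* avoid dividing by zero on an empty profile */
    return 100.0 * (double)calls / (double)total;
}

/* Sum of the three call counts from one gprof listing. */
unsigned long total_calls(unsigned long limit_residual,
                          unsigned long wide,
                          unsigned long non_wide)
{
    return limit_residual + wide + non_wide;
}
```

For example, `call_share_percent(25962, total_calls(3122034, 25962, 218268))` gives the share of the wide variant without the patch, and `call_share_percent(1376313, total_calls(1274172, 1376313, 715779))` gives it with the patch. Both listings sum to the same total of 3366264 calls.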