Closed ktmf01 closed 2 months ago
Results for an Intel Xeon E-2224G. The 16-bit input is this set of samples concatenated into a single file. The 24-bit input is created with sox, upsampling to 96000 Hz in 24-bit. The upsampling process 'fills' the 8 extra bits.
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `./mulbits -5 -j2 -c ../Rarewares-16bit.flac` | 1.210 ± 0.018 | 1.189 | 1.242 | 1.00 |
| `./current -5 -j2 -c ../Rarewares-16bit.flac` | 1.214 ± 0.025 | 1.189 | 1.253 | 1.00 ± 0.03 |
| `./mulbits -8 -j2 -c ../Rarewares-16bit.flac` | 1.988 ± 0.055 | 1.933 | 2.089 | 1.64 ± 0.05 |
| `./current -8 -j2 -c ../Rarewares-16bit.flac` | 2.008 ± 0.047 | 1.948 | 2.068 | 1.66 ± 0.05 |
| `./mulbits -8p -j2 -c ../Rarewares-16bit.flac` | 6.319 ± 0.198 | 6.111 | 6.786 | 5.22 ± 0.18 |
| `./current -8p -j2 -c ../Rarewares-16bit.flac` | 6.306 ± 0.199 | 6.171 | 6.851 | 5.21 ± 0.18 |
| `./mulbits -5 -j2 -c ../Rarewares-24bit.flac` | 2.638 ± 0.062 | 2.567 | 2.772 | 2.18 ± 0.06 |
| `./current -5 -j2 -c ../Rarewares-24bit.flac` | 2.917 ± 0.069 | 2.828 | 3.040 | 2.41 ± 0.07 |
| `./mulbits -8 -j2 -c ../Rarewares-24bit.flac` | 10.796 ± 0.199 | 10.622 | 11.180 | 8.92 ± 0.21 |
| `./current -8 -j2 -c ../Rarewares-24bit.flac` | 10.860 ± 0.135 | 10.728 | 11.171 | 8.98 ± 0.18 |
| `./mulbits -8p -j2 -c ../Rarewares-24bit.flac` | 76.075 ± 1.140 | 75.110 | 78.038 | 62.88 ± 1.34 |
| `./current -8p -j2 -c ../Rarewares-24bit.flac` | 83.079 ± 1.171 | 81.800 | 85.769 | 68.67 ± 1.43 |
The largest difference is for preset 5 with 24-bit input: 11% faster. For 8p the difference is also large: 9% faster.
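For reference, the percentages can be recomputed from the mean times in the table. The snippet below is only a sketch: it uses the 24-bit means as reported, defines "faster" as the ratio of the `current` mean to the `mulbits` mean minus one, and the helper name `speedup_percent` is made up for this illustration.

```python
# Mean wall-clock times in seconds for the 24-bit input, copied from the table.
# Each entry is (mulbits, current), i.e. (with the patch, without the patch).
means_24bit = {
    "-5": (2.638, 2.917),
    "-8": (10.796, 10.860),
    "-8p": (76.075, 83.079),
}


def speedup_percent(patched: float, baseline: float) -> float:
    """How much faster the patched build is, as a percentage of its own time."""
    return (baseline / patched - 1.0) * 100.0


for preset, (mulbits, current) in means_24bit.items():
    print(f"preset {preset}: {speedup_percent(mulbits, current):.1f}% faster")
```

With this definition, preset 5 comes out at roughly 10.6% and 8p at roughly 9.2%, while preset 8 stays well within the measurement noise.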
I'm somewhat baffled that the difference is so large for preset 5; I am not sure why that is. A few weeks ago I was under the impression that the limit residual functions were seldom used, and only at higher presets (because of the higher orders). So I reran this benchmark with preset -5p, and indeed the difference is huge:
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `./mulbits -5p -j2 -c ../Rarewares-24bit.flac` | 5.941 ± 0.060 | 5.883 | 6.050 | 1.00 |
| `./current -5p -j2 -c ../Rarewares-24bit.flac` | 8.780 ± 0.036 | 8.737 | 8.836 | 1.48 ± 0.02 |
Results of running gprof against the -5p preset mentioned above.
Without this patch (running over the input thrice)
time seconds seconds calls s/call s/call name
71.02 36.51 36.51 3122034 0.00 0.00 FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
[...]
0.56 49.75 0.29 218268 0.00 0.00 FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
[...]
0.21 50.87 0.11 25962 0.00 0.00 FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
With this patch
time seconds seconds calls s/call s/call name
42.65 14.50 14.50 1274172 0.00 0.00 FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
11.15 18.29 3.79 1376313 0.00 0.00 FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
[...]
3.03 27.57 1.03 715779 0.00 0.00 FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
So, without the patch, the wide_intrin_avx2 variant accounts for only about 1% of the calls to these three functions, but with the patch it accounts for about 41%. The number of calls to the non-wide variant more than triples, and the share of calls going to limit_residual drops from about 93% to about 38%.
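The call distribution can be checked directly from the `calls` column of the two gprof listings. This is a minimal sketch that assumes the elided `[...]` rows contain no other residual-computation functions; the three counts add up to the same total in both runs, which is consistent with that assumption. The helper name `call_shares` is hypothetical.

```python
# Call counts copied from the gprof flat profiles (input processed three times).
calls = {
    "without patch": {
        "limit_residual": 3122034,
        "intrin_avx2": 218268,
        "wide_intrin_avx2": 25962,
    },
    "with patch": {
        "limit_residual": 1274172,
        "intrin_avx2": 715779,
        "wide_intrin_avx2": 1376313,
    },
}


def call_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each function's fraction of the total number of calls."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


for run, counts in calls.items():
    print(f"{run} (total calls: {sum(counts.values())})")
    for name, share in call_shares(counts).items():
        print(f"  {name}: {share:.1%}")
```

Because the totals are equal, the patch appears to only redistribute calls between the three variants rather than change how many subframe evaluations happen.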
This change should make 24-bit encoding faster, because the `limit_residual` variant of residual computation is used less often.