Difference in encoding CELT media between AMD and Intel CPUs in floating-point mode

dcrooks-ak commented 2 years ago

Hello,

We recently made note of an issue where encoding CELT media in floating-point mode was giving slightly different results on AMD and Intel CPUs. The difference arose due to the use of _mm_rcp_ps and _mm_rsqrt_ps in op_pvq_search_sse2, see: https://github.com/xiph/opus/blob/master/celt/x86/vq_sse2.c#L108 https://github.com/xiph/opus/blob/master/celt/x86/vq_sse2.c#L173

These intrinsics/instructions are defined to be approximations, and permitted to have slightly different implementations across CPUs. In order to better produce similar results for offline-generated media across different CPU architectures, it would be good to replace these with _mm_div_ps and _mm_sqrt_ps which have stricter definitions on their behaviour.

The following should be logically similar to the lines above, and harmonize the results across architectures:

For vq_sse2.c#L108: rcp4 = _mm_mul_ps(_mm_set_ps1((float)(K+.8)), _mm_div_ps(_mm_set_ps1(1.f), sums)); And for vq_sse2.c#L173: y4 = _mm_div_ps(_mm_set_ps1(1.f), _mm_sqrt_ps(y4));

jmvalin commented 2 years ago

Duplicate of #259

dcrooks-ak commented 2 years ago

Ah, sorry, I skimmed the list of Open issues but not the Closed ones, and did not see that.

Contrary to the sentiment discussed in #259, however, I think it would still be desirable to at least have the same generated code generate the same results across the same CPU architecture, even if the results may still change across compilations, or on different compilers, etc. At least then the same generated executable/lib will still generate similar results across, say, a build farm with heterogeneous CPU vendors, without having to flip over to the fixed-point encoder. Therefore, I'm a bit reluctant to close up the issue personally, but if you want to close it up as WNF, then I won't sweat it.

Thanks for the time, anyway.

jmvalin commented 2 years ago

You already cannot achieve that due to run-time CPU detection being to select different code depending on cpuid. That being said, with the current version it may be possible to get rid of the specific use of reciprocal approximations with --disable-intrinsics.

dcrooks-ak commented 1 year ago

Hi,

Since I wrote this up, I would actually like to raise the severity of this issue a bit.

I've talked with a lot of other developers, and can definitively say that many folks are hitting this issue in some way or another, anywhere from small game teams like Haemimont Games all the way to many of the larger game teams at Activision, (e.g. basically all of the Call of Duty projects) and everything inbetween, including many game developers using middleware such as Fmod, (see https://www.fmod.com/docs/2.01/studio/welcome-to-fmod-studio-revision-history.html scroll down to "build 128256") Wwise (see https://www.audiokinetic.com/en/library/edge/?source=SDK&id=whatsnew_2022_1.html scroll down to "WG-60959") and/or Unreal to handle audio asset encoding.

Even if determinism across builds cannot be guaranteed -- again, understandable -- many game teams at least require the same generated code at one point in time to generate the same result. Given that many teams rely on, for example, content-addressable-storage caches where assets are keyed based on their input (perhaps even using the generated code for the asset encoder as a part of the hashing for the CAS key) or check for binary diffs in encoded media before submitting results to perforce (which may result in multiple members of the game team constantly having different data all the time) then this is going to continue to be a problem as adoption of Opus in games increases, and as AMD CPUs start to be used in more workstations and data centers.

(I'll add that I'm not terribly convinced on the run-time CPUID detection being much of an issue here, because across the high-end x86 space, most of the instruction set support is similar enough to not be a thing -- AVX2 support vs AVX512 support being the only expected differentiation there, and even then, codepaths across those two tiers can be written to generate similar results)

Arguably, the only thing mitigating this from becoming more of a problem over time is the fact that many folks have already stumbled into this exact issue, and resolved it all the same way, by diverging these two lines of code. So, it would be very appreciated to have this fixed canonically 🙏

rygorous commented 1 year ago

Just to add to the pile, I work at Epic Games and we also root-caused Opus' use of RCPSS and RSQRTSS as a source of content build divergences between Intel and AMD machines.

This topic came up recently on a private game developer forum where we soon noticed that at least 3 game audio libraries (Audiokinetic Wwise, FMOD, Miles Sound System) had run into and debugged this exact problem independently, and talking about it some devs from Haemimont saw the thread, had a light bulb moment, and confirmed that they were affected by a Opus encoding mystery determinism bug too that they had not been able to hunt down. When they checked, it turned out to be this.

eberan commented 1 year ago

Adding a voice for Activision studios here. We hit this issue last year as others did, with build machines' output differing between AMD and Intel.

jmvalin commented 1 year ago

If you need determinism, then I think you should use a fixed-point build. Disabling reciprocal approximations will not fully solve the problem and you'll be hit by even more subtle issues later on. Basically, even if you are making a single build, as soon as you have run-time CPU detection enabled, you are in effect having multiple builds with no guarantee that the results will be the same. And maybe they will actually be the same today, but a new compiler (or new AVX2 code) will break it without warning.

xiph / opus

Difference in encoding CELT media between AMD and Intel CPUs in floating-point mode #260