xemu-project / xemu

Original Xbox Emulator for Windows, macOS, and Linux (Active Development)
https://xemu.app

Laggy FMVs #573

Open RageXbox opened 2 years ago

RageXbox commented 2 years ago

Bug Description

In most games with FMVs, the video lags and the FMV audio lags too, even on very powerful hardware. This causes the Halo 2 epilogue to skip.

Expected Behavior

FMVs should not lag and audio should not lag at all.

xemu Version

v0.6.2-9-g69ceec4446

System Information

Windows 10 (64-bit) (Intel(R) Xeon(R) CPU E5-2678 v3 @2.50 GHz) (NVIDIA Quadro P5000) (NVIDIA 472.47)

Additional Context

https://user-images.githubusercontent.com/76413188/143661249-f11a1cb2-ee12-4170-8e87-06770a21639a.mp4

Blackbird88 commented 2 years ago

In my experience this affects all games using the XMV video format. Games like GTAIII/VC, which use Bink Video (.BIK), are fine.

icculus commented 2 years ago

Just adding some notes here, because I'm seeing this too.

My assumption is that XMV decoding is just a software library linked into the game, so complex/higher-res videos are CPU-bound and thus at the mercy of qemu's efficiency.

(unless, I suppose, the texture upload performance can be improved, but I haven't dug into that at all.)

As a test, I ran xemu under Linux's "perf record" tool. I used "Castlevania - Curse of Darkness", since it has a really long XMV near startup that doesn't keep up, and just raced through the menus to start the game without a save created...then I just let perf collect samples while the video didn't keep up for a minute or two.

These are the biggest CPU hotspots:

+   11.99%     0.00%  xemu          [unknown]                [.] 0x0000000183e5f000
+   10.34%     0.22%  xemu          [unknown]                [.] 0000000000000000
+    7.58%     0.47%  xemu          libc.so.6                [.] clock_gettime@@GLIBC_2.17
+    4.43%     0.00%  xemu          [vdso]                   [.] 0x00007ffd3c1ab6e8
+    4.40%     4.40%  xemu          [vdso]                   [.] 0x00000000000006e5
+    3.66%     3.50%  xemu          xemu                     [.] helper_cvtps2pi
     3.51%     3.38%  xemu          xemu                     [.] helper_psrad_mmx
     2.92%     2.79%  xemu          xemu                     [.] float32_add
+    2.89%     2.71%  xemu          xemu                     [.] float32_mul
     2.83%     2.81%  xemu          xemu                     [.] helper_packuswb_mmx
+    2.79%     1.43%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0ee01
     2.41%     2.28%  xemu          xemu                     [.] helper_punpcklwd_mmx
+    2.38%     0.01%  xemu          [unknown]                [k] 0xffffffffb160007c
+    2.34%     0.00%  xemu          [unknown]                [.] 0x0000000000000190
+    2.33%     0.00%  xemu          [unknown]                [.] 0x70632d363833692d
+    2.33%     0.00%  xemu          [unknown]                [.] 0x00007ff0f40c7ab0
+    2.29%     2.25%  xemu          xemu                     [.] cpu_exec
+    2.27%     0.01%  xemu          [unknown]                [k] 0xffffffffb142c331
     2.11%     2.03%  xemu          xemu                     [.] helper_psllq_mmx
+    1.75%     1.62%  xemu          xemu                     [.] helper_lookup_tb_ptr
+    1.74%     0.00%  xemu          xemu                     [.] 0x00005618e0bf77f0
+    1.73%     1.65%  xemu          xemu                     [.] soft_f32_mul
+    1.72%     1.48%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0edf3
+    1.55%     0.07%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0edf6
     1.50%     1.37%  xemu          xemu                     [.] helper_packssdw_mmx
     1.47%     1.42%  xemu          xemu                     [.] helper_pslld_mmx
+    1.45%     0.01%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0ee12
+    1.44%     0.00%  xemu          [unknown]                [.] 0x00000000000000e8
+    1.44%     1.44%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0ee0e
+    1.44%     0.00%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0ee05
+    1.44%     0.00%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0edd0
+    1.43%     1.43%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0edce
+    1.43%     1.42%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0ede7
+    1.43%     0.00%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0eded
+    1.43%     0.00%  xemu          [unknown]                [.] 0x00007ff0f40c7a00
+    1.43%     1.40%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0edda
+    1.43%     0.01%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0edc0
+    1.42%     1.42%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0ee3b
+    1.41%     0.00%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0eddd
+    1.37%     1.36%  xemu          [JIT] tid 2630951        [.] 0x00007ff0b4d0edfd
+    1.12%     0.00%  xemu          [unknown]                [.] 0x4781270047812700
+    1.12%     0.00%  xemu          [unknown]                [.] 0xc6483000c6483000
+    1.06%     0.00%  xemu          libc.so.6                [.] __GI___ioctl_time64
     1.04%     0.93%  xemu          xemu                     [.] helper_mulps
     1.02%     0.96%  xemu          xemu                     [.] helper_paddw_mmx
+    1.01%     0.91%  xemu          xemu                     [.] parts64_round_to_int_normal.constprop.0
+    1.00%     0.00%  xemu          [unknown]                [.] 0xffffffffb0b3e981

(everything below this is < 1 percent of the CPU time.)

The things without symbols are probably JIT'd code from qemu, which would make sense if this is just chewing through video data on the CPU.

The most notable thing here is the clock_gettime() call taking seven percent of the processing time. That is probably worth exploring! If this is just the loop in sdl2_gl_refresh, though, it means we're not spending much time on the CPU per-frame and it's just spinning to keep the rendering at 60Hz, and the CPU emulation is not the problem in this case because we're clearly waiting around with nothing to do.
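To illustrate why a frame limiter can show up as clock_gettime() in a profile, here is a minimal sketch of a busy-wait pacing loop. This is not xemu's actual sdl2_gl_refresh code; `now_ns` and `spin_until_frame_end` are hypothetical names, and the loop is only an assumption about what such a spin could look like.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical frame limiter: busy-wait until `frame_ns` nanoseconds have
 * passed since `frame_start`. Returns how many times the clock was read
 * while waiting; every one of those reads is a clock_gettime() call.
 */
static uint64_t spin_until_frame_end(int64_t frame_start, int64_t frame_ns)
{
    uint64_t polls = 0;

    assert(frame_ns > 0);
    do {
        polls++;
    } while (now_ns() - frame_start < frame_ns);

    return polls;
}
```

Calling `spin_until_frame_end(now_ns(), 16666667)` waits out roughly one 60Hz frame. The more idle time is left in a frame, the more clock reads pile up, so a large clock_gettime share would be consistent with waiting rather than with slow decoding.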

After that, it's worth looking at helper_cvtps2pi, float32_add, etc., and seeing if there's something that will make them faster. Small tweaks there will likely result in significant improvements because of the sheer volume of these function calls per frame. Is there some way we can inline these functions (or coerce the JIT to inline them instead of calling them as functions)? Or get them to use actual host SIMD instructions if they are currently emulating them without it?
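For a sense of what one of these helpers has to do per call, here is a minimal scalar sketch of the PACKUSWB operation on a 64-bit MMX register (pack signed 16-bit words into unsigned bytes with saturation). It is not qemu's helper_packuswb_mmx; the `MMXReg` layout and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 64-bit MMX register, viewed as four words or eight bytes. */
typedef union {
    int16_t w[4];
    uint8_t b[8];
} MMXReg;

static_assert(sizeof(MMXReg) == 8, "an MMX register is 64 bits");

/* Clamp a signed 16-bit value into the unsigned 8-bit range. */
static uint8_t saturate_u8(int16_t v)
{
    if (v < 0) {
        return 0;
    }
    if (v > 255) {
        return 255;
    }
    return (uint8_t)v;
}

/*
 * PACKUSWB semantics: the low four result bytes come from the destination
 * words, the high four from the source words, each saturated to 0..255.
 */
static void packuswb_scalar(MMXReg *dst, const MMXReg *src)
{
    MMXReg r;

    for (int i = 0; i < 4; i++) {
        r.b[i] = saturate_u8(dst->w[i]);
        r.b[i + 4] = saturate_u8(src->w[i]);
    }
    *dst = r;
}
```

On real hardware this is a single instruction. Done as a function call with eight clamps per invocation, and repeated for every block of every video frame, it is easy to see how helpers like this could add up in a profile.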

I'm also curious whether the decoding library was intended to use the GPU to aid in decoding and is falling back to software decoding because something isn't hooked up. This era of gaming might not have been that savvy about GPU processing (or not had enough GPU to do it), but maybe there's some basic check that is unexpectedly failing and putting us on a slow path that actual consoles never used. Don't know.

Failing all else: maybe we detect this library and use HLE to handle decoding in native code without going through qemu...? Seems like a big lift and a lot of trouble, but I guess it's a possible solution.

Anyhow, I have no answers or patches at the moment, just adding information and ideas to this bug report.

Triticum0 commented 2 years ago

This issue is horrible for unskippable cutscenes.