Splinter Cell Double Agent: libnvidia-glcore.so crash

Fabxx commented 2 years ago

Bug Description

While playing SC:DA, on the Sea of Okhotsk level or JBA HQ part 1/2, I always get this stack trace in the same spots no matter what: Nvidia_GL_crash_log.txt

(The log is from 0.6.2 master, but it happens on the latest master as well, with the same nv2a instructions.)

Spots screen: JBA HQ Downstairs

Crawl space at the beginning of Sea of Okhotsk:

Expected Behavior

The game should not crash in these spots, and xemu should not bring down the NVIDIA GL driver.

xemu Version

0.6.3-8-g30a872fa83

System Information

CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
OS Platform: Linux
OS Version: Manjaro Linux
Manufacturer: NVIDIA Corporation
GPU Model: NVIDIA GeForce GTX 970/PCIe/SSE2
Driver: 4.0.0 NVIDIA 510.60.02
Shader: 4.00 NVIDIA via Cg compiler

Additional Context

No response

Fabxx commented 2 years ago

Update: apparently this crash is not dependent on the EEPROM, since restoring the backup doesn't fix it anymore. I've experienced these crashes only in SC_DA and CT, so I'm changing the title to make it more appropriate to the context.

abaire commented 2 years ago

I was able to reproduce this with @revix-0's help, also on an NVIDIA card in Linux (1070, with an older driver, 470.103.01):

1. Start a new game and progress into the ice caves until just after you have to boost your teammate up onto the ledge (just after eliminating the first enemy).
2. Save the game state in the monitor.
3. Progress a bit farther until the dialog starts.
4. Immediately turn around and run back to the cave entrance.

In my case this did not crash. However, if I restart xemu (close the program entirely and reopen it), then load the state, then repeat steps 3 and 4, I get a consistent crash.

Furthermore, once this has happened, it will happen on a new game without using the save state (indicating that something is likely corrupted in the cache).

Running in the debugger, I get a segfault in pgraph.c's pgraph_upload_surface_data on the final glTexImage2D call.

width: 447
height: 447
surface->fmt:
    fmt = {SurfaceFormatInfo}
    bytes_per_pixel = {unsigned int} 2 [0x2]
    gl_internal_format = {GLint} 33189 [0x81a5] (GL_DEPTH_COMPONENT16)
    gl_format = {GLenum} 6402 [0x1902] (GL_DEPTH_COMPONENT)
    gl_type = {GLenum} 5123 [0x1403] (GL_UNSIGNED_SHORT)
    gl_attachment = {GLenum} 36096 [0x8d00] (GL_DEPTH_ATTACHMENT)

gl_read_buf is 0x7fff2b613460 and contains plausible looking data.

The unnamed frames (addresses only) are presumably the NVIDIA driver. Note that I was testing this on my work branch, which has some additional PRs applied, but the original report was against master.

0x00007fffee6a5cbb
0x00007fffee1c285d
0x00007fffee1cc871
0x00007fffee2e1730
0x00007fffee2ecb7d
0x00007fffee2ee0f4
0x00007fffee2ee478
pgraph_upload_surface_data pgraph.c:5750
pgraph_update_surface pgraph.c:6147
pgraph_NV097_CLEAR_SURFACE_handler pgraph.c:3566
pgraph_method pgraph.c:1149
pfifo_run_puller pfifo.c:230
pfifo_run_pusher pfifo.c:342
pfifo_thread pfifo.c:494
qemu_thread_start qemu-thread-posix.c:541
start_thread 0x00007ffff6733947
clone 0x00007ffff67c3a44

Fabxx commented 2 years ago

Update: it looks like reinstalling the drivers and running xemu for the first time afterwards avoids the crash, but as soon as you restart xemu, the crashes start again. I tested with 3XX, 4XX and 5XX drivers.

Update 2: other steps to reproduce the crash:
- Reinstall the NVIDIA driver.
- Run xemu for the first time; it will not crash on the first run after a clean install.
- Close it entirely, even without loading a game first.
- Reopen xemu, load Double Agent, and follow abaire's steps.
- Crash.

abaire commented 2 years ago

Importantly, you do not need to mess with save states at all and the bad state persists even with a clean HDD image.

Reinstall your NVIDIA driver (I have not personally tested this part)
Download the xemu stub HDD image from the website
Start xemu w/ the stub image so we know that we have a totally clean guest state.
Close xemu without loading any game
Start xemu w/ SC:DA, start a new game, notice that you'll need to create a profile (confirming that the HDD image is blank)
Progress through the first level, past the part where you get pulled up to the ledge, until you get the dialog about the helicopter arriving
Immediately turn around and run back towards the entrance

xemu will crash reliably with the stack trace above.

@revix-0 mentioned that if you reinstall the driver, use a blank HDD image, play any game and then load SC:DA without restarting xemu, the crash is not reproducible.

Once in a crashing state, the crash seems mostly but not 100% reproducible on my machine. I have had a couple instances where I was able to get all the way back to the starting area without a crash, but more cases where it does crash in the same way.

Fabxx commented 2 years ago

Update: a user tested this on an AMD RX 560 and it doesn't crash. Also, a note on the HDD: you don't need a blank image to avoid the crash. Reinstalling the driver and running xemu for the first time is sufficient; on the second run it will crash anyway. We suspect that this is an NVIDIA-specific issue.

BigBrainAFK commented 2 years ago

I personally could not reproduce this with xemu 0.6.6, NVIDIA 512.59 drivers and a 2080 Ti.

xemu_2022-05-03_19-00-09

I first tried only the snapshot method, but even after retrying from a clean HDD three times I was not able to trigger this.

Fabxx commented 2 years ago

UPDATE: by removing the if condition below in pgraph.c, which makes rendering slower, the race condition no longer happens and libnvidia-glcore no longer crashes on the affected levels. An alternative is to play with upscaling at 2x, which also slows things down a bit and doesn't crash either.


    if (!pgraph_surface_to_texture_can_fastpath(surface, texture_shape)) {
        pgraph_render_surface_to_texture_slow(d, surface, texture,
                                              texture_shape, texture_unit);
        return;
    }

Fabxx commented 2 years ago

UPDATE: it looks like it always crashes at this instruction:

GLDBG[MARKER][NOTIFICATION]> nv2a: pgraph method (0): 0x97 -> 0x1d94 NV097_CLEAR_SURFACE[0] 0x1

where a z24s8 fixed-integer value is being handled. This is possibly a race condition while the game swaps between 32-bit and 16-bit texture buffers, or incorrect handling of the fixed integer.

UPDATE 2: it's a 16-bit texture, and the Z buffer state remains at 1, which is correct.

nv2a: [RAM->GPU] ZETA (lin) surface @ 28dc000 (w=457,h=457,p=1024,bpp=2)

The crash always happens with a 16-bit texture with a pitch of 1024.

We suspect that the texture is being destroyed before upload.

UPDATE 3: the main calls on the crashing path, following the GDB stack:

pgraph_upload_surface_data(d, pg->zeta_binding, false);

glTexImage2D(GL_TEXTURE_2D, 0, surface->fmt.gl_internal_format, width, height, 0, surface->fmt.gl_format, surface->fmt.gl_type, gl_read_buf);

pgraph_update_surface(d, true, write_color, write_zeta)

Fabxx commented 2 years ago

Adding the log with the 6 GDB Frames of the nvidia driver while processing the data: libnvidia frames.txt
