xenia-project / xenia

Xbox 360 Emulator Research Project
https://xenia.jp

Vulkan-based GPU emulation implementation #2028

Open Triang3l opened 2 years ago

Triang3l commented 2 years ago

For GPU emulation on GNU/Linux and Android, as well as Windows 8, and Windows 7 until 12on7 is integrated into Xenia, it's necessary to use either OpenGL or Vulkan.

OpenGL and OpenGL ES can be considered out of the question for various reasons: long GLSL compilation times; difficulties in caching and restoring combinations of shaders and states (state application may be deferred until a draw call) to prevent stuttering; and extremely hacky multithreading that makes decoupling GPU emulation from UI thread presentation nearly impossible. So, a Vulkan-based GPU emulation implementation is needed.

The quick route, running the Direct3D 12 backend on VKD3D, currently isn't functional due to what appear to be bugs in VKD3D (such as geometry stage linkage errors) and because of certain unimplemented features (like the 32x32=64 integer multiplication in shaders that we use for integer division by a constant in EDRAM tile addressing). We most definitely must not try to work around those issues in our code, as that would harm both Xenia (hacks would be added to the code) and VKD3D (which would lose a test case for certain relatively obscure parts of Direct3D 12). At some point, those issues should be fixed inside VKD3D.
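As an aside, the division-by-a-constant trick mentioned above can be illustrated on the CPU. Below is a minimal sketch, not Xenia's actual shader code: the function name is hypothetical, and the 80-sample EDRAM tile pitch is used only as an illustrative divisor.

```cpp
#include <cassert>
#include <cstdint>

// Divides a 32-bit value by the EDRAM tile pitch (80 samples per tile row for
// 32bpp surfaces) without a hardware integer divide, via a 32x32=64-bit
// multiplication by a precomputed "magic" reciprocal. With the constant
// M = ceil(2^38 / 80) = 0xCCCCCCCD, the identity n / 80 == (n * M) >> 38
// holds for every 32-bit n (the rounding error stays below one ULP of the
// quotient across the whole 32-bit range).
uint32_t DivideByEdramTilePitch(uint32_t n) {
  return uint32_t((uint64_t(n) * UINT64_C(0xCCCCCCCD)) >> 38);
}
```

In a shader, the 64-bit product is what the 32x32=64 multiplication instructions provide; this is the lowering that was missing in VKD3D at the time.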

However, even if VKD3D were feature-complete enough for Xenia to work on it, it would still give us only GNU/Linux support on the PC. VKD3D requires a Vulkan implementation that supports everything mandatory in Direct3D 12, and it also makes use of Vulkan extensions designed specifically for it, such as mutable-type descriptors and swizzled 4:4:4:4 formats. Android is a completely different world, however. While the Direct3D 12 API requires the hardware to support at least Direct3D feature level 11_0 (plus descriptor indexing and SRV swizzle), the baseline requirements of Vulkan match OpenGL ES 3.1, which are in some ways above and in some ways below those of Direct3D feature level 10_0 — and it seems like there are simply no optional Vulkan features that are supported universally by all Vulkan implementations on Android.

While porting Xenia to just the PC subset of Vulkan implementations is straightforward, what will make this project fun and challenging is providing fallbacks for all optional features, so most games can be played on Xenia even on the base Vulkan 1.0 implementation with no optional features and the lowest possible values of the limits, and also with all VK_KHR_portability_subset features disabled. There seems to be nothing that completely blocks the core architecture of Xenia's GPU emulation from running on the base Vulkan 1.0 configuration, fallbacks just need to be written.

Note that supporting Vulkan does not mean dropping Direct3D. I'm planning to support both the Direct3D 12 and the Vulkan backends equally actively at least for as long as I'm working on Xenia. The main functional reason to support both is that on Microsoft's non-deprecated platforms, Direct3D 12 is supported more widely than Vulkan. Configurations that support Direct3D 12, but not Vulkan, include the UWP on the Xbox One and the Xbox Series (the UWP on the Xbox One also requires legacy DXBC shaders, not DXIL), Windows devices on Qualcomm's SoCs (which will naturally become one of the targets once Arm CPUs are supported by Xenia in general), and Windows 10/11 on the Nvidia Fermi architecture (Vulkan is supported only starting with Kepler) and on Intel HD Graphics 4200+ (no Vulkan support until some generation of the Intel UHD Graphics — and, as far as I know, the Intel HD Graphics drivers require DXBC shaders, though I'm not entirely sure). Also, one backend can serve as a backup if there are issues with the other (primarily when driver bugs are encountered). Certain new features are also added to the Direct3D 12 API more quickly than to Vulkan. For Xenia, the most prominent example is rasterizer-ordered views / fragment shader interlock, used by the high-accuracy, strict emulation path (the default before the host render target emulation path redesign) for the framebuffer logic of the Xenos with all its custom pixel formats. This feature existed in Direct3D 12 from its very beginning, but Vulkan received it only in 2019, by which point Xenia had already been utilizing it heavily for half a year. And even now, because Vulkan, unlike Direct3D, has no concept of feature levels with mandatory features, AMD refuses to implement this functionality on Vulkan and OpenGL, even though it already works on their hardware on Direct3D and Metal.
Additionally, to keep the project maintainable while supporting multiple GPU APIs, we need to place as much code as we can in API-agnostic classes rather than in API-specific implementations — something I haven't really been doing while working on the current architecture of GPU emulation. This will make it more straightforward in the future to add support for other GPU APIs, such as Metal, and maybe even WebGPU in case we decide to bring Xenia to the web, if web browsers provide sufficient functionality, for instance, for thread synchronization. As a side note, Metal will also benefit from our 100% Vulkan device coverage approach, as it doesn't expose geometry shaders, in particular.

The porting work is done on the vulkan branch, and it can be separated into multiple stages, each with its core goal and a checklist of things that need to be implemented. These don't have to be done strictly in order, though, as they don't depend on each other as a whole, and the Vulkan development branch won't be merged into the main one until the Vulkan backend is at least on par with Direct3D 12 anyway (update: it has since been merged, for ease of maintenance, once it became functionally roughly on par with the previous Vulkan backend).

Stage 1: From draw commands to presentation

At the first stage, the emulator should be made interactive so testing can be done much more rapidly than by just taking RenderDoc captures after blindly doing some sequence of button presses.

To achieve that, we need to get all parts of Xenia's frame architecture, displayed below, functional at least at some basic level required to pass data between the stages, while disregarding the details of the subsystems themselves.

The lines in the graph reflect the current state of the data flow in the Vulkan backend in Xenia — thick lines represent the data paths already fully implemented in Xenia on Vulkan (though the subsystems themselves that they link may still be in a very early state).

flowchart TD
    PrimitiveProcessor[Primitive processor]
    SharedMemory[Shared memory]
    TextureCache[Texture cache]
    DrawPipeline[Draw command pipeline]
    RenderTargetCache[Render target cache]
    Resolve[Render target to texture resolve]
    Presentation[Presentation]

    PrimitiveProcessor ==>|&gt; Topology, converted index buffers| DrawPipeline

    SharedMemory ===|&lt; Tiled framebuffer texture data| Resolve

    SharedMemory ===|&gt; Index and vertex buffers<br/>&lt; Memory export| DrawPipeline
    SharedMemory ==>|&gt; Guest texture data| TextureCache
    TextureCache ==>|&gt; Host textures| DrawPipeline

    DrawPipeline ==>|&gt; Fragment depth/stencil and color| RenderTargetCache

    RenderTargetCache ==>|&gt; Render target data| Resolve

    TextureCache =====>|&gt; Host frontbuffer texture| Presentation

style PrimitiveProcessor fill:PaleGreen,stroke:Green
style SharedMemory fill:PaleGreen,stroke:Green
style TextureCache fill:PaleTurquoise,stroke:Turquoise
style DrawPipeline fill:PaleTurquoise,stroke:Turquoise
style RenderTargetCache fill:PaleTurquoise,stroke:Turquoise
style Resolve fill:PaleTurquoise,stroke:Turquoise
style Presentation fill:PaleTurquoise,stroke:Turquoise
linkStyle 0 stroke:Lime
linkStyle 1 stroke:Lime
linkStyle 2 stroke:Lime
linkStyle 3 stroke:Lime
linkStyle 4 stroke:Lime
linkStyle 5 stroke:Lime
linkStyle 6 stroke:Lime
linkStyle 7 stroke:Lime

(The link between render target resolving and the shared memory should be a backward one, but Mermaid places the shared memory in the middle if such a link is created — assume that there's an arrow pointing towards the shared memory there.)

Essentially, what's needed at this stage is that the rendering result is available in two places — in presentation to the screen and, very importantly, in intermediate fullscreen copying and post-processing passes (fullscreen rectangles must be functional) — and that orientation in 3D worlds in games is possible.

Stage 2: Direct3D 12 parity

These tasks need to be done to achieve compatibility and performance on par with the Direct3D 12 backend on the PC, on hardware that supports both APIs. This list may include some tasks that are specific to Vulkan, but that still must be done to handle differences between the drivers for feature parity. An example is unorm24 depth, which is somehow implemented and exposed on AMD GPUs on Direct3D, but isn't available there on Vulkan.

Once this stage is done, it may become possible to merge the vulkan branch (as long as Vulkan physical devices that don't support features without fallbacks implemented in Xenia are excluded from selection). However, since we don't have Xenia working on operating systems other than Windows at all currently, there's no need to rush. I'm not sure what would be better from the news perspective — to have two blog post releases, one about just Vulkan support on the PC and the other about full device coverage later; or to make the Vulkan release blog post sound much crazier from the very beginning by stating that we have full device variety coverage there.

Stage 3: Vulkan-specific parts

Some of this functionality needs to be implemented to take advantage of what the Vulkan API offers that Direct3D 12 doesn't (such as bit casting between image formats with the same texel size, but different bit layouts), but mostly these tasks must be done to cover 100% of possible Vulkan 1.0 (and VK_KHR_portability_subset) implementations, primarily on Android — possibly with slightly reduced performance and, in some cases, with noticeable but tolerable and not completely game-breaking visual correctness losses.

Bonus stage: Further improvements

This section mostly contains ideas for future research. Though these things are highly desirable, they will be highly experimental, may take a relatively long time, and are not absolutely necessary.

halotroop2288 commented 2 years ago

Holy hell, you are a passionate developer and writer. I love to see it. 🤩

eszlari commented 1 year ago

The main functional reason to support both is that on non-deprecated Microsoft's platforms, Direct3D 12 is supported more widely than Vulkan.

In case this isn't known: Microsoft is developing a Vulkan-to-D3D12 driver as part of the Mesa project, called "Dozen" (dzn):

https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/src/microsoft/vulkan

adding support for other GPU APIs, such as Metal

Vulkan-to-Metal:

https://github.com/KhronosGroup/MoltenVK

Wunkolo commented 1 year ago

no filtering of cubemaps across edges depending on a cvar

Just wanted to capture that VK_EXT_non_seamless_cube_map has grown to be pretty ubiquitous.

Triang3l commented 1 year ago

Note about one very specific edge case:

When we're converting depth to float24 directly in pixel shaders, with MSAA, the shader inherently has to run at sample frequency, so each sample outputs its own depth — otherwise anti-aliasing wouldn't work at intersections of polygons.

With memexport (#145), however, this means that memory writes may be done multiple times for the same pixel, which would cause data races.

In the Direct3D 12 backend, we have a workaround for this that restricts memory export to only the first sample (which is, with Direct3D's standard sample positions, the closest to the center — and for partially covered pixels, it also matches the centroid) of the pixel covered by the primitive — specifically, if (SV_SampleIndex == firstbitlow(SV_Coverage)). This guarantees that if a pixel is covered at least slightly, the memexport code will still run for it — just like it would normally do if the shader was executed at pixel frequency.
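The election can be modeled on the CPU. A minimal sketch with hypothetical names, assuming each per-sample invocation receives the full primitive coverage mask, as Direct3D specifies:

```cpp
#include <cassert>
#include <cstdint>

// CPU-side equivalent of HLSL firstbitlow(); mask must be nonzero.
int FirstBitLow(uint32_t mask) {
  int index = 0;
  while (!(mask & 1u)) {
    mask >>= 1;
    ++index;
  }
  return index;
}

// Models the Direct3D 12 backend's workaround: with sample-frequency shading,
// only the invocation whose sample index is the lowest set bit of the
// primitive coverage mask performs the memory export, so the export runs
// exactly once per covered pixel — mirroring the shader-side check
// `if (SV_SampleIndex == firstbitlow(SV_Coverage))`.
bool ShouldExport(uint32_t sample_index, uint32_t primitive_coverage) {
  return primitive_coverage != 0 &&
         sample_index == uint32_t(FirstBitLow(primitive_coverage));
}
```

For example, with 4x MSAA and a primitive covering only samples 2 and 3 (coverage 0b1100), only the invocation for sample 2 exports, and a pixel with no covered samples exports nothing.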

This is possible thanks to the fact that in Direct3D, the coverage input contains the primitive coverage regardless of whether the shader runs at pixel or sample frequency. Quoting the Direct3D 11.3 Functional Specification:

16.3.2 Input Coverage This is a bitfield, where bit i from the LSB indicates (with 1) if the current primitive covers sample i in the current pixel on the RenderTarget. Regardless of whether the Pixel Shader is configured to be invoked at pixel frequency or sample frequency, the first n bits in InputCoverage from the LSB are used to indicate primitive coverage, given an n sample per pixel RenderTarget and/or Depth/Stencil buffer is bound at the Output Merger.

Unfortunately, this approach won't work in Vulkan or OpenGL, because with sample shading, the sample mask input includes only the samples corresponding to the specific shader invocation — and in our case, will only contain 1 bit set in each invocation, and thus it can't be used by the shader to elect one per-sample invocation for a pixel. Quoting the Vulkan specification:

SampleMask It has a sample bit set if and only if the sample is considered covered for this fragment shader invocation.

So on Vulkan, it looks like we'll have to fall back to a less precise method — doing the memexport only for the sample at some constant index. This will result in memexport potentially not being done for some pixels, however.

In this case, I think the most preferable choice would be the sample closest to the center in the bottom-right quarter of the pixel. With the standard Vulkan sample locations, it's 0 with native 2x MSAA (and naturally 0 without MSAA), and 3 with 4x MSAA. The reason why the bottom-right sample is preferred is the half-pixel offset. If the game draws a rectangle, which is a good particular use case for memexport (basically running a 2D compute grid), if it has half-pixel offset enabled (by default on the Xbox 360, as it's the usual Direct3D 9 behavior) and draws the rectangle at integer pixel coordinates, the top-left pixel will only have the bottom-right sample covered — and if we choose any other sample, memexport won't be done for that pixel.
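The choice of that constant sample index can be sketched as follows, using the standard 1x, 2x, and 4x sample locations from the Vulkan specification (the function and constant names are hypothetical, not Xenia's code):

```cpp
#include <cassert>
#include <cstddef>

struct SampleLocation { float x, y; };

// Picks the sample whose standard location lies in the bottom-right quarter
// of the pixel (x >= 0.5, y >= 0.5) — the preferred constant memexport sample
// because of the half-pixel offset, as explained above. At 1x, 2x, and 4x
// there is at most one such sample, so the first match is the answer.
int BottomRightSample(const SampleLocation* locations, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    if (locations[i].x >= 0.5f && locations[i].y >= 0.5f) {
      return int(i);
    }
  }
  return 0;  // Fall back to sample 0 if no sample is in that quarter.
}

// Standard sample locations from the Vulkan specification.
const SampleLocation kLocations1x[] = {{0.5f, 0.5f}};
const SampleLocation kLocations2x[] = {{0.75f, 0.75f}, {0.25f, 0.25f}};
const SampleLocation kLocations4x[] = {
    {0.375f, 0.125f}, {0.875f, 0.375f}, {0.125f, 0.625f}, {0.625f, 0.875f}};
```

This reproduces the indices quoted above: sample 0 without MSAA and with native 2x MSAA, and sample 3 with 4x MSAA.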

However, I've never seen any game doing memexport in pixel shaders so far, let alone doing memexport with MSAA. So I think we shouldn't bother overcomplicating handling of this scenario via some crazy interpolator math, and just do something that will be localized in a small region of code simply to handle that more or less safely, not necessarily perfectly.

There seems to be no way to directly emulate the Direct3D sample mask behavior in Vulkan until a new Vulkan extension is created for this purpose. Note that VKD3D and DXVK are also inherently affected by this issue — the existing check in Xenia will not work on VKD3D, although since we have a native Vulkan backend in development, I personally don't think that we should be introducing VKD3D-specific code in Xenia, rather, we should be fixing the Vulkan specification and the translation layers themselves in such cases so they can achieve higher compatibility with all applications.

Triang3l commented 4 months ago

A small note regarding feature scalability: converting memexporting vertex shaders to compute shaders will be important not only on mobile GPUs, but also on the in-development Vulkan drivers for the old AMD Evergreen and Northern Islands (TeraScale 2/3, like the HD 5xxx/6xxx), where the very tight hardware UAV count limitation makes it impossible to support vertexPipelineStoresAndAtomics. It may also matter for Apple Silicon devices because of tiling, and especially because of the complicated interaction of potential vertexPipelineStoresAndAtomics with software-emulated geometry functionality like geometry shaders and transform feedback.

Triang3l commented 4 months ago

Additional to-do: For sparse binding, we need not only BindSparse>Submit semaphore synchronization, but probably Submit>BindSparse too, even within one queue; otherwise the page table may be modified while the previous graphics work (also referencing the same resource, like the shared memory buffer or resolution-scaled data) is still being executed, potentially causing corruption and GPU hangs. Also, queue family ownership transfers must be done if sparse binding operations are done on a different queue family (or the concurrent sharing mode should be used, though its performance implications for buffers are unknown).

hardBSDk commented 3 months ago

@Triang3l What is the current progress of the Vulkan renderer?

Triang3l commented 3 months ago

@Triang3l What is the current progress of the Vulkan renderer?

The first message in the thread describes the current state. I've just updated it to reflect that memory export is now supported, though without most downlevel device fallbacks yet.

PatrickvL commented 3 months ago

On the topic of texture format conversion and caching:

For any guest resource (of any type) that is converted into a host resource with a different size, the original guest size is still what should be used for accurately emulating guest cache eviction. (Perhaps this is already implemented, in which case I'm fine with deleting this remark.)

insaneninja117 commented 3 months ago

Would there be any preferable way to assist with debugging Vulkan issues via testing — for example, RenderDoc captures or Xenia.log submissions? One big issue of the new renderer versus the old one is constant crashing, with the Windows error dialog citing amdvlk64.dll as the culprit. The fault code is always the same, so there is something this renderer is doing that the AMD drivers do not like, whereas the previous rendition would fail (if it did) for other reasons.