xenia-project / xenia

Xbox 360 Emulator Research Project
https://xenia.jp

Vulkan-based GPU emulation implementation #2028

Open Triang3l opened 2 years ago

Triang3l commented 2 years ago

For GPU emulation on GNU/Linux and Android, as well as Windows 8, and Windows 7 until 12on7 is integrated into Xenia, it's necessary to use either OpenGL or Vulkan.

OpenGL and OpenGL ES can be considered out of the question for various reasons: long GLSL compilation times; difficulties in caching and restoring combinations of shaders and states (state application may be deferred until a draw call) to prevent stuttering; and extremely hacky multithreading that makes decoupling GPU emulation from UI thread presentation nearly impossible. So, a Vulkan-based GPU emulation implementation is needed.

The quick route, running the Direct3D 12 backend on VKD3D, currently isn't functional due to what appear to be bugs in VKD3D (such as geometry stage linkage errors) and because of certain unimplemented features (like the 32x32=64 integer multiplication in shaders that we use for integer division by a constant in EDRAM tile addressing). We most definitely must not try to work around those issues in our code, as that would harm both Xenia (hacks would be added to the code) and VKD3D (which would lose a test case for certain relatively obscure parts of Direct3D 12). At some point, those issues should be fixed inside VKD3D.
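As an aside, the division-by-a-constant trick mentioned above can be illustrated on the CPU. Below is a minimal sketch, not Xenia's actual shader code: the function name is hypothetical, and the 80-sample EDRAM tile pitch is used only as an illustrative divisor.

```cpp
#include <cassert>
#include <cstdint>

// Divides a 32-bit value by the EDRAM tile pitch (80 samples per tile row for
// 32bpp surfaces) without a hardware integer divide, via a 32x32=64-bit
// multiplication by a precomputed "magic" reciprocal. With the constant
// M = ceil(2^38 / 80) = 0xCCCCCCCD, the identity n / 80 == (n * M) >> 38
// holds for every 32-bit n (the rounding error stays below one ULP of the
// quotient across the whole 32-bit range).
uint32_t DivideByEdramTilePitch(uint32_t n) {
  return uint32_t((uint64_t(n) * UINT64_C(0xCCCCCCCD)) >> 38);
}
```

In a shader, the 64-bit product is what the 32x32=64 multiplication instructions provide; this is the lowering that was missing in VKD3D at the time.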

However, even if VKD3D were feature-complete enough for Xenia to work on it, it would still give us only GNU/Linux support on the PC. VKD3D requires a Vulkan implementation that supports everything mandatory in Direct3D 12, and it also makes use of Vulkan extensions designed specifically for it, such as mutable-type descriptors and swizzled 4:4:4:4 formats. Android is a completely different world, however. While the Direct3D 12 API requires the hardware to support at least Direct3D feature level 11_0 (plus descriptor indexing and SRV swizzle), the baseline requirements of Vulkan match OpenGL ES 3.1, which are in some ways above and in some ways below those of Direct3D feature level 10_0 — and it seems like there are simply no optional Vulkan features that are supported universally by all Vulkan implementations on Android.

While porting Xenia to just the PC subset of Vulkan implementations is straightforward, what will make this project fun and challenging is providing fallbacks for all optional features, so most games can be played on Xenia even on the base Vulkan 1.0 implementation with no optional features and the lowest possible values of the limits, and also with all VK_KHR_portability_subset features disabled. There seems to be nothing that completely blocks the core architecture of Xenia's GPU emulation from running on the base Vulkan 1.0 configuration, fallbacks just need to be written.

Note that supporting Vulkan does not mean dropping Direct3D. I'm planning to support both the Direct3D 12 and the Vulkan backends equally actively at least for as long as I'm working on Xenia. The main functional reason to support both is that on Microsoft's non-deprecated platforms, Direct3D 12 is supported more widely than Vulkan. Configurations that support Direct3D 12, but not Vulkan, include the UWP on the Xbox One and the Xbox Series (the UWP on the Xbox One also requires legacy DXBC shaders, not DXIL), Windows devices on Qualcomm's SoCs (which will naturally become one of the targets once Arm CPUs are supported by Xenia in general), and Windows 10/11 on the Nvidia Fermi architecture (Vulkan is supported only starting with Kepler) and on Intel HD Graphics 4200+ (no Vulkan support until some generation of the Intel UHD Graphics — and, as far as I know, the Intel HD Graphics drivers require DXBC shaders, though I'm not entirely sure). Also, one backend can serve as a backup if there are issues with the other (primarily when driver bugs are encountered). Certain new features are also added to the Direct3D 12 API more quickly than to Vulkan. For Xenia, the most prominent example is rasterizer-ordered views / fragment shader interlock, used by the high-accuracy, strict emulation path (the default before the host render target emulation path redesign) for the framebuffer logic of the Xenos with all its custom pixel formats. This feature existed in Direct3D 12 from its very beginning, but Vulkan received it only in 2019, by which point Xenia had already been utilizing it heavily for half a year. And even now, because Vulkan, unlike Direct3D, has no concept of feature levels with mandatory features, AMD refuses to implement this functionality on Vulkan and OpenGL, even though it already works on their hardware on Direct3D and Metal.
Additionally, to keep the project maintainable while supporting multiple GPU APIs, we need to place as much code as we can in API-agnostic classes rather than in API-specific implementations — something I haven't really been doing while working on the current architecture of GPU emulation. This will make it more straightforward in the future to add support for other GPU APIs, such as Metal, and maybe even WebGPU in case we decide to bring Xenia to the web, if web browsers provide sufficient functionality, for instance, for thread synchronization. As a side note, Metal will also benefit from our 100% Vulkan device coverage approach, as it doesn't expose geometry shaders, in particular.

The porting work is done on the vulkan branch, and it can be separated into multiple stages, each with its core goal and a checklist of things that need to be implemented. These don't have to be done strictly in order, though, as they don't depend on each other as a whole, and the Vulkan development branch won't be merged into the main one until the Vulkan backend is at least on par with Direct3D 12 anyway (update: it has since been merged, for ease of maintenance, once it became functionally roughly on par with the previous Vulkan backend).

Stage 1: From draw commands to presentation

At the first stage, the emulator should be made interactive so testing can be done much more rapidly than by just taking RenderDoc captures after blindly doing some sequence of button presses.

To achieve that, we need to get all parts of Xenia's frame architecture, displayed below, functional at least at some basic level required to pass data between the stages, while disregarding the details of the subsystems themselves.

The lines in the graph reflect the current state of the data flow in the Vulkan backend in Xenia — thick lines represent the data paths already fully implemented in Xenia on Vulkan (though the subsystems themselves that they link may still be in a very early state).

flowchart TD
    PrimitiveProcessor[Primitive processor]
    SharedMemory[Shared memory]
    TextureCache[Texture cache]
    DrawPipeline[Draw command pipeline]
    RenderTargetCache[Render target cache]
    Resolve[Render target to texture resolve]
    Presentation[Presentation]

    PrimitiveProcessor ==>|&gt; Topology, converted index buffers| DrawPipeline

    SharedMemory ===|&lt; Tiled framebuffer texture data| Resolve

    SharedMemory ===|&gt; Index and vertex buffers<br/>&lt; Memory export| DrawPipeline
    SharedMemory ==>|&gt; Guest texture data| TextureCache
    TextureCache ==>|&gt; Host textures| DrawPipeline

    DrawPipeline ==>|&gt; Fragment depth/stencil and color| RenderTargetCache

    RenderTargetCache ==>|&gt; Render target data| Resolve

    TextureCache =====>|&gt; Host frontbuffer texture| Presentation

style PrimitiveProcessor fill:PaleGreen,stroke:Green
style SharedMemory fill:PaleGreen,stroke:Green
style TextureCache fill:PaleTurquoise,stroke:Turquoise
style DrawPipeline fill:PaleTurquoise,stroke:Turquoise
style RenderTargetCache fill:PaleTurquoise,stroke:Turquoise
style Resolve fill:PaleTurquoise,stroke:Turquoise
style Presentation fill:PaleTurquoise,stroke:Turquoise
linkStyle 0 stroke:Lime
linkStyle 1 stroke:Lime
linkStyle 2 stroke:Lime
linkStyle 3 stroke:Lime
linkStyle 4 stroke:Lime
linkStyle 5 stroke:Lime
linkStyle 6 stroke:Lime
linkStyle 7 stroke:Lime

(The link between render target resolving and the shared memory should be a backward one, but Mermaid places the shared memory in the middle if such a link is created — assume that there's an arrow pointing towards the shared memory there.)

Essentially, what's needed at this stage is that the rendering result is available in two places — in presentation to the screen and, very importantly, in intermediate fullscreen copying and post-processing passes (fullscreen rectangles must be functional) — and that orientation in 3D worlds in games is possible.

Stage 2: Direct3D 12 parity

These tasks need to be done to achieve compatibility and performance on par with the Direct3D 12 backend on the PC, on hardware that supports both APIs. This list may include some tasks that are specific to Vulkan, but that still must be done to handle differences between the drivers for feature parity. An example is unorm24 depth, which is somehow implemented and exposed on AMD GPUs on Direct3D, but isn't available there on Vulkan.

Once this stage is done, it may become possible to merge the vulkan branch (as long as Vulkan physical devices that don't support features without fallbacks implemented in Xenia are excluded from selection). However, since we don't have Xenia working on operating systems other than Windows at all currently, there's no need to rush. I'm not sure what would be better from the news perspective — to have two blog post releases, one about just Vulkan support on the PC and the other about full device coverage later; or to make the Vulkan release blog post sound much crazier from the very beginning by stating that we have full device variety coverage there.

Stage 3: Vulkan-specific parts

Some of this functionality needs to be implemented to take advantage of what the Vulkan API offers that Direct3D 12 doesn't (such as bit casting between image formats with the same texel size, but different bit layouts), but mostly these tasks must be done to cover 100% of possible Vulkan 1.0 (and VK_KHR_portability_subset) implementations, primarily on Android — possibly with slightly reduced performance and, in some cases, with noticeable but tolerable and not completely game-breaking visual correctness losses.

Bonus stage: Further improvements

This section mostly contains ideas for future research. Though these things are highly desirable, they will be highly experimental, may take a relatively long time, and are not absolutely necessary.

halotroop2288 commented 2 years ago

Holy hell, you are a passionate developer and writer. I love to see it. 🤩

eszlari commented 1 year ago

The main functional reason to support both is that on non-deprecated Microsoft's platforms, Direct3D 12 is supported more widely than Vulkan.

In case this isn't known: Microsoft is developing a Vulkan-to-D3D12 driver as part of the Mesa project, called "Dozen" (dzn):

https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/src/microsoft/vulkan

adding support for other GPU APIs, such as Metal

Vulkan-to-Metal:

https://github.com/KhronosGroup/MoltenVK

Wunkolo commented 1 year ago

no filtering of cubemaps across edges depending on a cvar

Just wanted to capture that VK_EXT_non_seamless_cube_map has grown to be pretty ubiquitous.

Triang3l commented 1 year ago

Note about one very specific edge case:

When we're converting depth to float24 directly in pixel shaders, with MSAA, the shader inherently has to run at sample frequency, so each sample outputs its own depth — otherwise anti-aliasing wouldn't work at intersections of polygons.

With memexport (#145), however, this means that memory writes may be done multiple times for the same pixel, which would cause data races.

In the Direct3D 12 backend, we have a workaround for this that restricts memory export to only the first sample (which is, with Direct3D's standard sample positions, the closest to the center — and for partially covered pixels, it also matches the centroid) of the pixel covered by the primitive — specifically, if (SV_SampleIndex == firstbitlow(SV_Coverage)). This guarantees that if a pixel is covered at least slightly, the memexport code will still run for it — just like it would normally do if the shader was executed at pixel frequency.
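The election can be modeled on the CPU. A minimal sketch with hypothetical names, assuming each per-sample invocation receives the full primitive coverage mask, as Direct3D specifies:

```cpp
#include <cassert>
#include <cstdint>

// CPU-side equivalent of HLSL firstbitlow(); mask must be nonzero.
int FirstBitLow(uint32_t mask) {
  int index = 0;
  while (!(mask & 1u)) {
    mask >>= 1;
    ++index;
  }
  return index;
}

// Models the Direct3D 12 backend's workaround: with sample-frequency shading,
// only the invocation whose sample index is the lowest set bit of the
// primitive coverage mask performs the memory export, so the export runs
// exactly once per covered pixel — mirroring the shader-side check
// `if (SV_SampleIndex == firstbitlow(SV_Coverage))`.
bool ShouldExport(uint32_t sample_index, uint32_t primitive_coverage) {
  return primitive_coverage != 0 &&
         sample_index == uint32_t(FirstBitLow(primitive_coverage));
}
```

For example, with 4x MSAA and a primitive covering only samples 2 and 3 (coverage 0b1100), only the invocation for sample 2 exports, and a pixel with no covered samples exports nothing.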

This is possible thanks to the fact that in Direct3D, the coverage input contains the primitive coverage regardless of whether the shader runs at pixel or sample frequency. Quoting the Direct3D 11.3 Functional Specification:

16.3.2 Input Coverage This is a bitfield, where bit i from the LSB indicates (with 1) if the current primitive covers sample i in the current pixel on the RenderTarget. Regardless of whether the Pixel Shader is configured to be invoked at pixel frequency or sample frequency, the first n bits in InputCoverage from the LSB are used to indicate primitive coverage, given an n sample per pixel RenderTarget and/or Depth/Stencil buffer is bound at the Output Merger.

Unfortunately, this approach won't work in Vulkan or OpenGL, because with sample shading, the sample mask input includes only the samples corresponding to the specific shader invocation — and in our case, will only contain 1 bit set in each invocation, and thus it can't be used by the shader to elect one per-sample invocation for a pixel. Quoting the Vulkan specification:

SampleMask It has a sample bit set if and only if the sample is considered covered for this fragment shader invocation.

So on Vulkan, it looks like we'll have to fall back to a less precise method — doing the memexport only for the sample at some constant index. This will result in memexport potentially not being done for some pixels, however.

In this case, I think the most preferable choice would be the sample closest to the center in the bottom-right quarter of the pixel. With the standard Vulkan sample locations, it's 0 with native 2x MSAA (and naturally 0 without MSAA), and 3 with 4x MSAA. The reason why the bottom-right sample is preferred is the half-pixel offset. If the game draws a rectangle, which is a good particular use case for memexport (basically running a 2D compute grid), if it has half-pixel offset enabled (by default on the Xbox 360, as it's the usual Direct3D 9 behavior) and draws the rectangle at integer pixel coordinates, the top-left pixel will only have the bottom-right sample covered — and if we choose any other sample, memexport won't be done for that pixel.
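The choice of that constant sample index can be sketched as follows, using the standard 1x, 2x, and 4x sample locations from the Vulkan specification (the function and constant names are hypothetical, not Xenia's code):

```cpp
#include <cassert>
#include <cstddef>

struct SampleLocation { float x, y; };

// Picks the sample whose standard location lies in the bottom-right quarter
// of the pixel (x >= 0.5, y >= 0.5) — the preferred constant memexport sample
// because of the half-pixel offset, as explained above. At 1x, 2x, and 4x
// there is at most one such sample, so the first match is the answer.
int BottomRightSample(const SampleLocation* locations, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    if (locations[i].x >= 0.5f && locations[i].y >= 0.5f) {
      return int(i);
    }
  }
  return 0;  // Fall back to sample 0 if no sample is in that quarter.
}

// Standard sample locations from the Vulkan specification.
const SampleLocation kLocations1x[] = {{0.5f, 0.5f}};
const SampleLocation kLocations2x[] = {{0.75f, 0.75f}, {0.25f, 0.25f}};
const SampleLocation kLocations4x[] = {
    {0.375f, 0.125f}, {0.875f, 0.375f}, {0.125f, 0.625f}, {0.625f, 0.875f}};
```

This reproduces the indices quoted above: sample 0 without MSAA and with native 2x MSAA, and sample 3 with 4x MSAA.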

However, I've never seen any game doing memexport in pixel shaders so far, let alone doing memexport with MSAA. So I think we shouldn't bother overcomplicating handling of this scenario via some crazy interpolator math, and just do something that will be localized in a small region of code simply to handle that more or less safely, not necessarily perfectly.

There seems to be no way to directly emulate the Direct3D sample mask behavior in Vulkan until a new Vulkan extension is created for this purpose. Note that VKD3D and DXVK are also inherently affected by this issue — the existing check in Xenia will not work on VKD3D, although since we have a native Vulkan backend in development, I personally don't think that we should be introducing VKD3D-specific code in Xenia, rather, we should be fixing the Vulkan specification and the translation layers themselves in such cases so they can achieve higher compatibility with all applications.

Triang3l commented 4 months ago

A small note regarding feature scalability: converting memexporting vertex shaders to compute shaders will be important not only on mobile GPUs, but also on the in-development Vulkan drivers for the old AMD Evergreen and Northern Islands (TeraScale 2/3, like the HD 5xxx/6xxx), where the very tight hardware UAV count limitation makes it impossible to support vertexPipelineStoresAndAtomics. It may also matter for Apple Silicon devices because of tiling, and especially because of the complicated interaction of potential vertexPipelineStoresAndAtomics with software-emulated geometry functionality like geometry shaders and transform feedback.

Triang3l commented 4 months ago

Additional to-do: For sparse binding, we need not only BindSparse>Submit semaphore synchronization, but probably Submit>BindSparse too, even within one queue; otherwise the page table may be modified while the previous graphics work (also referencing the same resource, like the shared memory buffer or resolution-scaled data) is still being executed, potentially causing corruption and GPU hangs. Also, queue family ownership transfers must be done if sparse binding operations are done on a different queue family (or the concurrent sharing mode should be used, though its performance implications for buffers are unknown).

hardBSDk commented 3 months ago

@Triang3l What is the current progress of the Vulkan renderer?

Triang3l commented 3 months ago

@Triang3l What is the current progress of the Vulkan renderer?

The first message in the thread describes the current state. I've just updated it to reflect that memory export is now supported, though without most downlevel device fallbacks yet.

PatrickvL commented 3 months ago

On the topic of texture format conversion and caching:

For any guest resource (of any type) that is converted into a host resource with a different size, the original guest size is still what should be used for accurately emulating guest cache eviction. (Perhaps this is already implemented, in which case I'm fine with deleting this remark.)

insaneninja117 commented 3 months ago

Would there be any preferable way to assist with debugging Vulkan issues via testing — for example, RenderDoc captures or Xenia.log submissions? One big issue of the new renderer versus the old one is constant crashing, with the Windows error dialog citing amdvlk64.dll as the culprit. The fault code is always the same, so there is something this renderer is doing that the AMD drivers do not like, whereas the previous rendition would fail (if it did) for other reasons.