Mesh shader discussion - Githubissues

Venemo commented 2 years ago

Not sure what is the best place to talk about this so decided maybe we can discuss it here. Hope this is okay.

Looking at the current code, I noticed that the mesh shader workgroup size is 64, but the shader has: max_vertices = 64, max_primitives = 124 this means that the shader is going to have poor occupancy on AMD HW, effectively leaving 50% of shader invocations under-utilized. Note that this is also suboptimal on NVidia HW which prefers a workgroup size of 32.

I recommend to have a compile-time constant for each of these values (similar to what you do for MESH_WGSIZE) and configure it like this:

AMD: workgroup size == max output vertices == max output primitives >= 64, all outputs accessed by local invocation index
Nvidia: workgroup size = 32, max output vertices = 64, max output primitives >= 64

You can achieve this by using a "compile-time loop" (a loop using the compile-time constants) which will be optimal on both AMD and NVidia.

Venemo commented 2 years ago

Also, you don't have to hardcode these in your program, you can use maxPreferredMeshWorkGroupInvocations and prefersLocalInvocation*Output from the device properties to determine these things.

zeux commented 2 years ago

One issue with this is that all of this requires rebuilding meshlet data, something that would ideally be done offline. It looks as if on AMD hardware specifically the shader is export-bound atm, I've looked a little bit at adding per-triangle culling and it does help performance significantly.

Without per-triangle culling I don't seem to get any benefit from moving to the same number of primitives; with it however I do get better throughput with 64 max primitives, but that's a little too low. I'll test different configurations when I get time. Thanks for the suggestion!

zeux commented 2 years ago

I'm also wondering what happens on NV specifically from work groups with size 64 vs 32 - and whether the shader that the driver runs is substantially different wrt performance from a shader that uses work group size 32 but has to process two vertices per invocation (which is what the shader I used for NV extension did, but there it wasn't possible to test wider groups because NV extension requires work group size of 32 if I'm not mistaken).

Venemo commented 2 years ago

No, it doesn't require rebuilding meshlet data. A workable compromise is if you use a meshlet size of 64 (max 64 vertices and max 64 primitives). In this case on NVidia you would output 1 meshlet per workgroup and on AMD you could output 2 meshlets per workgroup.

I personally haven't tested this but would be interesting to compare how different configs perform.

Venemo commented 2 years ago

To your NVidia question: this is explained in one of their mesh shader blogs. As far as I understand NVidia's problem is that it doesn't have proper workgroups, so the whole mesh shader workgroup is executed in a single warp and they emulate the workgroup using a loop.

Therefore, you can get closer to what NVidia hardware actually runs if you use a workgroup size that matches their warp size.

zeux commented 2 years ago

No, it doesn't require rebuilding meshlet data. A workable compromise is if you use a meshlet size of 64 (max 64 vertices and max 64 primitives). In this case on NVidia you would output 1 meshlet per workgroup and on AMD you could output 2 meshlets per workgroup.

What I meant is that varying the max sizes between vendor requires different meshlet data. Just varying workgroup configurations doesn't of course.

64 & 64 is a little problematic depending on the mesh topology - I'd expect that the 64 vertex limit leads to an effective primitive count between 64 and 98 (98 corresponds to an 8x8 grid). Setting the primitive count limit of 64 limits to something like 45 vertices per meshlet for smooth meshes, so you end up underutilizing the threads for vertex transformation.

One other alternative is something like 128 vertices and 192 primitives, which is more balanced wrt the ratio, but still problematic because now this means we need to write all vertex data to LDS :)

To your NVidia question: this is explained in one of their mesh shader blogs. As far as I understand NVidia's problem is that it doesn't have proper workgroups, so the whole mesh shader workgroup is executed in a single warp and they emulate the workgroup using a loop.

Right, but a work group of 64 would be compiled into two sequential invocations of 32 elements each, vs a shader that uses more or less the same loop if it needs to process a meshlet with >32 vertices/primitives. I understand that using a work group of 64 doesn't match the hardware perfectly, but the question is where the resulting inefficiencies come from.

Venemo commented 2 years ago

What I meant is that varying the max sizes between vendor requires different meshlet data. Just varying workgroup configurations doesn't of course.

Yes, the trick is to find a meshlet size which can work fine on both vendors, and then you can use the same meshlet size but with a slightly different workgroup config.

64 & 64 is a little problematic depending on the mesh topology - I'd expect that the 64 vertex limit leads to an effective primitive count between 64 and 98 (98 corresponds to an 8x8 grid). Setting the primitive count limit of 64 limits to something like 45 vertices per meshlet for smooth meshes, so you end up underutilizing the threads for vertex transformation.

One other alternative is something like 128 vertices and 192 primitives, which is more balanced wrt the ratio, but still problematic because now this means we need to write all vertex data to LDS :)

I think it's worth to experiment with a meshlet size of: max vertices = 128, max primitives = 128 and then use a 128-sized workgroup on AMD and 32 (or 64?) on NVidia.

Right, but a work group of 64 would be compiled into two sequential invocations of 32 elements each, vs a shader that uses more or less the same loop if it needs to process a meshlet with >32 vertices/primitives. I understand that using a work group of 64 doesn't match the hardware perfectly, but the question is where the resulting inefficiencies come from.

Unfortunately I don't know any more details other than what I told above, only that this is their recommendation.

Venemo commented 2 years ago

I think it's worth to experiment with a meshlet size of: max vertices = 128, max primitives = 128 and then use a 128-sized workgroup on AMD and 32 (or 64?) on NVidia.

One more thought about this. If you definitely don't want to increase the number of max output vertices but you want to use max 128 output primitives, it is still worth it (on AMD) to increase the workgroup size to 128 and make your primitive processing more parallel than it currently is.

zeux commented 2 years ago

Can you elaborate on why on AMD there's a benefit to going above 64? It's not intuitively obvious that this should help as 64 (and sometimes 32) is the HW wavefront size.

Venemo commented 2 years ago

On RDNA2 each invocation can only really create max 1 vertex and 1 primitive. Any other kind of access pattern is emulated by the driver. This also implies that it may need to launch more invocations than your specified workgroup size in order to fit a larger output.

If you have a workgroup size of 64 but a max primitive count of 126 then the "real" workgroup size will be 126 (this fits 2 waves, which have 128 invocations):

First 64 invocations will execute the code you wrote
Next 62 invocations will sit there, just waiting for the first 64 to finish executing your code, then they will output 1 primitive
The last 2 invocations will be just sitting there deactivated the whole time

So, in fact there are 128 invocations running but you don't utilize all of them. It is more efficient to write your code in a manner that utilizes all invocations instead of letting them sit there doing nothing most of the time.

I try to explain this in my blog post "How mesh shaders are implemented in an AMD driver".

zeux commented 2 years ago

Ah, that explains a lot! It's indeed substantially different compared to NV model. I didn't realize that the restriction on emission also applies to primitives, as I thought it's just the vertices.

Venemo commented 2 years ago

It seems that a few others also struggle to understand this, eg. GravityMark has the same problem. So I think I explained it poorly... Can you suggest a good way to edit my blog post to clarify this?

zeux commented 1 year ago

By the way, at least in radv it looks like mesh shaders are always compiled with wave size 64. Do you know if this is a hardware restriction or a driver limitation? I can't currently test any other AMD drivers with mesh shading support...

The reason I ask is I was hoping for something like max_vertices=64 max_triangles=96 to work reasonably well with wave32 but it looks like this is inefficient as it effectively uses the same wave configuration as max_vertices=64 max_triangles=124.

zeux commented 1 year ago

Also based on https://github.com/GPUOpen-Drivers/llpc/commit/772eef3ecbb5d294ba033a1da00a06526a3a31e1 my understanding is that on GFX11 (RDNA3) row export would allow emitting more than one vertex or primitive per thread, which would be great as it would provide the much needed flexibility wrt balancing performance. Not sure if GFX11 has other relevant changes for mesh shading.

Venemo commented 1 year ago

By the way, at least in radv it looks like mesh shaders are always compiled with wave size 64. Do you know if this is a hardware restriction or a driver limitation?

It's just the default in our driver. You can use the RADV_PERFTEST=gewave32 environment variable to use Wave32 mode for geometry processing shaders.

The reason I ask is I was hoping for something like max_vertices=64 max_triangles=96 to work reasonably well with wave32

Worth a try. Yes it would be inefficient in Wave64 mode. Maybe we should add special casing for 32 and 96.

my understanding is that on GFX11 (RDNA3) row export would allow emitting more than one vertex or primitive per thread

This is correct, but I haven't implemented that in RADV yet. (I am on vacation this week and will get back to work next week.) However, it will still need some shuffling between SIMD lanes.

Not sure if GFX11 has other relevant changes for mesh shading.

Yes, it also has a new "fast launch" mode, which will eliminate the need for launching shader invocations that "do nothing".

zeux commented 1 week ago

I've switched to an RDNA3 GPU, didn't get a chance to look into this a lot yet but a couple observations:

It looks like the performance is much more consistent between different meshlet configurations; this should make it much easier to tune which is good.
Software triangle culling in mesh shader doesn't seem as relevant as it was on RDNA2. This is also great, because it reduces the amount of tradeoffs involved wrt occupancy/ALU et al.

Overall things look closer to what I expected from an NV GPU so that's good news.

There's also two issues I ran into:

On radv, it looks like the mesh shading path in the default configuration (64/64) is a little slower vs AMDVLK with the same code. I see something like ~7.9ms for the render pass (4K full screen, 7900 GRE) with radv, and something like ~7.7ms with AMDVLK (or AMDGPU-PRO) on Linux. I am running into some unrelated performance issues on Windows so can't test that yet. Haven't had time to investigate this yet.
On all 3 Linux drivers and 1 Windows driver, it looks like unlike RDNA2, the cost of dispatching "empty" mesh workgroups via EmitMeshTasksEXT(0, 1, 1) is very high, much higher than NVidia or RDNA2 (at least on radv from last year; I don't have a capable RDNA2 to compare directly now...). This is unfortunate; I'm not sure if this is a hardware issue (which seems more likely as all drivers have the same behavior) or if this can be worked around in the drivers or shaders somehow. This impacts two-phase cluster occlusion culling, which means I'd need to figure out how to rework it - probably that means not using the task shaders in the first place :(

zeux / niagara

Mesh shader discussion #30