ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Add SIMD Support #903

Open tiehuis opened 6 years ago

tiehuis commented 6 years ago

Current Progress


SIMD is very useful for fast processing of data, and given Zig's goal of going fast, I think we need to look at exposing some way of using these instructions easily and reliably.

Status-Quo

Inline Assembly

It is possible to do SIMD in inline assembly as is. This is a bit cumbersome, though, and I think we should strive to be able to get these speedups in the Zig language itself.

Rely on the Optimizer

The optimizer is good, and comptime unrolling and support help a lot, but it doesn't provide guarantees that any specific code will be vectorized. You are at the mercy of LLVM, and you don't want to see your code take a huge hit in performance simply due to a compiler upgrade or change.
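To make the trade-off concrete: a plain element-wise loop like the one below is exactly what the autovectorizer targets, and LLVM will usually turn it into SIMD at ReleaseFast, but nothing in the language guarantees it (a minimal sketch in present-day Zig syntax):

```zig
const std = @import("std");

// A plain element-wise loop: a prime candidate for LLVM's
// autovectorizer at -O ReleaseFast, but vectorization is never
// guaranteed by the language itself.
fn addArrays(dst: []f32, a: []const f32, b: []const f32) void {
    for (dst, a, b) |*d, x, y| {
        d.* = x + y;
    }
}

pub fn main() void {
    var dst: [4]f32 = undefined;
    addArrays(&dst, &[_]f32{ 1, 2, 3, 4 }, &[_]f32{ 10, 20, 30, 40 });
    std.debug.assert(dst[0] == 11 and dst[3] == 44);
}
```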

LLVM Vector Intrinsics

LLVM supports vector types as first-class objects in its IR. These correspond to SIMD instructions. This provides the bulk of the work; for us, we simply need to expose a way to construct these vector types. This would be analogous to the `__attribute__((vector_size(N)))` attribute found in C compilers.
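Zig did eventually expose exactly this as the `@Vector` builtin, whose values lower to LLVM's first-class vector type. A minimal sketch of how it reads today:

```zig
const std = @import("std");

pub fn main() void {
    // Element-wise operators are defined directly on vector types,
    // mirroring LLVM's first-class vector IR values.
    const a: @Vector(4, f32) = .{ 1, 2, 3, 4 };
    const b: @Vector(4, f32) = .{ 5, 6, 7, 8 };
    const sum = a + b;

    // Vectors coerce to and from fixed-size arrays for memory access.
    const as_array: [4]f32 = sum;
    std.debug.assert(as_array[0] == 6 and as_array[3] == 12);
}
```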


If anyone has any thoughts on the implementation and/or usage, that would be great, since I'm not very familiar with how these are exposed by LLVM. It would be good to get some discussion going in this area, since I'm sure people would like to be able to match the performance of C in all areas with Zig.

LemonBoy commented 3 years ago

`@Vector(N, bool)` doesn't have `and`/`or` defined, nor `&`/`|`, making such vectors of questionable use.

You can `@bitCast` to `@Vector(N, u1)` and do whatever you want with the vector.
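Concretely, that workaround looks something like the sketch below (written with the current single-argument `@bitCast`, whose destination type is inferred from the result location):

```zig
const std = @import("std");

pub fn main() void {
    const a: @Vector(4, i32) = .{ 1, 2, 3, 4 };
    const b: @Vector(4, i32) = .{ 4, 3, 3, 1 };

    // Comparisons yield @Vector(4, bool), which supports neither
    // `and`/`or` nor `&`/`|`.
    const lt = a < b; // { true, true, false, false }
    const gt = a > b; // { false, false, false, true }

    // Round-trip through @Vector(4, u1) to combine the masks.
    const lt_bits: @Vector(4, u1) = @bitCast(lt);
    const gt_bits: @Vector(4, u1) = @bitCast(gt);
    const either: @Vector(4, bool) = @bitCast(lt_bits | gt_bits);

    std.debug.assert(either[0] and !either[2]);
}
```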

lemire commented 2 years ago

An interesting test for such an API would be whether one can implement useful artefacts beyond number crunching, like high-speed UTF-8 validation or base64 encoding/decoding.
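As a tiny taste of that kind of workload, even simple byte classification with vector compares exercises more of an API surface than pure arithmetic does. The sketch below (my own illustration, not part of any real validator) counts ASCII bytes in a 16-byte chunk:

```zig
const std = @import("std");

// Byte classification with vector compares: count ASCII (< 0x80)
// bytes in a 16-byte chunk. Far from full UTF-8 validation, but the
// same flavor of work those kernels are built from.
fn countAscii(chunk: @Vector(16, u8)) u8 {
    const threshold: @Vector(16, u8) = @splat(0x80);
    const is_ascii = chunk < threshold; // @Vector(16, bool)
    const ones: @Vector(16, u8) = @splat(1);
    const zeroes: @Vector(16, u8) = @splat(0);
    // Turn the bool mask into 0/1 bytes and sum them horizontally.
    return @reduce(.Add, @select(u8, is_ascii, ones, zeroes));
}

pub fn main() void {
    const all_ascii: @Vector(16, u8) = @splat('a');
    std.debug.assert(countAscii(all_ascii) == 16);
}
```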


A related issue is that instruction sets are evolving. For example, the latest AWS Graviton nodes support SVE/SVE2, and the most powerful AWS nodes support a full range of AVX-512 instruction sets (up to VBMI2).

If you build something that is unable to benefit from SVE2 or advanced AVX-512 instructions, then it might not be future-proof.

sharpobject commented 1 year ago

I agree emphatically with @lemire's comment above.

Even for current fixed-pattern byte shuffling with `@shuffle`, the resulting assembly seems quite bad, and I'm not sure what to write to get a SIMD load or store. I ported a 4x4 transpose to use `@shuffle` today here https://godbolt.org/z/j584eWsx6. I think it should be 4 loads, 8 instructions to do the transpose, and 4 stores, plus whatever other instructions to do with the calling convention. Every part of the function is a lot bigger than that :(

The "correct" output for this function would be more like this.

Sahnvour commented 1 year ago

> I ported a 4x4 transpose to use @shuffle today here https://godbolt.org/z/j584eWsx6. I think it should be 4 loads, 8 instructions to do the transpose, and 4 stores, plus whatever other instructions to do with the calling convention. Every part of the function is a lot bigger than that :(

It gets a lot better with `-O ReleaseFast` (`-Drelease-fast=true` is for build.zig files), cf. https://godbolt.org/z/d6YvTfYGj
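For reference, the transpose itself really is expressible in eight `@shuffle`s, using the classic unpack-lo/unpack-hi pattern. A sketch (recall that in a `@shuffle` mask, the negative value `~k` selects element `k` of the second operand):

```zig
const std = @import("std");

// 4x4 f32 transpose in 8 shuffles: two rounds of the classic
// unpack-lo/unpack-hi interleave. In a @shuffle mask, index i >= 0
// selects from the first operand and ~k (i.e. -k - 1) selects
// element k of the second.
fn transpose4x4(m: [4]@Vector(4, f32)) [4]@Vector(4, f32) {
    const t0 = @shuffle(f32, m[0], m[1], @Vector(4, i32){ 0, -1, 1, -2 }); // a0 b0 a1 b1
    const t1 = @shuffle(f32, m[0], m[1], @Vector(4, i32){ 2, -3, 3, -4 }); // a2 b2 a3 b3
    const t2 = @shuffle(f32, m[2], m[3], @Vector(4, i32){ 0, -1, 1, -2 }); // c0 d0 c1 d1
    const t3 = @shuffle(f32, m[2], m[3], @Vector(4, i32){ 2, -3, 3, -4 }); // c2 d2 c3 d3
    return .{
        @shuffle(f32, t0, t2, @Vector(4, i32){ 0, 1, -1, -2 }), // a0 b0 c0 d0
        @shuffle(f32, t0, t2, @Vector(4, i32){ 2, 3, -3, -4 }), // a1 b1 c1 d1
        @shuffle(f32, t1, t3, @Vector(4, i32){ 0, 1, -1, -2 }), // a2 b2 c2 d2
        @shuffle(f32, t1, t3, @Vector(4, i32){ 2, 3, -3, -4 }), // a3 b3 c3 d3
    };
}

pub fn main() void {
    const m = [4]@Vector(4, f32){
        .{ 0, 1, 2, 3 },
        .{ 4, 5, 6, 7 },
        .{ 8, 9, 10, 11 },
        .{ 12, 13, 14, 15 },
    };
    const t = transpose4x4(m);
    std.debug.assert(t[1][0] == 1 and t[1][3] == 13);
}
```

Whether the backend then folds those eight shuffles into `unpcklps`/`unpckhps` + `movlhps`/`movhlps` is, as noted above, still up to LLVM and the optimization mode.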

andrewrk commented 11 months ago

> Have there been any thoughts already here around runtime switching of CPU SIMD feature sets? I.e., instead of compiling for a single instruction set (AVX2, AVX-512, SSE3, SSSE3, etc.), allowing compiling for multiple and, at runtime, choosing a branch that uses the latest and/or most efficient supported instruction set where reasonable?

@slimsag yeah that's #1018
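For what it's worth, the dispatch half of that idea can be sketched today with an ordinary function pointer. Everything below is my own illustration, and note that `builtin.cpu` describes the compile target, so a real multi-versioning scheme would still need a genuine runtime query (e.g. CPUID on x86):

```zig
const std = @import("std");
const builtin = @import("builtin");

const AddFn = *const fn ([]f32, []const f32, []const f32) void;

fn addScalar(dst: []f32, a: []const f32, b: []const f32) void {
    for (dst, a, b) |*d, x, y| d.* = x + y;
}

// Stand-in for a variant compiled with AVX2 enabled; a real scheme
// would build one body per feature level.
fn addAvx2(dst: []f32, a: []const f32, b: []const f32) void {
    addScalar(dst, a, b);
}

// Pick an implementation once. builtin.cpu.features reflects the
// *compile* target, not the machine the binary lands on; true
// runtime dispatch needs a runtime check (e.g. CPUID).
fn pickAddImpl() AddFn {
    if (builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .avx2))
    {
        return addAvx2;
    }
    return addScalar;
}

pub fn main() void {
    const add = pickAddImpl();
    var dst: [4]f32 = undefined;
    add(&dst, &[_]f32{ 1, 2, 3, 4 }, &[_]f32{ 4, 3, 2, 1 });
    std.debug.assert(dst[0] == 5 and dst[3] == 5);
}
```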