tiehuis opened this issue 6 years ago
@vector(N, bool) doesn't have and/or defined, nor &/|, which makes boolean vectors questionably useful.
You can @bitCast(@vector(N, u1), your_bool_vector) and do whatever you want with the vector.
An interesting test for such an API would be whether one can implement useful artifacts beyond number crunching, like high-speed UTF-8 validation or base64 encoding/decoding.
A related issue is that instruction sets are evolving. For example, the latest AWS Graviton nodes support SVE/SVE2, and the most powerful AWS nodes support a full range of AVX-512 instruction sets (up to VBMI2).
If you build something that is unable to benefit from SVE2 or advanced AVX-512 instructions, then it might not be future proof.
I agree emphatically with @lemire's comment above.
Even for current fixed-pattern byte shuffling with @shuffle, the resulting assembly seems quite bad, and I'm not sure what to write to get a SIMD load or store. I ported a 4x4 transpose to use @shuffle today here: https://godbolt.org/z/j584eWsx6. I think it should be 4 loads, 8 instructions to do the transpose, and 4 stores, plus whatever other instructions the calling convention requires. Every part of the function is a lot bigger than that :(
The "correct" output for this function would be more like this.
It gets a lot better with -O ReleaseFast (-Drelease-fast=true is for build.zig files), cf. https://godbolt.org/z/d6YvTfYGj
Has there been thoughts already here around runtime switching of CPU SIMD feature sets? i.e. instead of compiling for a single instruction set (AVX2, AVX-512, SSE3, SSSE3 etc.) allowing compiling for multiple and, at runtime, choosing a branch that uses the latest and/or most efficient supported instruction set where reasonable?
@slimsag yeah that's #1018
Current Progress
SIMD is very useful for fast processing of data, and given Zig's goal of going fast, I think we need to look at exposing some way of using these instructions easily and reliably.
Status-Quo
Inline Assembly
It is possible to do SIMD in inline assembly as-is. This is a bit cumbersome though, and I think we should strive to make these performance gains achievable in the Zig language itself.
Rely on the Optimizer
The optimizer is good, and comptime unrolling support helps a lot, but it doesn't guarantee that any specific code will be vectorized. You are at the mercy of LLVM, and you don't want to see your code take a huge performance hit simply due to a compiler upgrade/change.
LLVM Vector Intrinsics
LLVM supports vector types as first-class objects in its IR. These correspond to SIMD instructions. This provides the bulk of the work; for us, we simply need to expose a way to construct these vector types. This would be analogous to the __attribute__((vector_size(N))) builtin found in C compilers.

If anyone has any thoughts on the implementation and/or usage, that would be great, since I'm not very familiar with how these are exposed by LLVM. It would be great to get some discussion going in this area, since I'm sure people would like to be able to match the performance of C in all areas with Zig.