ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Add SIMD Support #903

Open tiehuis opened 6 years ago

tiehuis commented 6 years ago

Current Progress


SIMD is very useful for fast processing of data, and given Zig's goal of being fast, I think we need to look at exposing some way of using these instructions easily and reliably.

Status-Quo

Inline Assembly

It is possible to do SIMD in inline assembly as is. This is a bit cumbersome though, and I think we should strive to be able to get these speedups in the Zig language itself.

Rely on the Optimizer

The optimizer is good, and comptime unrolling support helps a lot, but it doesn't provide guarantees that any specific code will be vectorized. You are at the mercy of LLVM, and you don't want to see your code take a huge hit in performance simply due to a compiler upgrade/change.
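For example, whether a plain loop like the following actually gets vectorized is entirely up to the optimizer (an illustrative sketch):

fn addSlices(a: []const f32, b: []const f32, out: []f32) void {
    var i: usize = 0;
    while (i < out.len) : (i += 1) {
        // LLVM may auto-vectorize this loop in release modes, but nothing in
        // the language guarantees that it will.
        out[i] = a[i] + b[i];
    }
}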

LLVM Vector Intrinsics

LLVM supports vector types as first-class objects in its IR. These correspond to SIMD instructions. This provides the bulk of the work; for us, we simply need to expose a way to construct these vector types. This would be analogous to the __attribute__((vector_size(N))) extension found in C compilers.


If anyone has any thoughts on the implementation and/or usage, that would be great, since I'm not very familiar with how these are exposed by LLVM. It would be great to get some discussion going in this area, since I'm sure people would like to be able to match the performance of C in all areas with Zig.

abique commented 6 years ago

I think relying on the compiler vector type is a good solution. Both LLVM and GCC have it. If it's not available, you can always fall back to a generic "software" implementation.

Syntax: you need a way to describe a vector type; one idea could be:

const value = <[]> f32 {0, 13, 23, 0.4};

So <[ N_ELTS ]> type would be the bracket style for vectors in this example.

Also, vectors are used essentially for arithmetic, so regular arithmetic operators should work.

Important things:

The standard library should also provide SIMD versions of cos, sin, exp and so on.
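As a strawman, such a function could fall back to a per-lane loop when no hardware version exists (a hypothetical sketch; the vsin name and the comptime-unrolled loop are illustrative, and later compiler versions may simply accept vectors in @sin directly):

fn vsin(comptime n: comptime_int, v: @Vector(n, f32)) @Vector(n, f32) {
    var out: [n]f32 = undefined;
    comptime var i = 0;
    inline while (i < n) : (i += 1) {
        // scalar fallback; a real implementation would use SIMD instructions
        out[i] = @sin(v[i]);
    }
    return out; // implicit array-to-vector cast
}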

andrewrk commented 6 years ago

How about adding operators for arrays? Example:

const std = @import("std");

test "simd" {
    var a = [4]i32{1, 2, 3, 4};
    var b = [4]i32{5, 6, 7, 8};
    var c = a + b;
    std.debug.assert(std.mem.eql(i32, c[0..], [4]i32{6, 8, 10, 12} ));
}

This would codegen to using vectors in LLVM.

abique commented 6 years ago

I believe you'll find out that using arrays for "SIMD vectors" introduces more problems than solutions, and that's why LLVM and GCC went a different way.

The first thing is that they might have different alignment requirements. Plus, those vectors are supposed to be stored in a single register in the end, so you might want to codegen differently depending on whether it's a vector or an array.

I also worked on a private DSL, and we had the distinction between vectors and arrays in the type system, and it was fine as far as I can tell. The vector type also provides useful information during semantic analysis, and you see what you get. Otherwise you have some array magic, which is exactly the kind of thing people want to avoid when switching to your new language, right?

abique commented 6 years ago

https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

andrewrk commented 6 years ago

I think you're right - the simplest thing for everyone is to introduce a new vector primitive type and have it map exactly to the LLVM type.

BraedonWooding commented 6 years ago

Also keep in mind the rsqrtss instruction and others, which are seriously fast on systems that support them, showing speed increases of up to 10x. These articles demonstrate some of the differences well: http://assemblyrequired.crashworks.org/timing-square-root/ and http://adrianboeing.blogspot.com.au/2009/10/timing-square-root-on-gpu.html.

We should aim to utilise these faster instructions when we can.

lmb commented 6 years ago

I just stumbled on this. There is a blog post series by a (former?) Intel engineer who designed a compiler for a vectorized language: http://pharr.org/matt/blog/2018/04/18/ispc-origins.html. At the least it's an interesting read, but maybe good inspiration as well.

abique commented 6 years ago

Dense and interesting articles!

BarabasGitHub commented 6 years ago

One thing to keep in mind here is that even though you can vectorize scalar code, there are a lot of operations supported by SIMD instructions that you can't express in 'normal' scalar code, such as creating bit masks from floating-point comparisons to later use in bitwise operations (often to avoid branches). Plus there are integer operations that widen to larger integers, and other special stuff.

The series of articles linked by @lmb also shows well what the difference can be between code/compiler that's designed for SIMD and code/compiler that isn't.

andrewrk commented 5 years ago

In the above commit I introduced the @Vector(len, ElemType) builtin to create vector types, and then I implemented addition (but I didn't make a test yet, hence the box is unchecked). So the effort here is started. Here is what I believe is left to do:

No mixing vector/scalar support. Instead you will use @splat(N, x) to create a vector of N elements from a scalar value x. The reasoning for this is that it more closely matches the LLVM IR. So, for example, multiplication by a scalar would be:

fn vecMulScalar(v: @Vector(10, i32), x: i32) @Vector(10, i32) {
    return v * @splat(10, x);
}
abique commented 5 years ago

The syntax looks ugly, but if it works as well as the LLVM builtin vectors, then it is fine! ;-)

Thank you, and don't forget the shuffle vector!

abique commented 5 years ago

What do you think of v10i32 ?

andrewrk commented 5 years ago

What do you think of v10i32 ?

A few things:

Please do feel free to propose syntax for a vector type. What's been proposed so far:

This syntax hasn't been rejected; I'm simply avoiding the syntax question until the feature is done since it's the easiest thing to change at the very end.

abique commented 5 years ago

Why would you want a vector of pointers? Can you do a vector load from that? Would that even be efficient? Do you want people to do vectorized pointer arithmetic? :octopus:

I'd go with the v4f32 style! People will really enjoy writing SIMD with that style. But of course it does not work with templates... :) So it might need a more verbose type declaration indeed.

andrewrk commented 5 years ago

Why would you want a vector of pointers?

Mainly, because LLVM IR supports it, and they're usually pretty good about representing what hardware generally supports. We don't automatically do everything LLVM does, but it's a good null hypothesis.

Can you do a vector load from that?

Yes you can, which yields a vector. So for example you could have a vector of 4 pointers to a struct, and then obtain a vector of 4 floats which are their fields:

const Point = struct {x: f32, y: f32};
fn multiPointMagnitude(points: @Vector(4, *Point)) @Vector(4, f32) {
    return @sqrt(points.x * points.x + points.y * points.y);
}

It's planned for this code to work verbatim once this issue is closed.

Not only can you do vector loads and vector stores from vectors of pointers, you can also do @maskedGather, @maskedScatter, and more. See the LLVM LangRef links in the comment above for explanations.

travisstaloch commented 5 years ago

How are we supposed to initialize a vector? I couldn't find an example in the newest code. Or is this not implemented yet?

For example, the following doesn't work:

test "initialize vector" {
    const V4i32 = @Vector(4, i32);
    var v: V4i32 = []i32{ 0, 1, 2, 3 };
}
andrewrk commented 5 years ago

Your example is planned to work. That's the checkbox above labeled "implicit array to vector cast".

andrewrk commented 5 years ago

@travisstaloch the array <-> vector casts work now. Here's the passing test case:

https://github.com/ziglang/zig/blob/8c6fa982cd0a02775264b616c37da9907cc603bb/test/stage1/behavior/vector.zig#L4-L19

bugabinga commented 5 years ago

Another thing to consider: should Zig abstract over the SIMD vector width, given all the different processors out there?

 test "implicit array to vector and vector to array" { 
     const S = struct { 
         fn doTheTest() void { 
             const w: usize = @MaxSimdWidth();
             var v: @Vector(w, i32) = undefined; 
             for (v) |*v_elem, v_idx| {
                  v_elem = pickSomeScalarFromVPool(v_idx); 
             }
             var x: @Vector(w, i32) = undefined; 
             for (x) |*x_elem, x_idx| {
                  x_elem = pickSomeScalarFromXPool(x_idx); 
             }
             v +%= x; 
             const result: [w]i32 = v; 
             for (result) |*r, i| {
                 assertOrPanic(r == (pickSomeScalarFromVPool(i) + pickSomeScalarFromXPool(i))); 
             }
         } 
     }; 
     S.doTheTest(); 
     comptime S.doTheTest(); 
} 

Pros:

Cons:

shawnl commented 5 years ago

There is also the need for doing branches on vectors. Every regular comparison operator can suddenly return three states: none, some, or all. However, if comparisons on vectors are made non-commutative, then this can just be two states, where the left side is any and the right side is all. If comparing two vectors for full all is desired, then you would have to specify the comparison twice with the operands reversed (even for equality).

I don't see any other way to do this succinctly.

shawnl commented 5 years ago

@bugabinga LLVM already does that: https://simd.godbolt.org/z/k2-thC

bugabinga commented 5 years ago

@shawnl pretty fascinating. Godbolt is as well!

https://simd.godbolt.org/z/zCj88E

No idea if this runs correctly, but it compiles ;) Check out the output when changing the vector width.

According to https://clang.llvm.org/docs/LanguageExtensions.html#vector-operations, comparison operators are supported. What do you make of this?

shawnl commented 5 years ago

comparison operators are supported.

Yeah I'm working on that.

* [ ] Implicit sign-extending widening conversion (to either signed or unsigned) from vectors of bool (which are returned by comparisons). This use case is better met by a vec_sel.

I don't really like @maskedExpandLoad and @maskedCompressStore. While these match CPU instructions, they combine many disparate concepts inelegantly. The LLVM type system means it kind of has to be done this way, but that doesn't make it any prettier.

dimenus commented 5 years ago

What about horizontal add/mul?

eg: http://llvm.org/docs/LangRef.html#llvm-experimental-vector-reduce-v2-fadd-intrinsic

shawnl commented 5 years ago

@dimenus Those all look like very specialized instructions, and we should try to let the optimizer use them from more generic code. That said, my own all(), any(), and none() could be implemented and expressed with a horizontal bitwise AND and OR. Also, those are marked experimental, which means that LLVM does not consider them stable.

@dimenus Horizontal operators might be more elegant, because they would be no-ops on scalars. But I can't think of a good syntax right now. Maybe

[a +]
[a *]
[a &]
[a |]
[a ^]

(with or without the space)

Then any() becomes [a |], all() becomes [a &], and none() becomes ![a |]. Those English keywords certainly only keep their simplicity with vectors of bools, but this suggested syntax is more flexible and results in a simpler language. Also, I will need to play with the fmul instruction, cause that one looks unique.

Alternatively, maybe the operators could become function pointers, and we could pass them to a built-in vector function reduce(), like so:

a.reduce(+)
a.reduce(*)
a.reduce(&)
a.reduce(|)
a.reduce(^)

This is much more flexible, and I kinda like it.
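For comparison, a sketch of how any() and all() fall out of a reduce-style builtin (written against roughly the @reduce(.Op, v) form that later compiler versions expose; the helper names here are illustrative, not settled API):

const std = @import("std");

fn any(v: anytype) bool {
    return @reduce(.Or, v); // horizontal OR over a @Vector(N, bool)
}

fn all(v: anytype) bool {
    return @reduce(.And, v); // horizontal AND over a @Vector(N, bool)
}

test "any/all as horizontal reductions" {
    const a: @Vector(4, i32) = [4]i32{ 1, 2, 3, 4 };
    const b: @Vector(4, i32) = [4]i32{ 1, 0, 3, 4 };
    try std.testing.expect(any(a == b));
    try std.testing.expect(!all(a == b));
}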

shawnl commented 5 years ago

I'm debating if all() should be an implicit cast to bool from vector bool (and then none() would also be eliminated as its inverse). This would make == and != on vectors work as expected. It would not be what would be expected if we had implicit cast from int, but zig does not have that. any() would be retained. Related: #2698

abique commented 5 years ago

I've been using https://github.com/p12tic/libsimdpp/ recently and I like it a lot, have a look :-)

shawnl commented 5 years ago

@abique The problem with that approach is that the compiler can't optimize it. However, I appreciate input from those with SIMD experience.

JesseRMeyer commented 4 years ago

Why would you want a vector of pointers?

Mainly, because LLVM IR supports it, and they're usually pretty good about representing what hardware generally supports

Modern x86_64 hardware can speculatively prefetch a batch of pointers in an array. It's easy to perform the experiment yourself by comparing access times of arrays of pointers vs a linked list. Refer to Fabian Giesen's whirlwind tour of dataflow graphs for further information: https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/

It's not an instruction that achieves this. It's built-in hardware behavior.

data-man commented 4 years ago

len property on the type

Current workaround:

pub fn len(v: var) usize {
    return @typeInfo(@TypeOf(v)).Vector.len;
    // or: return @sizeOf(@TypeOf(v)) / @sizeOf(@TypeOf(v[0]));
}
data-man commented 4 years ago

len property on the type

#3941 (https://github.com/ziglang/zig/pull/3941)

std.format

#3904 (https://github.com/ziglang/zig/pull/3904)

shawnl commented 4 years ago

My patch set had the len property working.


shawnl commented 4 years ago

I propose that we use optional types in vectors to model LLVM vector predication. This would use a second predication vector, where false means null and true means it has a value. If the vector is asserted as non-null, then if it contains nulls it is not undefined behavior, but results in undefined values. (To use this with pointers you would have to use an allowzero pointer, otherwise null is the value 0.)

Also needed would be a syntax for a "merge" operation to combine a vector and a vector of bools into an optional (predicated) vector, but that is not immediately needed. I think this would also be a sensible way to approach closures, where a function pointer that has a closure type signature must be merged with a closure frame before it can be called.

BarabasGitHub commented 4 years ago

I have a question about vectors of bool and comparisons in general.

For example, I wrote this code in C++ using the equivalent of @Vector(4, f32):

        auto distance = ...
        auto greather_than_mask = _mm_cmpgt_ps(distance, _mm_setzero_ps());
        if (_mm_movemask_ps(greather_than_mask) != 0)
        {
            distance = _mm_and_ps(distance, greather_than_mask);
            ...
        }

The greater_than_mask variable here is still just a vector of f32, just with all bits set, which is then used to zero out any elements I don't want. It also does a higher-level check whether we have to do anything at all, by checking if any of the 'sign bits' are set (or rather, whether anything was greater than zero).

How would that translate to Zig? What is @Vector(4, bool)? Is it 4 bits, 4 bytes, 4x32 bits, or even something else? I would like to be able to write something such as:

        var distance = ...
        const greather_than_mask = distance > @splat(4, @as(f32, 0));
        if (any(greather_than_mask)) {
            distance = distance & greather_than_mask;
            ...
        }
shawnl commented 4 years ago

That is pretty much exactly the way my patch series worked (with any() implemented in Zig on top of the comparison result, which is bool, and I'm pretty sure that part got merged), which you can check out. Only a small part of it ever got merged. f32 cannot have all bits set, as that makes no sense according to IEEE 754; it is really an i1 sign-extended to 32 bits.

BarabasGitHub commented 4 years ago

That is pretty much exactly the way my patch series worked

Cool! 👍

f32 cannot have all bits set, as that makes no sense according to IEEE 754

I know it doesn't make sense for math, but that is how these operations work. And it makes sense if you use them as a mask for bitwise operations later, which is their main purpose. Converting it to some other intermediate data type seems wasteful. I don't know what a good way to handle it is. One would be to have multiple bool types, such as Agner Fog has in his VectorClass library, but I can see that being cumbersome. It probably is the most efficient though. Something to think about I guess. 🤔

shawnl commented 4 years ago

It is only the type. The data, and what the computer does, are exactly what you are describing. (x86 is a bit different with mask registers.) That is what static typing is.


shawnl commented 4 years ago

One would be to have multiple bool types such as Agner Fog has in his VectorClass library (https://github.com/vectorclass/version2), but I can see that being cumbersome. It probably is the most efficient though.

AHH yes. LLVM does this for us.


BarabasGitHub commented 4 years ago

Thanks. I didn't know. That's cool. Compilers are amazing sometimes. :)

Snektron commented 4 years ago

I would also like to request @boolToInt be added to the list of operations defined here: https://github.com/ziglang/zig/issues/903#issuecomment-459508820

Furthermore, I think it would maybe be useful to have a select/blend operation:

@blend(comptime E: type, a: std.meta.Vector(len, E), b: std.meta.Vector(len, E), mask: std.meta.Vector(len, bool)) std.meta.Vector(len, E)

which selects elements from a and b based on the corresponding value of the mask for each channel. Note that LLVM provides an intrinsic for this as well: https://llvm.org/docs/LangRef.html#select-instruction.

shawnl commented 4 years ago

You can implement this today in Zig with the shuffle operation.


Snektron commented 4 years ago

I think that would require first creating a vector with indices, and then conditionally applying a binary not on each index depending on the bool mask. To construct that you would either need to do a similar blend/select operation, or iterate over the vector channels and perform the operation yourself.

data-man commented 4 years ago

clang/docs/MatrixTypes
clang/test/SemaCXX/matrix-type.cpp
clang/test/SemaCXX/matrix-type-operators.cpp
clang/test/SemaCXX/matrix-type-builtins.cpp

tadeokondrak commented 4 years ago

I think we should add builtin syntax like i32x4 or f64x4. It would be sufficient for almost all uses, and pointers can still be used with std.meta.Vector.

abique commented 4 years ago

I think we should add builtin syntax like i32x4 or f64x4. It would be sufficient for almost all uses, and pointers can still be used with std.meta.Vector.

I totally agree.

ghost commented 3 years ago

@shuffle requires the mask to be comptime, which means the branch must be known at comptime. @select will allow this to be done at runtime: http://llvm.org/docs/LangRef.html#select-instruction

const c = @select(a < b, a, b);
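
For illustration, a minimal test of a runtime per-lane select (assuming the signature where @select also takes the element type as its first argument; exact forms may differ between versions):

const std = @import("std");

test "runtime per-lane select" {
    const a: @Vector(4, i32) = [4]i32{ 1, 5, 3, 7 };
    const b: @Vector(4, i32) = [4]i32{ 4, 2, 6, 0 };
    const mask = a < b; // @Vector(4, bool), computed at runtime
    const c = @select(i32, mask, a, b); // per-lane minimum of a and b
    const actual: [4]i32 = c;
    try std.testing.expectEqual([4]i32{ 1, 2, 3, 0 }, actual);
}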
shawnl commented 3 years ago

Oh cool! I never realized that @select could be used on vectors.


ghost commented 3 years ago

@floatCast is mentioned as a todo, but @intCast, @truncate, @as, and friends should also be included.

https://zig.godbolt.org/z/9nYcn4

const std = @import("std");

pub fn main() void {
    const v: i32 = 1;
    const a: @Vector(4, i32) = @splat(4, v);
    // These fail due to unexpected types
    const b = @intCast(@Vector(4, i64), a);
    const c = @as(@Vector(4, i64), a);
}
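
Until vector-aware casts land, a per-element workaround is possible by going through an array and relying on the array <-> vector coercion discussed earlier in this thread (a hypothetical sketch, not settled API):

fn widen(v: @Vector(4, i32)) @Vector(4, i64) {
    var out: [4]i64 = undefined;
    comptime var i = 0;
    inline while (i < 4) : (i += 1) {
        out[i] = v[i]; // each lane widens through ordinary integer coercion
    }
    return out; // implicit array-to-vector cast
}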
slimsag commented 3 years ago

Have there been thoughts already here around runtime switching of CPU SIMD feature sets? I.e., instead of compiling for a single instruction set (AVX2, AVX-512, SSE3, SSSE3, etc.), allow compiling for multiple and, at runtime, choose a branch that uses the latest and/or most efficient supported instruction set where reasonable?

ghost commented 3 years ago

@Vector(N, bool) doesn't have and/or defined, nor & and |, making it questionably useful.
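
One possible workaround in the meantime, assuming @select accepts bool lanes (a sketch, not a confirmed approach):

fn vecAnd(comptime n: comptime_int, a: @Vector(n, bool), b: @Vector(n, bool)) @Vector(n, bool) {
    // lane-wise AND: where a is true take b, where a is false keep a (false)
    return @select(bool, a, b, a);
}

fn vecOr(comptime n: comptime_int, a: @Vector(n, bool), b: @Vector(n, bool)) @Vector(n, bool) {
    // lane-wise OR: where a is true keep a (true), where a is false take b
    return @select(bool, a, a, b);
}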