Proposal: Generalise SIMD to arbitrary tensor types, remove footguns in vector syntax

ghost commented 3 years ago

This may be riiiight on the edge of the entropy ceiling, but I think the gains outweigh the losses.

First-class SIMD support is great, but currently it's limited to the case of one-dimensional element-wise operations -- more complex tensor architecture support is on the horizon (and of course GPUs have had it for years), and we don't want to be left in the dust. Furthermore, the premise of fitting a vector into any syntactic role that may be filled by a scalar leads to ambiguity in some cases.

This issue addresses both of these problems.

Isotropic SIMD

I propose an extension to #6771's syntax: allowing multiple indexes between bars, for arbitrary-rank tensors. For instance, a 4x4 matrix of floats is [|4, 4|]f32, and a rendering surface might be [3][|1920, 1080|]u8. Note that these need not correspond to hardware vectors, although when that is possible the compiler will guarantee it -- a tensor is a syntactic construct, in that it allows one written value to represent multiple parallel values. In particular, a scalar will coerce to a tensor populated with copies of it, whereas an array will not. The compiler may choose to represent tensors in whatever rank order it sees fit, to best optimise for hardware SIMD operation -- this is the primary motivation for condensing multiple indices into one functor.

Index Operations

The expressive power of tensors highlights a potential syntactic ambiguity also inherent in regular vectors -- does an operation apply to the tensor as a whole, or to its elements individually? For instance, if locs is a vector of multi-pointers, is locs[3] the fourth element of locs, or a gather vector (#3575)? In either case, how do we represent the other case?

I propose that any index of a tensor will be taken as an index into the tensor, so in the above case locs[3] is unambiguously the fourth element of locs. An index into a tensor must contain the appropriate number of indices. In cases where index operations on SIMD values are desired, some or all indices may simply be left out: locs[][3] is a gather vector, unambiguously. After one index, a tensor behaves syntactically like its child type, regardless of how many indices were actually collapsed -- it is compatible with primitive operators in expressions with scalars or other tensors of the same shape.

By this same method, we may have operations on arbitrary planes of arbitrary-rank tensors: for instance, a[i,] * b[,j] is the vector resulting from pairwise multiplying the elements of row i of matrix a by column j of matrix b. Non-collapsed indices are paired in order of listing. Such planed tensors may be lvalues or rvalues.

Iteration

for shall be expanded to operate on tensors. If an index is captured, all indices must be captured -- that is, if a is a matrix, then for (a) |e| and for (a) |e, i, j| are both allowed, but for (a) |e, i| is not. (Unwanted indices may be ignored with _, as usual.)

const matrixMultiply = fn (comptime T: type, comptime n: usize, a: [|n, n|]T, b: [|n, n|]T) [|n, n|]T {
    var acc: [|n, n|]T = undefined;

    for (acc) |*e, i, j| e.* = @reduce(.Add, a[i,] * b[,j]);

    return acc;
};

const transpose = fn (in: anytype) anytype {
    const Child = @TypeOf(in).child;
    const shape = @TypeOf(in).shape; // `shape` is the array of dimensions

    var result: [|shape[1], shape[0]|]Child = undefined;
    for (result) |*e, i, j| e.* = in[j, i];

    return result;
};

The compiler guarantees that if a for map over a tensor does not mutate external state or call non-inline functions, it will be vectorised along relevant axes (that is, entire rows or columns of the iteration will condense to single machine operations) -- when iteration order is significant, minor indices will be incremented first. If #6965 is accepted, we may abstract over the common case of initialising a tensor with a map. (Crazy idea: such an abstraction could be made rank-polymorphic if we had a way to capture all indices at once -- see https://github.com/ziglang/zig/issues/6965#issuecomment-725394637.)

Gather/Scatter

Any syntactic single index operation (array, multipointer, slice, vector) may take an integer tensor instead of an integer. The result of the operation is a tensor of the same shape as the index, populated by the results of indexing the object with every one of the elements. Gather/scattering a higher rank tensor would lead to ambiguity and is hence not allowed directly -- to do this, first eliminate any other indices: (a[,y,])[b] (the parentheses are necessary to apply the gascer to the whole plane and not its elements).

Coda

There is a danger, in any SIMD formalism, of making things too dense and "magical" (see: APL). I've tried hard to avoid that here -- I believe this proposal is the right balance of concise and expressive, and that it has just the right level of restriction to be legible without requiring a plethora of special cases to do anything useful. One downside is that it is some degree abstracted from the hardware -- however, I believe this is inevitable, given the varying capabilities of SIMD hardware present and future. It will at least be easier to optimise than hand-rolled loops, and in cases where performance comes before portability, intrinsic libraries are still available (#5241 even allows passing vectors directly to assembly).

SpexGuy commented 3 years ago

As I understand it, these are the tradeoffs of this proposal:

If you use tensors instead of arrays:

Gain multi-dimensional for loop
Gain vectorization (maybe)
Give up known memory layout
Give up one of: vectorization, ability to reuse code, ability to have small code
Give up ability to have runtime-known size

I'm not sure that I would ever choose this set of tradeoffs. I think its applications are pretty limited. The only benefit is that the compiler might vectorize your loops if the stars align, but it already does this. Things like divergent control flow break its ability to do that optimization, but this proposal doesn't address how to deal with that either. The need to have comptime-known length severely limits applications.

The applications mentioned in the proposal are

Matrices
- I think #4960 handles this case better. Why do we need to generalize to more than two dimentions here as part of the language?
framebuffers
- Framebuffers almost never have comptime known size. They also usually require a well-defined data layout.
more complex tensor architecture support
- I'm not really sure what this means or how this tensor primitive guarantees we'll be able to take advantage of it. GPUs are a good example of parallel hardware, but we can't just transparently run a loop on the GPU even with these changes. It doesn't address the allocations that need to happen (and can fail), the data transfer, the synchronization concerns, and the interactions with the driver that will require additional program dependencies. Not to mention more global things like device selection. Google's TPU hardware has the same problems. Assuming you're talking about this functionality being integrated into a CPU to remove those concerns, I don't think it's worth introducing features to the language for hardware that might exist someday. If there are any examples of hardware, either planned or on the market, that has tensor operations that can be natively interleaved with CPU operations, we should look at how those are programmed and use that information to motivate the design of this part of the language. Otherwise, "Let's make a really general box with a bunch of restrictions and hope that it will apply to hardware someday" isn't a very persuasive argument to me.
  
  Nvidia's Cuda compiler could be an example of a similar system that benefits from being part of the language, but it has an accompanying library for handling things like GPU memory allocation and data transfer. You can't really use it just by saying "this is a matrix". You also wouldn't want to, because the cost of transferring between CPU and GPU memory is often high, and GPUs tend to execute much slower than CPUs. They just make up for it with very large workloads by being really wide.

The compiler may choose to represent tensors in whatever rank order it sees fit, to best optimise for hardware SIMD operation -- this is the primary motivation for condensing multiple indices into one functor.

I don't know if the compiler really has enough information to do this. It would need to understand the iteration order of every loop you use the vector with, figure out what the optimal storage format is for each, and then somehow use a weighted average to pick the actual order. But the compiler can't really understand which loops are "hot" (important for performance) and which are cold, so it doesn't have a good way to compute the averaging weights. The compiler also doesn't know if the memory is expected to be in cache (optimize for vector construction) or cold (optimize for prefetch). I think it's probably best to leave storage format up to the programmer.

Additionally, if you ever do have to actually hand-roll a loop, it will be extremely important for performance to know how the vector is laid out in order to write the loop order correctly. Assuming the compiler does the loop analysis above, this means that as the programmer you need to pick the layout you want, and then write all of your loops so that the compiler will infer that layout.

You also mention the possibility of passing vectors to assembly, but that's only possible if they have a well-defined data layout.

The compiler guarantees that if a for map over a tensor does not mutate external state or call non-inline functions, it will be vectorized

This restriction doesn't really solve the problem -- divergent control flow can easily break the compiler's ability to vectorize, or make the resulting vector code slower than using scalar code.

It will at least be easier to optimize than hand-rolled loops

I would really like to see something backing up this claim. I don't really see any reason that the matrix multiplication loop you gave in your example is easier for a compiler to optimize than a normal matrix multiplication loop. In fact it might be harder since, for platforms without a reduce instruction (including the default x86_64 target), the compiler has to apply a pretty large set of mathematical identities to realize that this loop can (and should) be written as a series of vector adds instead:

// 4x4 matrix multiply (assuming row-major order)
// these gathers are slow without a hardware gather instruction
// optimizations can sometimes be made when doing all four of them together
const col0 = b.gatherColumnToSimdVector(0);
const col1 = b.gatherColumnToSimdVector(1);
const col2 = b.gatherColumnToSimdVector(2);
const col3 = b.gatherColumnToSimdVector(3);

// each of these is one load instruction
const row0 = a.loadRowAsSimdVector(0);
const row1 = a.loadRowAsSimdVector(1);
const row2 = a.loadRowAsSimdVector(2);
const row3 = a.loadRowAsSimdVector(3);

// inner loop is vectorized
// each of these lines should translate to a multiply, three multiply-adds, and a store.
result.storeRow(0, row0 * col0 + row0 * col1 + row0 * col2 + row0 * col3);
result.storeRow(1, row1 * col0 + row1 * col1 + row1 * col2 + row1 * col3);
result.storeRow(2, row2 * col0 + row2 * col1 + row2 * col2 + row2 * col3);
result.storeRow(3, row3 * col0 + row3 * col1 + row3 * col2 + row3 * col3);

The set of transformations needed to make this optimization are the same whether matrices are represented as tensors or just aligned arrays of floats. I think the approach in #4960 is better - single out the case of matrix multiply and provide intent for that, so that the compiler can best optimize this specific case. Obviously general improvements to the ability to transform and vectorize code are also welcome, but I don't see why improvements for the suggested tensors wouldn't also apply to other value types. The transpose loop you give is the same. You've expressed each element as a function of the original matrix, but array destructuring in the compiler will do this even with a normal transpose loop as long as it can prove no aliasing.

For instance, if locs is a vector of multi-pointers, is locs[3] the fourth element of locs, or a gather vector

This is answered in #5427: If the indexing value is a vector, this is a gather. If it's a scalar, this is an element access. So locs[3] returns [*]T, and locs[@splat(N, @as(usize, 3))] returns Vector(N, T).

Your proposed alternative rule doesn't have separability, which makes it really complicated. What does this do?

const vec: [|4|][*]T = ...;
const unk = vec[]; // what is @TypeOf(unk)?
const wat = unk[3];

Unless you're suggesting introducing some sort of "vector slice" type (which would undermine a lot of the benefits of this proposal), the only type for unk is [|4|][*]T. Which means that vec[][3] is the same thing semantically as vec[3].

Overall I think the abstractions introduced by this proposal could be useful for some types of programming, but I don't see why they need to be part of the language. A library like Eigen or numpy could be written with the language as it stands, and I don't see why the suggestions in this proposal make the optimizer's job any easier.

Some motivating concrete examples (that would make sense in production quality code) where tensors enable optimizations that the compiler couldn't do otherwise could change my mind.

ghost commented 3 years ago

I agree with most of the issues raised by @SpexGuy, but hardware mapping in particular is seriously problematic:

All kinds of external accelerators are off limits because of resource management and failure modes that cannot be hidden from the programmer in a language like Zig.
Scalable Vector Extensions (when they land) are designed specifically to enable vectorization of ordinary loops over ordinary arrays.
Current-generation SIMD support is highly fragmented and full of obscure limitations, so high-level constructs like the proposed tensors have limited power to exploit it.

It would be indeed very nice to have something better than a SIMD register type plus a long list of intrinsics for writing high-performance code, but I don't think that this is it.

ghost commented 3 years ago

Yes, this is very complex, isn't it. Something like this is better done as a library, and maybe tightening vectorisation guarantees in places (#7257 may help, potentially).

ziglang / zig