ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Add packing and unpacking selection operations for Vectors #15837

Open Validark opened 1 year ago

Validark commented 1 year ago

This is a proposal to add @packSelect and @unpackSelect. These operations are analogous to pdep/pext/@extractBits/@depositBits (see: https://github.com/ziglang/zig/issues/14995), except this proposal is for Vectors, not fixed-length bitvectors (i.e. integers).

packSelect

I define @packSelect(mask: @Vector(VEC_SIZE, bool), vector: @Vector(VEC_SIZE, VEC_TYPE)), which packs the elements of vector selected by mask contiguously into the front (low indices) of the result. It's basically like pext, but it operates on vector lanes instead of bits. This is equivalent to VPCOMPRESS on new x86_64 machines. However, even without VPCOMPRESS support, this is a very common operation that can be performed in a wide variety of ways. Here are some stackoverflow questions about it:

https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask
https://stackoverflow.com/questions/28735461/shift-elements-to-the-left-of-a-simd-register-based-on-boolean-mask
https://stackoverflow.com/questions/25074197/compact-avx2-register-so-selected-integers-are-contiguous-according-to-mask
https://stackoverflow.com/questions/7886628/optimizing-array-compaction
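For reference, the intended semantics can be sketched as a plain scalar loop. This is a hypothetical reference implementation, not how a compiler would lower it, and the unselected output lanes are zeroed here purely for readability (the proposal below argues they should be undefined):

```zig
/// Hypothetical scalar reference for the proposed @packSelect: lanes of
/// `vec` whose corresponding `mask` lane is true are packed to the front.
/// Unselected output lanes are zeroed here just for readability.
fn packSelectRef(comptime N: usize, comptime T: type, mask: [N]bool, vec: [N]T) [N]T {
    var out = [_]T{0} ** N;
    var j: usize = 0;
    for (mask) |m, i| {
        if (m) {
            out[j] = vec[i];
            j += 1;
        }
    }
    return out;
}
```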

Motivating example

When parsing a file, we often want to copy some of a file over to a buffer. Let's say we are reading a JSON file into vectors of size 64 and the first 64 characters are "name": "Validark", "is_programmer": true, "favorite_color": "re. Here is this information as Zig code:

const VEC_SIZE = 64;
const buffer = "\"name\": \"Validark\", \"is_programmer\": true, \"favorite_color\": \"re";

Let's say we want to copy all of the characters between quotes into a buffer. For simplicity we are going to assume there are no escaped quotation marks within quoted strings. Here is the simplified inner loop of the first iteration:

const vec: @Vector(VEC_SIZE, u8) = buffer[0..VEC_SIZE].*;

// generate a bitmask where each quote character in the vector corresponds to a 1 in the bitmask
const quote_mask = @bitCast(std.meta.Int(.unsigned, VEC_SIZE), vec == @splat(VEC_SIZE, @as(u8, '"')));

// create a mask where all the characters between quotes and the open quote are marked.
const quoted_and_open_quote_mask = prefix_xor(quote_mask);

// unset (open) quotes so the mask only has the bits between quotes set
const quoted_mask = quoted_and_open_quote_mask & ~quote_mask;

// Pack the quoted strings into the front of the vector
const packed_vector = @packSelect(@bitCast(@Vector(VEC_SIZE, bool), quoted_mask), vec);

// Write the quoted strings into some `chars` buffer.
chars[0..VEC_SIZE].* = packed_vector;
// Make sure to allocate at least VEC_SIZE extra slots in chars!

// Advance the chars slice by how many valid characters we wrote.
chars = chars[@popCount(quoted_mask)..];

Here are the vectors/bitvectors:

             LSB<-------------------------------------------------------->MSB
vec:         "name": "Validark", "is_programmer": true, "favorite_color": "re
quote_mask:  1000010010000000010010000000000000100000000100000000000000100100
quoted_mask: 0111100001111111100001111111111111000000000011111111111111000011
packed_vec:  nameValidarkis_programmerfavorite_colorre.......................
prefix_xor

See this article: https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/

Here is an implementation of it using the @mulCarryless builtin proposed in https://github.com/ziglang/zig/issues/9631:

```zig
/// Given a bitmask, returns a bitmask where every pair of 1's is filled in.
/// The rightmost bit of each pair is set, but the leftmost is unset.
/// e.g.:
/// MSB<-------------------------------------------------------->LSB
/// bitmask: 0000000001000000010000010000010000010000010000000100000100000000
/// returns: 0000000000111111110000001111110000001111110000000011111100000000
fn prefix_xor(bitmask: anytype) @TypeOf(bitmask) {
    const all_ones = std.math.maxInt(@TypeOf(bitmask)) + std.math.minInt(@TypeOf(bitmask));
    return @mulCarryless(bitmask, all_ones);
}
```

Here is another implementation that does not rely on carryless multiply. Hopefully one day LLVM will know this is equivalent when carryless multiply is not supported (or very slow) on a particular machine:

```zig
fn prefix_xor(bitmask: anytype) @TypeOf(bitmask) {
    var x = bitmask;
    inline for (0..(@typeInfo(std.math.Log2Int(@TypeOf(bitmask))).Int.bits)) |i|
        x ^= x << comptime (1 << i);
    return x;
}
```

(no guarantees when passing in integers whose bit widths are not powers of 2)

What about the fill values?

In case it comes in handy for optimization purposes, it might be a good idea to make it undefined what the non-packed values in a packed vector are. Hopefully that also means it is undefined what bytes end up in chars[@popCount(quoted_mask)..VEC_SIZE] in the code above; even writing nothing at all to those bytes should be valid in that case.

x86_64 has a variant which fills the rest of the vector with values from another source and a variant which fills it with zeroes. It could be nice to be able to specify which behavior you want: e.g., one could pass in src, or @splat(VEC_SIZE, @as(u8, 0)), or @splat(VEC_SIZE, @as(u8, undefined)) if you aren't relying on any particular behavior. However, this effect can already be achieved by creating a mask with std.simd.iota(u8, VEC_SIZE) >= @splat(VEC_SIZE, @as(u8, @popCount(quoted_mask))) and then doing a @select to move either src or @splat(VEC_SIZE, @as(u8, 0)) into the right places. Hopefully the optimizer will one day be smart enough to know that this pattern still only needs one VPCOMPRESSB on AVX512_VBMI2 + AVX512VL x86_64 machines.

Alternately, @packSelect could be made to take a scalar value with which to fill the empty spaces, but I think this might lead to someone filling with 0's and then doing a @select which turns all the 0's into the element from src. That would be very bad for the optimizer, because it would have to prove that 0's could not have been selected by @packSelect, which is impossible in cases like the example given above. Hence, I think making it fill with undefined's is the best move.
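Concretely, that @select-based workaround might look like the following sketch. It assumes the proposed @packSelect builtin exists, and `src` is a hypothetical vector supplying the fill values for the tail lanes:

```zig
// Sketch only: @packSelect is the proposed builtin, and `src` is whatever
// vector you want the trailing (non-packed) lanes to be filled from.
const packed_vector = @packSelect(@bitCast(@Vector(VEC_SIZE, bool), quoted_mask), vec);
const count = @popCount(quoted_mask);
// Lanes at index `count` and beyond are the "empty" tail of the packed vector.
const tail_mask = std.simd.iota(u8, VEC_SIZE) >= @splat(VEC_SIZE, @as(u8, count));
// Keep packed lanes; fill the tail from `src` (or a zero splat).
const filled = @select(u8, tail_mask, src, packed_vector);
```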

Other uses

Here are some more problems that can be solved with @packSelect:

http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html
https://lemire.me/blog/2017/04/10/removing-duplicates-from-lists-quickly/

Here is a fun snippet that would print the indices in a vector where tabs occur:

const vec: @Vector(VEC_SIZE, u8) = buf[0..VEC_SIZE].*;
const tabs = vec == @splat(VEC_SIZE, @as(u8, '\t'));
const num_tabs = std.simd.countTrues(tabs); // could also use @popCount with @bitCast
const tab_indices = @packSelect(tabs, std.simd.iota(u8, VEC_SIZE));
for (@as([VEC_SIZE]u8, tab_indices)[0..num_tabs]) |index| {
    std.debug.print("index: {}\n", .{index});
}

Daniel Lemire has some articles on how to efficiently iterate over set bits too. Hopefully code like the above could one day be optimized as well as the C++ code which uses intrinsics:

https://lemire.me/blog/2022/05/10/faster-bitset-decoding-using-intel-avx-512/
https://lemire.me/blog/2022/05/06/fast-bitset-decoding-using-intel-avx-512/
https://lemire.me/blog/2019/05/15/bitset-decoding-on-apples-a12/
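For comparison, the portable scalar approach those articles improve upon is the classic clear-lowest-set-bit loop over a mask (a sketch reusing the quoted_mask bitmask from the earlier example):

```zig
// Classic scalar bitset decoding: repeatedly read and clear the lowest set bit.
var bits = quoted_mask;
while (bits != 0) : (bits &= bits - 1) {
    const index = @ctz(bits);
    std.debug.print("index: {}\n", .{index});
}
```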

unpackSelect

The second part of this proposal is for @unpackSelect, which corresponds to VPEXPAND on x86_64 and is basically like pdep but operates on vector lanes rather than bits. It's the opposite of @packSelect, so in the example above, you could spread the bytes of packed_vec back into the same positions they held in vec by doing @unpackSelect(@bitCast(@Vector(VEC_SIZE, bool), quoted_mask), packed_vec). Again, even without direct VPEXPAND support, this operation can be done in a number of ways on different architectures. Note: I am using the same signature as @packSelect above. Here is a stackoverflow question with one method for accomplishing this operation:

https://stackoverflow.com/questions/48174640/avx2-expand-contiguous-elements-to-a-sparse-vector-based-on-a-condition-like-a
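As with packSelect, the intended semantics can be sketched with a hypothetical scalar reference implementation (again, irrelevant output lanes are zeroed here only for readability):

```zig
/// Hypothetical scalar reference for the proposed @unpackSelect: successive
/// leading lanes of `vec` are scattered to the positions where `mask` is true.
fn unpackSelectRef(comptime N: usize, comptime T: type, mask: [N]bool, vec: [N]T) [N]T {
    var out = [_]T{0} ** N;
    var j: usize = 0;
    for (mask) |m, i| {
        if (m) {
            out[i] = vec[j];
            j += 1;
        }
    }
    return out;
}
```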

Uses

@unpackSelect is useful for a few situations I can think of:

  1. When you want to make more room in a vector
  2. When you want to place successive values in a vector into specific places in another vector
    • e.g. Let's say you have vec1 containing [a, b, c, d, e, f, g, h], and a mask that indicates you want to convert vec1 into [_, b, c, _, e, _, _, h], with each _ replaced by successive values in vec2 which contains [1, 2, 3, 4, 5, 6, 7, 8]. You need to get a vector with [1, _, _, 2, _, 3, 4, _] so a @select can use mask and vec1 to produce [1, b, c, 2, e, 3, 4, h]. The missing vector can be generated with @unpackSelect(@bitCast(@Vector(VEC_SIZE, bool), mask), vec2).
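The merge step in that example might be sketched like this, assuming the proposed @unpackSelect builtin; here fill_mask is a hypothetical name marking the _ positions to be replaced from vec2:

```zig
const vec1: @Vector(8, u8) = .{ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
const vec2: @Vector(8, u8) = .{ 1, 2, 3, 4, 5, 6, 7, 8 };
// true = this lane should be replaced by the next value from vec2
const fill_mask: @Vector(8, bool) = .{ true, false, false, true, false, true, true, false };
// spread = { 1, _, _, 2, _, 3, 4, _ }, with the _ lanes undefined
const spread = @unpackSelect(fill_mask, vec2);
// Merge: take spread where fill_mask is set, vec1 elsewhere -> { 1, b, c, 2, e, 3, 4, h }
const result = @select(u8, fill_mask, spread, vec1);
```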

What about the fill values?

I think the same logic applies to @unpackSelect as @packSelect that it should be undefined what the non-relevant values are.

Other uses

More problems solvable with @unpackSelect/VPEXPAND:

http://0x80.pl/notesen/2022-01-24-avx512vbmi2-varuint.html
https://zeux.io/2022/09/02/vpexpandb-neon-z3/


Bikeshedding welcome.

matu3ba commented 1 year ago

A few questions from my side

Validark commented 1 year ago

A few questions from my side

    1. Could this be simplified so as to not require new keywords, say by using the same keywords as "pdep" and "pext" (@extractBits/@depositBits), or do you see particular semantic problems?

While it is conceivable that we could have an extremely generic @extract and @deposit (extremely imprecise terms I shudder to think about, I think @compress and @expand would be slightly better), there is also a concept of applying pdep/pext to each individual element of a vector. Although it has not seen implementation in hardware, it has been called for by some [1, 2]. While we are talking about analogous operations, I currently think it is necessary to keep them in their own boxes.

    2. How would this work for floats?

I'm not sure what you're asking.

    3. Do you see any usage or semantic problems/conflicts with other proposals, most notably RFC: SIMD spec #9389?

Not really, but I found this interesting:

Recent SIMD architectures have native masks (AVX512, SVE, Risc-V V). However, their interface is not compatible with each other: SVE masks have 1-bit per byte, AVX512 (and maybe Risc-V V) masks have 1 bit per element, and emulated masks (for archs without masks like SSE or Neon) have 1 bit per bit. So it is impossible to have a single mask type for either all vectors of the same size, or the same length. It depends on the target arch.

That is why I believe we need a different mask type for every single vector type (hence the T in @Mask(N, T, ABI)). In theory, we might just need to know the number of bits of the elements and not their actual types, but I think Zig has no problem handling a gazillion types.

Also, we really need to deal with the ABI when defining the type as a @Mask(4, i32, .AVX512) would be a %k register, while a @Mask(4, i32, .SSE2) would be a %xmm register, despite masking in both cases a vector of 4 i32. Also, you might want to write code for different archs in the same binary (for dynamic dispatch for instance).

The layout of a mask is then opaque. The ABI tag can change the actual layout used.

Semantically, a @Mask(N, T, ABI) is the same as a [N] bool or @Vector(N, bool), but optimized for masking a @Vector(N, T, ABI).

Unfortunately I do not know enough about all these architectures to know how to square this circle. However, I agree with the conclusion reached by https://github.com/ziglang/zig/issues/7702 that intrinsics are sometimes necessary to get the full performance out of your CPU and cannot be expressed generically. Not every operation can be cleanly mapped to every architecture: there's no @bitReverse instruction on x86, and there's no 64-bit @popCount instruction on aarch64. AVX512, in contrast, has extensions that allow vectorized @popCounts! On the latest x86 hardware I can do 8 64-bit popCounts in one vector operation, whereas on aarch64 it takes 8 individual 8-bit popCounts summed together to produce a single 64-bit popCount. It's par for the course to have different performance characteristics on different architectures, even when just looking at the support for various instructions.

This proposal indeed adds yet another operation or two that will run better or worse on different architectures. I'm suggesting adding these operations because they are generally useful, common, and difficult or impossible to implement efficiently without being blessed by the compiler. These operations should also be encouraged, in my view: I believe the example code I gave is a compelling strategy that makes for good software.

Although I mentioned a direct mapping to VPCOMPRESS instructions specifically, I currently do not own a machine that supports them, and I read that even on Zen 4 machines that support VPCOMPRESS it is actually more efficient to emulate the instruction, particularly when writing to a memory destination. Some have speculated it got a slow microcode implementation due to a bug. Therefore, this proposal is not merely about getting access to VPCOMPRESS instructions without assembly. It's about adding a vector operation that's useful in general, that might map to any number of different implementations. Of course, I hope the proposed operations are as efficient as possible on each architecture. But again, this is not just an intrinsic call, so it might not be as efficient on some architectures and that's okay.

sharpobject commented 1 year ago

packSelect is a very useful operation, but it seems like it will be difficult to get consensus on an implementation if it's a builtin. Some of the fastest implementations of this operation on targets like AVX2 involve giant lookup tables of tiny lookup tables.

Validark commented 1 year ago

packSelect is a very useful operation, but it seems like it will be difficult to get consensus on an implementation if it's a builtin. Some of the fastest implementations of this operation on targets like AVX2 involve giant lookup tables of tiny lookup tables.

I'd be interested in any links you might be referencing. To me it seems like Zig at the moment should be fine with giant lookup tables in ReleaseFast: it forces the use of O3 anyway, so automatic loop unrolling is done everywhere by default and code size reduction seems like less of a priority in general. Technically the tables would come from the data cache rather than the instruction cache, but I think the same principle applies. It would be nice to have more control over these things one day, though. https://github.com/ziglang/zig/issues/978 might provide a decent way to switch between different implementations of packSelect.

Validark commented 1 year ago

I heard about simdprune but for some reason didn't link it here. It's an implementation that should be considered.

matu3ba commented 1 year ago

Some of the fastest implementations of this operation on targets like AVX2 involve giant lookup tables of tiny lookup tables.

Do you have any numbers, or can you link any representative numbers to extrapolate from?

Validark commented 1 year ago

Some of the fastest implementations of this operation on targets like AVX2 involve giant lookup tables of tiny lookup tables.

Do you have any numbers, or can you link any representative numbers to extrapolate from?

Of Daniel Lemire's implementations, the giant lookup table versions are the fastest in the benchmarks.

https://github.com/lemire/simdprune#how-fast-is-it

The tables file is here.

ominitay commented 1 year ago

If this were to be a builtin, I would argue that it should remain separate from the @depositBits and @extractBits operations. This would line up with the existing pattern of builtins and operators operating in the same way on vectors as on scalars, element-wise.

I'm not 100% sure that this should be a builtin though, considering that it can be implemented in userspace as a library, like Daniel Lemire's implementation. Considering this, I guess I'd want to ask: why should this be implemented as a builtin rather than as a library?

Validark commented 1 year ago

If this were to be a builtin, I would argue that it should remain separate from the @depositBits and @extractBits operations. This would line up with the existing pattern of builtins and operators operating in the same way on vectors as on scalars, element-wise.

This is what I argued for in my above comment.

I'm not 100% sure that this should be a builtin though, considering that it can be implemented in userspace as a library, like Daniel Lemire's implementation. Considering this, I guess I'd want to ask: why should this be implemented as a builtin rather than as a library?

At the moment Zig lacks support for intrinsics, so there is no way to access LLVM's vector facilities except through builtin functions, or by trying to write a function that LLVM can recognize as equivalent to an intrinsic (which does not work for everything you might want). But even once we have intrinsics, I still think these operations belong in the class of fundamental vector operations alongside @select, @shuffle, and @reduce. I don't think you should have to download a library to write code like this. I would even go as far as to say that the examples I gave above should be encouraged by the language.

Additionally, the operations in this proposal could dramatically benefit from some reasoning by the compiler to produce more optimized emit. The semantics of @packSelect are useful across a wide variety of circumstances, but actually optimizing those cases by hand would not be worth it most of the time. If that burden is shifted onto the programmer, few people will even consider that such an operation could exist, and they will instead use more code to accomplish the task less efficiently.

Take a look at the stackoverflow links in my original post. The operations proposed here are currently esoteric: they are not easily doable and not widely known. If Zig supported these operations as builtins, it would open the door for a lot of people to use them, it would be a better solution to the class of problems this solves, and it may even get you a nice performance boost on machines that support the requisite SIMD operations.

ominitay commented 1 year ago

I don't agree with the idea that the language should make something that could be done with the existing vector operations into its own builtin – the standard library is a better place for this, and there's nothing wrong with using packages for whatever isn't in the standard library.

However, I missed your earlier mention of VPCOMPRESS, so on further thought, I think I agree then that this should be a builtin.

Validark commented 1 year ago

https://llvm.org/docs/LangRef.html#masked-vector-expanding-load-and-compressing-store-intrinsics

flyfish30 commented 6 months ago

@Validark I cannot get the packSelect function to work using the following method. Can you provide detailed code?

However, this effect can already be achieved by creating a mask with std.simd.iota(u8, VEC_SIZE) >= @splat(VEC_SIZE, @as(u8, @popCount(quoted_mask))) and then doing a @select to move either src or @splat(VEC_SIZE, @as(u8, 0)) in the right places. Hopefully the optimizer will be smart enough at some point to know that that pattern still only needs 1 VPCOMPRESSB on AVX512_VBMI2 + AVX512VL x86_64 machines.

Below is the Zig code for the packSelect function I wrote according to the above method, but the result is incorrect.

fn vectorLength(comptime VectorType: type) comptime_int {
    return switch (@typeInfo(VectorType)) {
        .Vector => |info| info.len,
        .Array => |info| info.len,
        else => @compileError("Invalid type " ++ @typeName(VectorType)),
    };
}

fn VecChild(comptime T: type) type {
    return std.meta.Child(T);
}

pub fn packSelect(vec: anytype, mask: @Vector(vectorLength(@TypeOf(vec)), bool)) @Vector(vectorLength(@TypeOf(vec)), VecChild(@TypeOf(vec))) {
    const Child = VecChild(@TypeOf(vec));
    const vecLen = comptime vectorLength(@TypeOf(vec));
    const int_mask = @as(std.meta.Int(.unsigned, vecLen), @bitCast(mask));
    std.debug.print("packSelect int_mask is: 0b{b:0>32}\n", .{int_mask});
    const select_mask = std.simd.iota(u8, vecLen) >= @as(@Vector(vecLen, u8), @splat(@as(u8, @popCount(int_mask))));
    return @select(Child, select_mask, vec, @as(@Vector(vecLen, Child), @splat(0)));
}

Could you help me point out what's wrong with the above code?

Validark commented 6 months ago

@flyfish30 That quote was referring to this behavior:

x86_64 has a variant which fills the rest of the vector with values from another source and a variant which fills it with zeroes. ... However, this effect can already be achieved by ...

The best way to do this depends on your target CPU architecture, and at the moment, I have not written an extensive polyfill for packSelect (although I have done so for unpackSelect). What kind of CPU are you targeting?

But if I were you, I would start with https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask. A pext-based solution would be your best bet if you are on an Intel chip, Haswell (2013) or newer, or an AMD Zen 3 machine.

Otherwise, you could try generalizing this solution from that stackoverflow link:

inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
    const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
    const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
    const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);

    __m128 v = val;
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, _mm_castsi128_ps(shiftMask0));
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, _mm_castsi128_ps(shiftMask1));
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, _mm_castsi128_ps(shiftMask2));
    return v;
}

inline __m256d left_pack(__m256d val, __m256i mask) noexcept
{
    const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
    const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
    const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);

    __m256d v = val;
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, _mm256_castsi256_pd(shiftMask0));
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, _mm256_castsi256_pd(shiftMask1));
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, _mm256_castsi256_pd(shiftMask2));

    return v;
}
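Whatever SIMD version you end up with, it helps to have a correct-but-slow baseline to check against. A plain scalar loop gives the intended semantics (a sketch reusing your vectorLength helper; unpacked lanes are zeroed here only for readability):

```zig
pub fn packSelectScalar(vec: anytype, mask: @Vector(vectorLength(@TypeOf(vec)), bool)) @TypeOf(vec) {
    const Child = std.meta.Child(@TypeOf(vec));
    const len = comptime vectorLength(@TypeOf(vec));
    const arr: [len]Child = vec;
    const mask_arr: [len]bool = mask;
    var out = [_]Child{0} ** len;
    var j: usize = 0;
    for (mask_arr) |m, i| {
        if (m) {
            out[j] = arr[i];
            j += 1;
        }
    }
    return out;
}
```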

Another thing that is on my to-study list is this mention in the RISC-V vector ISA: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#158-vector-iota-instruction. For some reason, it explains:

The viota.m instruction can be combined with memory scatter instructions (indexed stores) to perform vector compress functions.

That is weird, because the ISA already has a dedicated vector compress instruction. However, it might be possible to do a SWAR emulation of the routine given there, but I just don't understand what magic the vsuxei32 instruction is doing. I would have to get a better explanation of how that works.

Is this enough information for you?

flyfish30 commented 6 months ago

@Validark Thanks for your help! I am writing a vectorized parallel program that targets multiple CPU platforms, so I want a packSelect function that can be compiled for multiple CPU targets. I found a lot of information about implementing the packSelect function, including a lookup-table method and a pdep/pext-based method. The general solution you gave seems very simple, so I am trying the method you provided.

flyfish30 commented 6 months ago

@Validark I implemented the packSelect function with general SIMD instructions using the table-lookup method. The source code file is at the link below; could you review the code? https://github.com/flyfish30/zig-basic/blob/main/src/pack_select.zig