Proposal (HARD MODE): Bit vector type (bag-of-bits)

ghost commented 3 years ago

Extracted from discussion in #7693 (such an idea has also shown up in other discussions). cc @daurnimator in particular.

There are two things which Zig is currently unable to represent cleanly:

Values with some defined and some undefined bits (can be done with packed struct for known-length values, but is incredibly verbose and does not scale)
Bit rotations (can be emulated with a sequence of shifts and masks, but the optimiser may have trouble recognising such a sequence and reducing it to a single machine instruction)

The second point could be solved with a builtin, but the first goes much deeper, effectively requiring a new type. I propose exactly this, or rather a new type family, in the signed/unsigned integer tradition: bit vector, a bag of bits with no arithmetic structure, written as bXX (vector of XX bits, up to 65535), bsize (vector of {word width} bits), or bbyte (vector of {byte length} bits; see #7693). Such a type may be @bitCasted to/from an integer type of at least/at most the same length, respectively (as an integer type has a canonical method of extension, whereas a bit vector type does not); it will not coerce either way, see below for explanation.

A bit vector may be used with bitwise operators &, |, ^, ~, but may not be used with arithmetic operators +, -, *, /. Shift operators >> and << take an integer on the right and are interpreted as rotations, i.e. bits shifted off one end are shifted onto the other[^2]. The bits of such a type may be defined or undefined independently: (@as(b8, undefined) & 0xf0) & 0x0f evaluates to 0 rather than undefined as would the analogous expression with u8.

A bit vector may also be indexed or sliced, for bit test/set and packed fields, but only with comptime-known indices: bXX[n] (single index) is an assignable bool, bXX[n..m] is an assignable b{m - n}. Concatenation/repetition is also possible[^3], for instance to construct a repeating bit mask; in this case the bit vector operands need not be comptime-known, but the multiplier must. Bits are numbered in integer significance order, that is v[0] is the LSB of @bitCast(usize, v)[^1].

Peer type resolution works reversly from integers: instead of automatically upcasting, a bit vector will automatically downcast; that is, b5 & b3 will produce a b3 (bits are matched by index: (b5 & b3)[0] == b5[0] and b3[0] and so forth). This is because, unlike signed and unsigned integers, there is no obvious way to extend a bit vector[^4].

Real Use Cases?

Currently, we use [*]u8 to represent raw memory, which has two major issues:

(not relevant here; addressed in #7693) It assumes 8-bit bytes, and will break on machines where this is not the case
It operates on the level of bytes: in particular, undefined granularity only extends to the byte level, which causes all kinds of problems for packed data (bit fields, the various uses of PackedIntArray)

Discussions in multiple places have touched on the idea of a bag-of-bits type to address this issue; c'est ici.

But Why A New Feature?

There is no way, no how, no dice to represent a scalable bit-level-defined value in current Zig. It simply cannot be done. Integers are scalable, but only in bytes (unless you want to pack them, into...), and go all undefined together; packed structs can have definition on any level you like, but are incredibly verbose for this use case and not scalable by any means except perhaps @Type jiggery-pokery. This is a tangible, useful feature, exactly suited to Zig's problem domain, and there's just no way to do it without a new feature.

On a deeper level, the use of bit vectors will typically produce code on the order of single machine instructions; working around the lack of such hardware-level features with existing features could perhaps be done, but if each resulting operation only takes 3 steps, that's a 3-fold performance hit. The purpose of Zig is to generate machine code; if there are certain basic machine operations that cannot be represented, that's a deficiency.

Why Not Just Rework `&`, `|`, `^`, `~`?

Because then every integer will have to track defined/undefined bits in safe modes, and we still don't have rotations.

This Proposal Breaks:

Noisily

Code that uses the names bXX, bsize, bbyte as identifiers

Silently

Nothing.

Musings

[^1]: In the first draft of this proposal I defined index order in line with platform endianness, i.e. on a big-endian machine v[0] would instead be the most significant bit of @bitCast(usize, v). This would match packed struct behaviour, and increasing indices would be stored at non-decreasing byte addresses; however there would then be horrific inconsistencies with rotations and concatenations, see below. [^2]: I chose >> and << to be interpreted as rotations as I believe these to be the only meaningful shift operations on non-numeric bit data. The goal of maintaining commutativity with @bitCast as well as consistency with bit indices between targets is the primary motivation to index bits in significance order. [^3]: I was on the fence about including these, but again, they have legitimate use cases and add no language complexity. [^4]: This, together with the lack of a meaningful way for a bit vector value to "overflow", means there is no need for a comptime_bit_vector type.

SpexGuy commented 3 years ago

There is no way, no how, no dice to represent a scalable bit-level-defined value in current Zig. It simply cannot be done.

This is true, but why do you want to do this? Undefined is about eliding copies, and copies are not expensive here.

RogierBrussee commented 3 years ago

See also #7512, #7605 (and possibly #8196)

I like most of this proposal.

I think having rotations and bit operations like inserting and extracting a set of bits based on a bitmap, extension by zero or the last one defined on them (RISCV has put in a lot of effort to come up with a finite set of sensible ones https://github.com/riscv/riscv-bitmanip/blob/master/bitmanip-draft.pdf) is a great idea, but I think they should be @functions like @rotl, @rotr. Using << and >> for rotations seems just confusing and shifting bitmasks is a fairly common thing to do. One could define them efficiently such that << on b32 would have a shift amount would only take the last 5 bits into account, i.e. is effectively an integer mod 32, << on b64 would have a shift amount that only takes the last 6 bits into account, i.e. is effectively an integer mod 64 etc.
Edit: Using the lowest bits for shifting does lose the useful mathematical property that (v << n) << m == (v << (n + m) ) however !!!!!

I think enums based on bXX that can be ored (using |) together make perfect sense.

lemaitre commented 3 years ago

I really like the bit types and I think they are worth it.

However, I have some reserves about >> and << operators. First, we should not change the semantics of well-known operators. Those are shifts inserting zeros (or sign-bit for signed >>).

If we need something else, let's define something else.

There exist a generic shift operation that covers both regular shifts and rotations, but also have some extra use cases: it is a "funnel shift" (term mainly used by Nvidia). It takes two integers (bit bags on this case), concatenate them, shift the new bit bag by the requested amount, and take lower bits.

Here are some strawman syntax to explain: a >>> b >>> n

right shif: 0 >>> x >>> n
right rotate: x >>> x >>> n
left shift: x >>> 0 >>> (32 - n)
left rotate: x >>> x >>> (32 - n)
arithmetic right shift: (x < 0 ? -1 : 0) >>> x >>> n

alinebee commented 3 years ago

What about <<% and >>% for bitwise rotation, to match the wrap-on-overflow syntax for arithmetic operations (+%, *% etc)? These rotation operations could then support integers as well, without changing the meaning of the existing << and >> operators.

topolarity commented 2 years ago

It seems like this issue is working to improve two distinct functionality gaps:

Limited language support for packed vectors/arrays/slices (PackedIntArray provides array and slice support, with some drawbacks)
Limited operations for vectors of bools

I like the motivation for each of these, but I wonder if the first issue would be better solved with broader language support for bit-level alignments and packed arrays/vectors.

For example:

We could support the packed modifier on an arbitrary existing type, so that the above types become [N] packed bool, Vector(N, packed bool), etc.
We could allow a bit-level alignment align(0) to be associated with a type, enabling [N] align(0) bool, Vector(N, align(0) bool), etc.

Either of these would remove the need for PackedIntArray and family.

If we had a natural way to represent a packed boolean vector, then the operations proposed here seem a lot more natural, and as a nice bonus, they don't require introducing any new types.

topolarity commented 2 years ago

To clarify the intended semantics a bit:

packed T would be valid for any primitive T (e.g. bool, u32, u7, f80) and would endow the type with zero natural alignment

This is equivalent to align(0) T, if we could override the natural alignment of a type. On the other hand, attaching alignments to types would also allow over-aligned arrays like [N] align(4) Foo

ziglang / zig

Proposal (HARD MODE): Bit vector type (bag-of-bits) #8388

Real Use Cases?