
Extend slices to enable bit-granularity loads and stores. #2955

Open shawnl opened 5 years ago

shawnl commented 5 years ago

Current proposal


The Zig language does not need the concept of bytes. It is an implementation detail, it is unnecessary baggage and complexity, and it reduces portability. The concept of bytes can be relegated to what @sizeOf() and @alignOf() return, and since we already have the align attribute on pointers, as long as every pointer has an alignment of at least 8 (bits), things mostly stay the same. Here is what changes:

Because @alignOf will always return a number divisible by 8, most things don't change. We also keep @byteSwap, @sliceToBytes, and @bytesToSlice (and they would always refer to u8 types).
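To make the before/after concrete, here is a sketch; the "after" values are my assumptions about how the proposal would be specified, not anything implemented:

const std = @import("std");

// Today (real Zig): sizes and alignments count 8-bit bytes.
test "current byte-based semantics" {
    try std.testing.expect(@sizeOf(u32) == 4);
    try std.testing.expect(@alignOf(u32) == 4);
}

// Under this proposal (hypothetical): the same builtins count bits.
//   @sizeOf(u32)  == 32
//   @alignOf(u32) == 32   (always divisible by 8 on octet machines)
//   @offsetOf(T, "field") returns a bit offset, subsuming @byteOffsetOf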

Out of scope of this proposal:

Arguments for:

Arguments against:

daurnimator commented 5 years ago

Would also remove @byteOffsetOf.

I actually don't mind this idea.

shawnl commented 5 years ago

@daurnimator yeah, it would become @offsetOf and return bits.

daurnimator commented 5 years ago

@daurnimator yeah, it would become @offsetOf and return bits.

We already have @bitOffsetOf

GoogleBot42 commented 5 years ago

Because Zig has the concept of endianness, I think the concept of bytes is essential. For example, if the concept of bytes is removed, what does @byteSwap do? Should it be removed just for the sake of removing the concept of bytes, or is it going to be left in because it is useful? If it gets left in, that seems very inconsistent. If it gets taken out, that removes a useful feature.
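For reference, this is what @byteSwap means in Zig today; its whole definition is in terms of 8-bit bytes (shown with the modern single-argument syntax; at the time of this thread the builtin took an explicit type parameter):

const std = @import("std");

test "@byteSwap is defined in terms of 8-bit bytes" {
    // Reverses the order of the four 8-bit bytes in a u32.
    try std.testing.expect(@byteSwap(@as(u32, 0x11223344)) == 0x44332211);
    // It only accepts integers whose bit width is a multiple of 8.
}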

Furthermore, it is far more common to measure the size of something in bytes, specifically with @sizeOf. It would be really, really annoying and confusing to people coming from C that @sizeOf doesn't even return the size in bytes. Since Zig aims to replace C, I believe this goes strongly against Zig's goals.

If someone needs to know the number of bits, it isn't hard to calculate and is a far less common use case.
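For what it's worth, today's Zig already exposes the bit-granular query directly, which both sidesteps the calculation and shows where bytes * 8 breaks down:

const std = @import("std");

test "bits are not always 8 * bytes" {
    // @bitSizeOf reports the exact bit width...
    try std.testing.expect(@bitSizeOf(u7) == 7);
    // ...while @sizeOf reports whole bytes, so the round-trip loses information.
    try std.testing.expect(@sizeOf(u7) * 8 == 8);
}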

shawnl commented 5 years ago

@GoogleBot42 bytes would remain where necessary, such as @byteSwap().

@sizeOf and friends would be more general and useful if they returned bits (albeit a tiny bit confusing when coming from C). The existence of @bitOffsetOf is already testament to this.
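A small illustration of @bitOffsetOf as it exists today, using a packed struct (whose fields are laid out starting from the least significant bit):

const std = @import("std");

const Header = packed struct {
    version: u4,
    flags: u4,
    length: u8,
};

test "@bitOffsetOf is already bit-granular" {
    try std.testing.expect(@bitOffsetOf(Header, "version") == 0);
    try std.testing.expect(@bitOffsetOf(Header, "flags") == 4);
    try std.testing.expect(@bitOffsetOf(Header, "length") == 8);
}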

Since zig aims to replace c, I believe this goes strongly against zig's goals.

I think having one less feature in the core language reduces the learning curve. This is the main reason Go moved to a self-hosted compiler. Few people actually know all of C.

And if you actually know C, then you would know that C supports byte sizes other than 8. So these C programmers you speak of are not really programming C; they are programming to a specific implementation of C (if you want to be pedantic).

If someone needs to know the number of bits, it isn't hard to calculate and is a far less common use case.

Yeah, but that only works if you have a whole number of bytes. This proposal was motivated by a horrible experience trying to bit-pack when expanding vectors to support @bitCast: https://github.com/ziglang/zig/pull/2945/commits/1fbc26ebe15e19ad1c153cbe8b2d3bf15e2d8931#diff-476daa305f4d991c0f097fc22a1c7ad6L24929 While this proposal does not discuss bit-granularity pointers, it would make them an obvious addition rather than technical debt and complexity if they were added later. (Implementing them on current machines is non-trivial, but their impact on the language would be trivial.)

Also, reducing the dependence on bytes is necessary if Zig is ever to be used for more esoteric programming, as a strong sense of bytes really holds back the general-purposeness of the language.

andrewrk commented 5 years ago

It would help this proposal considerably if there were real-world hardware still in use today that had non-octet-based "words".

shawnl commented 5 years ago

My recent LLVM codegen bug appears to be caused by the lack of bit-granularity addressing: https://bugs.llvm.org/show_bug.cgi?id=42803

shawnl commented 5 years ago

Perhaps we need to port zig to MIX from The Art of Computer Programming. :) 6-bit bytes.

GoogleBot42 commented 5 years ago

Perhaps we need to port zig to MIX from The Art of Computer Programming. :) 6-bit bytes.

Maybe you weren't serious when writing this, but if we are comparing to C, 6-bit bytes aren't allowed. C has specific requirements about chars that make a 6-bit byte impossible in standard C: bytes must have at least 8 bits according to the C99 standard (5.2.4.2.1).

implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign ... number of bits for smallest object that is not a bit-field (byte) CHAR_BIT 8

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

shawnl commented 5 years ago

Zig could support it. u8 values would probably straddle two bytes, with align(12) (2 bytes). @byteSwap, @sliceToBytes, and @bytesToSlice would all refer to the u8 type, as Zig wouldn't actually have the concept of bytes.

shawnl commented 5 years ago

The hardware is FPGAs. Xilinx sells a lot of FPGAs, and has a (proprietary) LLVM-to-Verilog converter.

GoogleBot42 commented 5 years ago

Zig could support it

Only if LLVM does. I couldn't find a list of LLVM backends, but I am doubtful that they support it. Since sub-8-bit bytes don't work in standard C, there is little that would drive development for any byte size under 8 bits in LLVM.

The hardware is FPGAs.

If Xilinx already has a converter... then why is this feature needed? Zig can output LLVM-IR which presumably Xilinx's converter takes.

shawnl commented 5 years ago

If Xilinx already has a converter... then why is this feature needed? Zig can output LLVM-IR which presumably Xilinx's converter takes.

Because you couldn't optimize it right. I am presuming that they use pointer namespaces and actually have some RAM on these machines (but I could be wrong). We need someone that knows more about FPGAs and DSPs.

GoogleBot42 commented 5 years ago

Xilinx has likely invested a lot of money into the converter. Also, since this is a FPGA, there is much, much more flexibility and room to do tricks. I would need more than speculation to believe that it would be slower.

andrewrk commented 5 years ago

Only if LLVM does.

The language specification is not complete (see #75), but Zig the language does not depend on LLVM. LLVM is an implementation detail.

I couldn't find a list of LLVM backends.

You can see a list of backends that zig supports with zig targets.

kyle-github commented 5 years ago

IIRC, there are some DSP chips that use 16-bit words as the smallest addressable unit. That said, the majority of software out there would probably break if these units were not 8 bits.

I think you would run into more people getting confused (at least initially) by having all the sizeof-style builtins return sizes in bits.

What values do you pass to allocators? How do you take a size of a struct and turn that around and figure out what to pass to an allocator? Doesn't that surface the platform allocation unit size into the middle of everything? How would I port code without rewriting all the allocation and size calculation parts?

Maybe I am missing something here... I get the conceptual clarity, but is this really useful or will it be one of those, "pedantically correct, but the UX sucks" kind of things?

I could see matching C's definitions for the same reasons that C does it that way. Wouldn't it be better to have all of the sizing functions return allocation units and then have a nice little @-function that returns the number of bits in an allocation unit?

shawnl commented 5 years ago

What values do you pass to allocators? How do you take a size of a struct and turn that around and figure out what to pass to an allocator?

These are in bits. But often you would be doing @sizeOf(foo) * x.

Doesn't that surface the platform allocation unit size into the middle of everything?

No, because you always get your numbers from @sizeOf() and @alignOf().

How would I port code without rewriting all the allocation and size calculation parts?

You use @sizeOf() and @alignOf(). The main drawback of this proposal is the increase in memory use on 32-bit architectures if you are storing lots of usize values (which would now be u35) for some reason (pointers stay the same size, because their alignment reduces their effective size).
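For context, this is how that plays out in today's (byte-based) Zig: callers typically never handle raw size numbers at all, because the allocator derives them from the type. A sketch against the current std.mem.Allocator interface:

const std = @import("std");

test "sizes flow through @sizeOf, not hand-written constants" {
    const Foo = struct { a: u32, b: u16 };
    const allocator = std.testing.allocator;
    // alloc computes n * @sizeOf(Foo) internally; the unit (bytes today,
    // bits under this proposal) never appears in user code.
    const many = try allocator.alloc(Foo, 10);
    defer allocator.free(many);
    try std.testing.expect(many.len == 10);
}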

return allocation units

Except that they don't. The allocation unit of mmap is the page size (4KB, 64KB, or 2MB), the allocation unit of the stack is usually 16 bytes, the allocation unit of an arbitrary type is @alignOf, and there are plenty of instructions that require natural alignment (this is a big problem with atomic operations). There is not really any single "allocation unit", and if you are compiling for an FPGA (I believe) there is none at all.

I think you would run into more people getting confused (at least initially) by having all sizeofX functions returning sizes in bits.

Users coming from other compiled languages (esp. C), certainly. But I think it would actually be simpler for users coming from scripting languages, which don't have much of a concept of memory, or for users who are new to programming.

GoogleBot42 commented 5 years ago

There is not really any single "allocation unit", and if you are compiling for FPGA (I believe) there is none at all.

FPGAs that have memory usually have memory that works just like memory on other devices. It is common for there to be blocks of memory built into the chip (addressable not by bit but by byte) and/or external memory similar to GPUs. So the concept of bytes still applies to them. I'll grant that FPGAs can more easily hide this if desired. But that doesn't mean that changing Zig to remove the concept of bytes will see any benefit here since the FPGA compilation tools would need to support it anyway.

But I think it would actually be simpler for user's that are coming from scripting languages which don't have much of a concept of memory, or that are new to programming.

Maybe you are right. Although, I don't think that is Zig's target audience. Zig is aiming to replace C, not dynamic languages, where it would never be able to compete anyway because of differing design philosophies. If sight is lost of the target audience, the language can quickly grow into a monster. (Sadly, there are a lot of examples of this...)

shawnl commented 5 years ago

I'm not religious about this proposal or anything. There is a real downside in that usize on 32-bit architectures gets bigger. The use case that led me to this was when the complication of doing bit-granularity loads/stores with optimizations (for @bitCast on vectors) made my code too complicated to use. C does not have odd-sized integer types like Zig does, but to fully use Zig's integer types we would need an option of bit-granularity pointers. If bytes remain part of the language, then such pointers would always seem like a kludge, and I don't think they would have much chance.

kyle-github commented 5 years ago

Having usize on a 32-bit architecture take two words is a problem anywhere, but it seems like a fatal one on exactly the architectures that would plausibly use this: small embedded "weird" chips and FPGAs. The resources on those are very precious. Using 32 bits for an address on an 8-bit processor (well, you could use 24, but modulo-3 arithmetic isn't better) seems like it will result in peasants with pitchforks and torches in the night.

Perhaps there are two conflated things here. One is to have the concept of a bit pointer. I can see the use of that. It would be nice to have a pointer to a u3. The other is using "allocation units" in things like @sizeOf where an allocation unit is the smallest addressable entity directly supported by the CPU.

The first makes tightly packing strange integer sizes nicer (less code on the coder's part). The other makes it a little nicer to support CPUs with 12-bit "bytes" etc.

That said, I still think that:

  1. this is going to be confusing to most people. They are taught that a byte is 8 bits and that all memory is in those units. The x86 and ARM CPUs do not change that world view.
  2. the people that really need it are going to be used to writing special functions anyway because C does not give you bit pointers and often dealing with strange byte sizes means you write a lot of masking and shifting anyway.
  3. the increased pointer size for bit pointers is just as much fun as far pointers were back in the bad old days of segmented memory addressing (sizeof(void *) != sizeof(void far *)), and may be fatal overhead on small systems.

What about instead of a bit pointer, you use a bit slice? A slice is already a struct with several elements. People understand that. Perhaps extending the idea of a slice would work? A slice of []u1? It gives plenty of opportunities for the compiler to use some of the bit field extraction opcodes that are now more and more common.
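For what it's worth, today's standard library approximates this idea in userspace: std.PackedIntArray (and std.PackedIntSlice) store odd-width integers back to back with no padding. Shown here as an illustration of the bit-slice idea, not as part of this proposal, and assuming a recent std version where these types exist:

const std = @import("std");

test "tightly packed u3 values via std.PackedIntArray" {
    // Eight u3 values are stored bit-packed (24 bits of payload)
    // instead of one byte each.
    var arr = std.PackedIntArray(u3, 8).initAllTo(0);
    arr.set(2, 7);
    try std.testing.expect(arr.get(2) == 7);
    try std.testing.expect(arr.get(3) == 0);
}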

shawnl commented 5 years ago

What about instead of a bit pointer, you use a bit slice?

This is a really good idea, and it fits the use case I mentioned. I am going to change this proposal to that once I get a sense of what it would look like. I don't want to ruin the type safety of []u1, so let's say it is []bit for discussion now:

var slice: []bit = @sliceToBits(in_slice);
slice[4] = a_u7; // Writes 7 bits across two bytes: 0b11110000 0b00000111 (r->l bit ordering); this is straightforward.
a_u7 = slice[4]; // This, on the other hand, is a bit funny: it deduces the load width from the value being assigned to. I don't (currently) see a better way of doing it, however.

This is rare enough that it is probably fine if you have to then manually add to the index the number of bits you wrote, despite that being pretty verbose, and lacking the beauty of pointer arithmetic. You could also just write an iterator to handle it.
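To show what slice[4] = a_u7 would save, here is a hypothetical hand-rolled equivalent in today's Zig (writeU7 is an invented name; LSB-first bit ordering within each byte is assumed, matching the layout above):

const std = @import("std");

// Write a u7 into `bytes` starting at bit `bit_index` (LSB-first within
// each byte). The caller must ensure all 7 bits fit in the slice.
fn writeU7(bytes: []u8, bit_index: usize, value: u7) void {
    const i = bit_index / 8;
    const off: u4 = @intCast(bit_index % 8);
    // Position the value and a 7-bit mask within a 16-bit window that
    // covers both potentially affected bytes.
    const wide = @as(u16, value) << off;
    const mask = @as(u16, 0x7f) << off;
    bytes[i] = (bytes[i] & ~@as(u8, @truncate(mask))) | @as(u8, @truncate(wide));
    if (off >= 2) { // 7 bits starting at offset >= 2 spill into the next byte
        bytes[i + 1] = (bytes[i + 1] & ~@as(u8, @truncate(mask >> 8))) | @as(u8, @truncate(wide >> 8));
    }
}

test "writeU7 at bit 4" {
    var buf = [_]u8{ 0, 0 };
    writeU7(&buf, 4, 0x7f);
    // Matches the layout in the comment above: 0b11110000 0b00000111.
    try std.testing.expect(buf[0] == 0b11110000 and buf[1] == 0b00000111);
}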

iamfraggle commented 2 years ago

I was about to create a separate issue for this, but want to check if this issue covers my use case, inspired by day 3 of this year's Advent of Code.

My intended code is something like this:

const ItemType = u12; // My AoC input values were 12 bits wide; I would expect to be able to access irregular-sized types
const items = [_]ItemType{ 0, 284 }; // etc.
var array = [_]u16{0} ** @bitSizeOf(ItemType);

for (items) |item| {
    for (@bitSlice(item)) |bit, num| { // No current syntax for this?
        if (bit) array[num] += 1; // Would a 'bit' be equivalent to a bool? Is u1 equivalent to a bool?
    }
}

At the moment I'm using a while loop to increment a bit number variable and doing manual shifting:

for (items) |item| {
    var bit: u4 = 0;
    while (bit < @bitSizeOf(ItemType)) : (bit += 1) {
        if (item & (@as(ItemType, 1) << bit) != 0) array[bit] += 1;
    }
}

but the shift operator is very particular about the type on the rhs. The compiler will tell you what it should have been, but I always either forget or just get it wrong, so I never get it right the first time. With bit slices, I wouldn't need to know.
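As an aside, the rhs type doesn't have to be guessed today either: std.math.Log2Int computes the exact shift-amount type from the operand type. A small sketch (bitAt is an invented helper name):

const std = @import("std");

const ItemType = u12;

// std.math.Log2Int(u12) is u4, the exact type Zig demands on the rhs of <<.
fn bitAt(item: ItemType, n: std.math.Log2Int(ItemType)) u1 {
    return @truncate(item >> n);
}

test "extract bits without guessing the shift type" {
    try std.testing.expect(bitAt(0b100000000100, 2) == 1);
    try std.testing.expect(bitAt(0b100000000100, 3) == 0);
}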