ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Proposal: `ubyte`/`ibyte` for smallest addressable unit of memory #7693

Open ghost opened 3 years ago

ghost commented 3 years ago

Currently, it is assumed that all hardware uses u8 as the address unit, which is not universally true. However, this assumption is built into the language, and into things fairly central to the language's use, such as @memcpy/@memset and the allocator interface. The language has a concept of address units in align/@sizeOf, but provides no direct interface to it.

I propose the simplest solution: ubyte and ibyte types, representing the smallest addressable memory size and scaling with the platform just like usize does. So, anywhere in the language where "raw memory" is required would use [*]ubyte rather than [*]u8, and be universally portable.

Alternative names: ucell/icell, unit/init (but pls no)
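
To make the shape of the proposal concrete, here is a minimal sketch of a raw-memory routine written against the proposed type (ubyte does not exist, so a u8 alias stands in to make the sketch compile today):

const ubyte = u8; // stand-in: under this proposal, ubyte would be a builtin that scales with the platform

// A raw copy written against the smallest addressable unit instead of assuming 8 bits:
fn rawCopy(dest: [*]ubyte, src: [*]const ubyte, len: usize) void {
    var i: usize = 0;
    while (i < len) : (i += 1) dest[i] = src[i];
}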

kyle-github commented 3 years ago

C has CHAR_BIT, so perhaps something like that? Perhaps a const std.?cpu?.smallest_addressable_unit_bits? I definitely like the idea of being able to determine this in a well-supported way, but I am not sure you need a type for it. All the other types, except usize, are explicitly not platform-specific. This is just @floopfloopfloopfloopfloop's proposal in #5185.
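
A hedged sketch of what such a declaration might look like (the path and name above are only guesses; no such declaration exists in std today):

// Hypothetical, analogous to C's CHAR_BIT; not an actual std declaration:
pub const smallest_addressable_unit_bits = 8; // would be target-dependent; 8 on mainstream hardware

// Code could then assert its assumptions at compile time:
comptime {
    if (smallest_addressable_unit_bits != 8) @compileError("this code assumes 8-bit bytes");
}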

SpexGuy commented 3 years ago

@memcpy and @memset are part of the language, so the language does need to define a type for this. Or we could use *anyopaque for those pointers instead.

ghost commented 3 years ago

Or we could use *anyopaque for those pointers instead.

I want to say this won't work, since anyopaque can't be dereferenced without a cast (a cast to...), but I'm not intimately familiar.

SpexGuy commented 3 years ago

It's fine, since the language spec doesn't need to define the implementation of memcpy/memset, just the header. The fact that you can't implement it in a portable way is unfortunate, but a portable implementation is not required from the language itself; the std lib could fill that gap.

jayschwa commented 3 years ago

In Go, I do like having a distinction between byte and uint8. The latter signals that something mathy is happening, whereas the former is generally just unspecific "data".

One thing I don't understand about this proposal is why there should be two variants for signedness. Isn't having an opinion on the signedness of the value an indication that it should be an integer type?

ghost commented 3 years ago

It is an integer type. That's how we represent memory in Zig. We don't have a bit vector type, and I'm not proposing one.

daurnimator commented 3 years ago

It is an integer type. That's how we represent memory in Zig.

We actually may need to change that; it causes undefined behaviour in hidden and unexpected places. e.g. PackedIntArray is full of undefined behaviour. https://freenode.irclog.whitequark.org/zig/2021-01-02#28727221

03:32 <andrewrk> daurnimator, the dissonance comes from simultaneously trying to treat an integer as a mathematical integer and as a bag of bits
03:32 <andrewrk> there are other places where this dissonance shows up too and I agree it is something to consider

ghost commented 3 years ago

Ok, sure. But that's an orthogonal concept.

daurnimator commented 3 years ago

Ok, sure. But that's an orthogonal concept.

I don't know about that. What if we had a byte type that is just a bag-of-bits of the smallest addressable unit of memory?

ghost commented 3 years ago

We could do the bag-of-bits thing at any size: define bXX/bsize/bbyte types, which can @bitCast to their u/i equivalents, with defined bitwise operations, and where >>/<< are rotations rather than shifts. That may have merit, but again, it's orthogonal to the idea of an address unit type.
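
For contrast with today's semantics, a small test of rotation versus shift using plain u8 and std.math.rotl (the proposed bXX types remain hypothetical):

const std = @import("std");

test "rotate wraps, shift discards" {
    const x: u8 = 0b1000_0001;
    try std.testing.expectEqual(@as(u8, 0b0000_0011), std.math.rotl(u8, x, 1)); // the top bit wraps around to bit 0
    try std.testing.expectEqual(@as(u8, 0b0000_0010), x << 1); // the top bit is simply discarded
}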

ghost commented 3 years ago

@EleanorNB, do you propose that byte should have arbitrary length, so that e.g. 9-bit aligned architectures à la GreenArrays and PDP-10 can be supported? Or is this only about hardware that requires aligned (16 bit, 32 bit, etc.) memory operations?

ghost commented 3 years ago

That is the idea, yes. Also, I'm not entirely clear on the difference between a platform that requires all memory accesses to be aligned to at least some size and a platform with that size byte. They are the same thing, are they not?

ikskuh commented 3 years ago

I think a byte type should have the size of the smallest possible address increment. Thus, on bit-addressable hardware, byte would be 1 bit large; on hardware that can only increment addresses in 16-bit steps, byte should have 16 bits.

If a project assumes that byte has 8 bits, it can always assert that in a comptime block:

comptime {
    if (@bitSizeOf(byte) != 8)
        @compileError("This project requires 8-bit bytes!");
}

ghost commented 3 years ago

Also, I'm not entirely clear on the difference between a platform that requires all memory accesses to be aligned to at least some size and a platform with that size byte. They are the same thing, are they not?

They are. I was just subtly hinting that no additional language features should be necessary to support hardware that has multiple-of-8-bits addressable units :smile:. It would probably be a matter of setting the proper alignment in the allocator interface and taking a minor performance penalty when working with u8 and i8.

The more general case of arbitrarily-sized bytes is a different matter, though. A native byte type would be rather more useful in that case, but I'm not sure it would be useful enough to justify its addition to the language. After all, the problem of odd byte sizes is not limited to allocators and memcpy. Arithmetic and custom data types would also have to be converted to the appropriate integer sizes to generate efficient code.

In other words, either you don't care about performance very much, in which case you simply let the compiler generate the appropriate mask/unmask operations all over the place and get portability for free. Or, you do care about performance, in which case you need to write your program specifically using i7, i14, i28 and whatever else is appropriate. But then you lose portability.

It might be possible to write width-agnostic code (using only byte and isize) that will be clear, correct and performant whether it runs on an 8-bit machine or 7 or 9 or 12 or whatever. But I'm really not sure anybody would bother to write such code, considering that 99.999% of all hardware works with multiples of 8 bits.

kyle-github commented 3 years ago

I like the symmetry of the "bag of bits" data type with the i and u types. I.e. i1, u1, bits1; i2, u2, bits2... i16, u16, bits16...

Going off on a tangent...

It might be nice to extend that symmetry to other bag-o'-bits types. The generic bag of bits above could be a base type to which all the others coerce automatically. The bag-of-bits types, bitsX, would have just the bit-wise boolean operators and shift/rotate. Coercing back up into integer or other typed values would require @bitCast. You can remove type info automatically, but you cannot add it automatically.

var tmp_bits: bits15 = undefined;
var tmp_int: u15 = 42;

tmp_bits = tmp_int; // works by automatic type coercion
tmp_int = tmp_bits; // ERROR: not allowed, even if tmp_bits was set to some valid value
tmp_int = @bitCast(u15, tmp_bits); // yep, you know what you are doing

const tmp_float: f32 = 3.141592;
var tmp_float_bits: bits32 = tmp_float; // yes, fine. Floats are just bags of bits in a register like the others.

Addresses might be another such base type, for pointers. All pointers could coerce automatically down to that type. The base unit would be @EleanorNB's address unit type from this proposal. As above, coercing to a typed pointer would require @bitCast. This replaces void* and many uses of char* in C. It would be usable inside allocators and other code that needs to deal with raw addresses. It could be constrained in size and bit patterns (x86-64) depending on the platform.

And I'll stop there with the tangent.

ghost commented 3 years ago

@kyle-github, your idea may be related to #7512, which proposes to separate signed, unsigned and modular integers from bitstrings. Sure, doing things like that has a certain elegance from the point of view of symmetry and separation of concerns. But what do we really gain from this, other than additional @casts all over the place?

kyle-github commented 3 years ago

EDIT: PEBKAC problem using @bitCast in cases where I meant @intCast. Sigh.

@zzyxyzz thanks for the pointer to the other issue. It seems vaguely related in so far as it proposes new numerical behavior. But that is not what I am thinking about here. This is sort of a half-baked idea, so hopefully the stuff below makes sense!

A key point is that raw bits should have relatively few allowed operations: mostly masking and bit-level operations like and, or, xor. Shifting makes some sense, as does rotation. Arithmetic operations on raw bits make no sense. There is no interpretation of meaning beyond 1, 0 and position for each bit.

As to casts, I think there would be relatively few. The advantages I see are:

- typed operations behave identically across all platforms;
- there is a clean, clearly-marked way to drop down to the raw platform representation.

The first point is pretty much what we have now, but with guarantees about the results of operations across all platforms. That said, the exact operations and guarantees can be carefully chosen to make sure no one is surprised and implementation on common platforms is not overly painful (Java is both a good and bad example here). If I use an i8, I should get exactly the same behavior across all platforms regardless of what the CPU wants to do. It might be costly, but I should get identical results on all platforms.

The second point has two advantages. The first is that you have a way to cleanly drop to the platform representation via automatic down coercion. If you need to get access to the raw bits without Zig applying an interpretation or restrictions on them, it is trivial to do. That should reduce @bitCast usage, not expand it. I should be able to do var tmp: bits64 = 3.141592; and then fiddle with the binary representation of the floating point number as much as I want. The second is that you have a really clear indication in code that you are doing something platform-specific. A bit like being able to search for unsafe in Rust to find the sharp edges.

Quick example: Suppose I am building a VM and I am going to use NaN boxing. In current Zig you would need to @bitCast the f64 to a u64 so that you can apply masking (oh, and this is completely platform-specific since some platforms do stupid things with IEEE floating point and byte ordering). However, since u64 is an unsigned integer, you can also shift it, add to it, divide it by 6 and do all kinds of things that make no sense to a floating point bit string. If you screw up in your code, Zig is going to happily help you pull the trigger as you aim the gun at your foot.

Once you pull out the payload, you will need to either @intCast it again to something like u48 or accept that your cleaned up u64 can have illegal values for your payload. So either you use bit casting or you open yourself to accidental footguns. A True Believer(tm) C coder would of course use u64 and just remember to check, after every operation on the bits, that the result was still a legal payload. Zig can do better, but there is nothing helping you remember to do so.

If you drop it down (without needing a cast) into a raw bit type, then you can mask it and extract the parts you need and then do a single, final @intCast to u48 and treat the payload the way you did before. And you will need to use that @intCast to a u48 to process the data since you cannot do much at all on raw bits. So Zig is helping you by forcing you to be more explicit in how you want to treat that data. You actually lose a call to @bitCast at the beginning and gain a bit of a trigger guard on that set of footguns. You would not be able to accidentally divide the raw bits by 6. You will not be able to use the u48 to generate an illegal payload.
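
A minimal sketch of the flow described above, assuming the payload sits in the low 48 bits (the mask value and function name are illustrative, and the two-argument builtin forms match the ones used elsewhere in this thread):

const PAYLOAD_MASK: u64 = (1 << 48) - 1; // assumed layout: payload in the low 48 bits of the NaN box

fn unbox(value: f64) u48 {
    const bits = @bitCast(u64, value); // reinterpret the float's bit pattern
    // At this point nothing stops `bits / 6` or other nonsense -- the footgun described above.
    return @intCast(u48, bits & PAYLOAD_MASK); // mask the tag away, then do the final narrowing cast
}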

Sorry, I meant to keep this short! I need AA for long-posters :-(

ghost commented 3 years ago

I must admit I still don't get what the advantage of a raw memory type would be.

Quick example: Suppose I am building a VM and I am going to use NaN boxing. In current Zig you would need to @bitCast the f64 to a u64 so that you can apply masking (oh, and this is completely platform-specific since some platforms do stupid things with IEEE floating point and byte ordering).

Do they? I was under the impression that IEEE-754 defines a precise bit encoding. Some architectures are not fully IEEE-754 compliant in other areas, such as rounding modes or handling of subnormals, but I'm not aware of architectures that change the encoding. (I could be wrong about this. Are there any?) Byte order should also not be relevant, since it applies to memory layout and not to in-register values. If you @bitCast an f64 to u64, there shouldn't be any surprises.

However, since u64 is an unsigned integer, you can also shift it, add to it, divide it by 6 and do all kinds of things that make no sense to a floating point bit string. If you screw up in your code, Zig is going to happily help you pull the trigger as you aim the gun at your foot.

If you do bit-twiddling on float values, you are indeed on your own. But once you decide to do it, working with bit strings isn't really much safer than working with unsigned ints. Besides, most bit-twiddling operations require both logical and integer operations. For example, ~extracting~ removing the least-significant set bit is done with (x-1) & x; creating a mask up to the n-th bit is done with (1<<n) - 1, etc. If you segregate arithmetic and bitwise operations into separate types, this will only add unnecessary casts, without adding safety.
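
For concreteness, the two idioms as a quick test (plain Zig, nothing proposal-specific):

const std = @import("std");

test "bit-twiddling idioms" {
    const x: u8 = 0b0110_1100;
    try std.testing.expectEqual(@as(u8, 0b0110_1000), (x - 1) & x); // clears the least-significant set bit
    const n = 5; // comptime-known shift amount
    try std.testing.expectEqual(@as(u8, 0b0001_1111), (@as(u8, 1) << n) - 1); // mask of the low n bits
}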

Once you pull out the payload, you will need to either @bitCast it again to something like u48 or accept that your cleaned up u64 can have illegal values for your payload. So either you use bit casting or you open yourself to accidental footguns. A True Believer(tm) C coder would of course use u64 and just remember to check, after every operation on the bits, that the result was still a legal payload.

What do you mean? Once you extract the payload and cast it to the appropriate type, all's good, no?

kyle-github commented 3 years ago

I must admit I still don't get what the advantage of a raw memory type would be.

Quick example: Suppose I am building a VM and I am going to use NaN boxing. In current Zig you would need to @bitcast the f64 to a u64 so that you can apply masking (oh, and this is completely platform-specific since some platforms do stupid things with IEEE floating point and byte ordering).

Do they? I was under the impression that IEEE-754 defines a precise bit encoding. [ snip ... ] If you @bitCast an f64 to u64, there shouldn't be any surprises.

You would think so, but no. See endianness and floating point in Wikipedia. Some processors use little-endian ints and big-endian floats, or vice versa. YMMV. I have hit this before and it sucks. I think I hit it on MIPS.

However, since u64 is an unsigned integer, you can also shift it, add to it, divide it by 6 and do all kinds of things that make no sense to a floating point bit string. If you screw up in your code, Zig is going to happily help you pull the trigger as you aim the gun at your foot.

If you do bit-twiddling on float values, you are indeed on your own. But once you decide to do it, working with bit strings isn't really much safer than working with unsigned ints. Besides, most bit-twiddling operations require both logical and integer operations. For example, extracting the least-significant set bit is done with (x-1) & x; creating a mask up to the n-th bit is done with (1<<n) - 1, etc. If you segregate arithmetic and bitwise operations into separate types, this will only add unnecessary casts, without adding safety.

I disagree on the safety, but your points about common bit-twiddling are important ones. That said, most of this kind of bit twiddling should be intrinsics or compiler built-ins, as a lot of CPUs now have single instructions for them, and people should not be doing it themselves. If I saw this in new code today I would consider it a code smell.

For instance, a lot of masking is due to a lack of easy bit-field extraction operations. I have seen cases where type punning into bit-field structs to extract fields ends up producing better code. I do some low-level protocol programming right now (not my day job), and I am so trained to do it the way you describe that I do not even notice when the compiler supports a better/higher-level way to do the same thing. Part of the idea here is to make people like me, who have done this too much, think again :-)

Once you pull out the payload, you will need to either @bitcast it again to something like u48 or accept that your cleaned up u64 can have illegal values for your payload. So either you use bit casting or you open yourself to accidental footguns. A True Believer(tm) C coder would of course use u64 and just remember to check, after every operation on the bits, that the result was still a legal payload.

What do you mean? Once you extract the payload and cast it to the appropriate type, all's good, no?

You left off the final sentence of that paragraph:

Zig can do better, but there is nothing helping you remember to do so.

Zig does not help you remember to do that last cast to u48. It is that last part that I am aiming for: make the easiest path the right thing, and make it harder to do the wrong thing. I see a lot of code (to be clear, in other languages) that isn't doing that final @intCast.

Let me make it clear: I am not sure that there is either sufficient benefit to this to justify it or that there is any interest, so I really appreciate the time you are taking to think about this and respond. Your point about bit-twiddling brings up some good cases.

daurnimator commented 3 years ago

But what do we really gain from this, other than additional @casts all over the place?

The key issue I ran into was that if any bit of an integer is undefined, the whole integer is undefined. This means that right now in Zig, @as(u8, undefined) | 0x0f | 0xf0 is an undefined value. This comes up as a bug today in e.g. PackedIntArray. The solution right now is to (instead of u8) use packed struct { @"0": u1, @"1": u1, @"2": u1, @"3": u1, @"4": u1, @"5": u1, @"6": u1, @"7": u1 } and operate on each individual bit. My proposed "bag-of-bits" type would have undefined-ness at the per-bit level.
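
A sketch of that workaround (hypothetical helper; field-by-field assignment keeps undefined-ness per bit instead of poisoning the whole integer):

const Byte = packed struct { @"0": u1, @"1": u1, @"2": u1, @"3": u1, @"4": u1, @"5": u1, @"6": u1, @"7": u1 };

fn setLowNibble(b: *Byte) void {
    b.@"0" = 1; // each bit is assigned individually, so the
    b.@"1" = 1; // untouched high bits stay undefined without
    b.@"2" = 1; // dragging the whole value into undefined-ness
    b.@"3" = 1;
}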

kyle-github commented 3 years ago

Interesting, @daurnimator! The undefined bits would come from things like padding or unused bits in an int that was not a machine word size?

I wonder if the behavior of undefined should continue to be like SQL's NULL. For instance, OR of a 1 against any value is 1, so should OR of 1 against an undefined bit still be undefined? Similarly, AND with 0 will always return 0. Maybe that is too complicated...
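
Modeled explicitly, those SQL-NULL-like rules would look something like this (a hypothetical semantics sketch; Zig's actual undefined does not behave this way):

const Bit = enum { zero, one, undef };

fn orBits(a: Bit, b: Bit) Bit {
    if (a == .one or b == .one) return .one; // 1 | anything == 1, even against an undefined bit
    if (a == .undef or b == .undef) return .undef; // otherwise undefined-ness propagates
    return .zero;
}

fn andBits(a: Bit, b: Bit) Bit {
    if (a == .zero or b == .zero) return .zero; // 0 & anything == 0
    if (a == .undef or b == .undef) return .undef;
    return .one;
}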

ghost commented 3 years ago

Hey everyone, the bag-of-bits idea is *completely* irrelevant to this proposal. Stop discussing it here. Go to #8388, which is specifically about that.

matu3ba commented 2 years ago

@EleanorNB Could you add some examples of platforms that do not use 8 bits as the "smallest addressable memory size", and that are still in use, along with their specific advantages?

I would assume the main blocker for this is the need for non-LLVM backends, as LLVM does not support a non-8-bit char. So any sort of performance guarantees are already impossible.

It would also be useful to estimate how this would affect the complexity and performance of the compiler.

ghost commented 2 years ago

@matu3ba Historically, DEC mainframes were famous for using odd word sizes, such as 9/18/36 or 12 bits. Of hardware still in production, Chuck Moore's GreenArrays 18-bit Forth processor comes to mind. Needless to say, this represents a very small niche. Other than that, local memory on GPUs and other coprocessors may be addressable only in multi-byte chunks.

Additional reading:
[1] Wikipedia: 18-bit computing
[2] RFC 4042: UTF-9 and UTF-18