Proposal: Represent integer endianness in the type system

ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

https://ziglang.org

MIT License

32.16k stars 2.35k forks source link

Proposal: Represent integer endianness in the type system #18796

Open rohlem opened 5 months ago

rohlem commented 5 months ago

(EDIT: Finished with initial editing: Added comments on foreign-endian arithmetic and non-byte-aligned @bitCast-s that I thought of too late.) (EDIT2: Finished (I think) restoring the post (EDIT3: again) after GitHub gave me a stale edit buffer, swallowing over half of my edits. q.q)

The type system is a useful tool for communicating, between programmers as well as to the compiler, the intent of what is meant by data. This can be important within a single code base, but is even more important at interfaces between modules. Clearly denoting where we expect multi-byte integers to be in little- or big-endian format (regardless of the host's endianness) would help discover mismatches.

The basic, minimal version:

Introduce a std.builtin.Endian field to std.builtin.Type.Int and std.builtin.Type.ComptimeInt. Integer types that differ in endianness are distinct, and values do not freely convert between them; a @bitCast is required.

This would already give interfaces (say packed struct-s representing binary formats) the required expressiveness. For minimizing the implementation effort and language change, we can forbid all arithmetic operations on foreign-endian integer types (at least initially)

Further ideas (optional):

Introduce shorthands for referring to these types (suggestion: allow suffixes be and le after current (u/i)N names, so f.e. u16be, i32le). The current type names still exist and alias the host-native variant (with endianness @import("builtin").target.cpu.arch.endian()).
Introduce a builtin @endianCast for integer arguments, which performs a @byteSwap on the input to output the inverse-endian type.
- Since other casts already use the result location type in their behavior, we could also support the input and output type being of same endianness and make it a no-op in that case. I guess this could be useful to shorten generic code.
Extend @intCast and @truncate to work on foreign-endian integer types (output's endianness staying the same as the input's). Seems like they would be useful, and not too difficult to implement.
Allow arithmetic operations on foreign-endian integer types. This would unburden users from frequent @endianCast-s back-and-forth, essentially inserting them behind-the-scenes. Since there's an ever-so-slight performance penalty to the required @byteSwap though, I think we can keep burdening users with explicitly writing out the @endianCast.
- Just for completeness, there's also the compromise option of allowing inputs to be any endianness (even mixing little and big), but always yielding a native-endian results.
We could deduplicate endianness for all integer types with no more bits than a byte, since endianness doesn't affect them.
- (This is similar to a separate topic: The fact that usize != std.meta.Int(.unsigned, @bitSizeOf(usize)) in status-quo, which imo is suboptimal because it may lead to needless function instantiations. But if that's not an obvious drive-by improvement I can open a separate proposal for separate discussion.)

EDIT4: Probably not an issue, but what to do about non-byte-aligned-sized integers?

The occupied bits in the representation of non-byte-aligned-sized integers differ:

On little-endian hosts, u9 occupies only one bit in the low-address byte. (bit numbering: 12345678,XXXXXXX9)
On big-endian hosts,u9 occupies only one bit in the low-address byte. (bit numbering: XXXXXXX9, 12345678)

Originally I thought that @bitCast-s would have to specifically handle this case, however I am no longer convinced this is an issue, because the value's byte-representation is only ever observed in external interfaces (callconv(.C) functions, extern struct/extern union), which currently only allow byte-aligned types, and when accessing memory via pointers, where we already insert bit-padding to fill out @sizeOf(T) bytes. (EDIT5: I previously mixed up bit-shifting and byte-swapping at this point and freaked myself out about pointers. But because the bit-order within a byte isn't observably different between endiannesses, there are no bit-shifts necessary and I see no more issue, even with pointers.)

jayschwa commented 5 months ago

Can you elaborate on a use-case for this? I think it's a best practice to only handle endianness at I/O boundaries (e.g. file or network streams), and store all integers in memory as native endian.

rohlem commented 5 months ago

@jayschwa Repeated from the post above, a use case would be describing a binary format (for example a file header or network protocol) using a packed struct. In status-quo, you can add comments noting endianness, but have to manually make sure your code @byteSwap-s the right fields at the right time (or not, depending on the host architecture). Using endian-aware integer types as fields, the compiler can tell you when accessing these foreign-endian fields incorrectly (and @endianCast would provide a very easy-to-use solution for conversions in these scenarios).

jayschwa commented 5 months ago

So you'd like to be able to:

const Header = packed struct {
    foo: u32le,
    bar: u32le,
};
const header = try reader.readStruct(Header);

versus what is done now:

const Header = struct {
    foo: u32,
    bar: u32,
};
const header = .{
    .foo = try reader.readInt(u32, .little),
    .bar = try reader.readInt(u32, .little),
};

Is that correct?

rohlem commented 5 months ago

@jayschwa Sure, that's one usage example that would be improved. Serializing/writing out a value of the type, and accessing the fields individually would similarly be less prone to bugs.

squeek502 commented 5 months ago

Another use-case would be UTF-16. From https://github.com/ziglang/zig/issues/649#issuecomment-1680235966 (a pointer endianness proposal):

One potential use case for this that I've been running into lately would be UTF-16. Being able to have a []endian(.Little) u16 slice that (1) handles littleToNative/nativeToLittle conversions for you, and (2) allows the endianness of some UTF-16 data to be retained & enforced by the slice itself seems like it'd be quite useful.

In this proposal's case, that'd be []u16le instead.

RossComputerGuy commented 5 months ago

Endian as an attribute alongside callconv is more appealing imo. u32le/u32be, u16le/u16be, and so on do look nice but I don't think would be as flexible. Endian as an attribute could allow for some function created types to change the endian based on an argument.

nektro commented 5 months ago

As someone who's now written a handful of parsers and file format handlers in Zig I agree with jay's original comment that it makes the most sense imo to only deal with endianness at the io boundary and keep integers that in-memory native endian

zzyxyzz commented 5 months ago

Re binary format parsing: the field order of packed structs depends on platform endianness too, so for that to actually work, you'd need a flag to override that as well.

rohlem commented 5 months ago

@RossComputerGuy I'm not sure I understand your point - after https://github.com/ziglang/zig/issues/11834 callconv willl not have comptime arguments in scope anymore, so it will be impossible to decide it on basis of a comptime argument (though you can always add a proxy function as indirection). EDIT: I guess you would be proposing a second endian(x) attribute to functions? And that would only affect argument types? And maybe insert a single auto-conversion? With endian-aware argument types you can start with endianness as comptime argument and decide further argument types based on that. You'd also be allowed to have argument types from different endianness - for example a payload that is written somewhere can be in the intended target endianness and all other arguments can be host-native.

@nektro Regarding the repeated "only at I/O" boundary comment - the purpose of this proposal is exactly to make it easier to express endianness at these boundaries. To me the most logical place to document this would be in the type system, not just doc comments that you hope people get right. If you can think of a better alternative, or can think of some downside of implementing it this way, please share. packed struct themselves are similarly most useful at binary interface boundaries, and still in the language's type system to enable this expressiveness.

@zzyxyzz What packed struct does is order fields from the first field starting in the lowest-value bits to the last field in the highest-value bits of the underlying backing integer. Because the address-order of lowest vs highest value bytes flips on little vs big endian, that means that the addresses of the respective bytes change. This proposal would actually allow quite an easy (and for my understanding intuitive) way of converting between them - the flag you mention would be the backing integer's endianness. Not by any extra logic mind you, it just arises automatically from the fact that this is what endianness is and how byte-swapping works.

Here is a (longer) code example of how endian-aware backing (naturally) works:

```zig // bits in comments numbered starting with bit 0 (so second byte is range [8, 15]) pub fn Example(comptime BackingType: type) type { return packed struct(BackingType) { a: u3, //stored in lowest-value bit range [0, 2] b: u15le, //stored in bit range [3, 17] c: u10be, //stored in bit range [18, 27] d: i4, //stored in highest-value bit range [28, 31] }; } const ExampleLE = Example(u32le); const ExampleBE = Example(u32be); //note: exact same definition, but with different backing type // example values (@endianCast here inserted assuming we're starting from little-endian) const example_le = ExampleLE{ .a = 4, .b = 0xABC, .c = @endianCast(0x1EF), .d = -1, }; const example_be = ExampleBE{ .a = example_le.a, .b = example_le.b, .c = example_le.c, //type system sees they're both BE, otherwise @endianCast would be required .d = example_le.d, }; // workaround for status-quo // (note: this means the value of example_be is actually laid out incorrectly in memory) //const u10be = u10; //const u15le = u15; //const u32le = u32; //const u32be = u32; const std = @import("std"); const expectEqual = std.testing.expectEqual; //note that endianness is part of the backing type //(intFromPacked/packedFromInt as builtins or in std.meta would make usage even simpler) const BEBacking = @typeInfo(ExampleBE).Struct.backing_integer.?; const LEBacking = @typeInfo(ExampleLE).Struct.backing_integer.?; test "read individual bytes" { // both values hold same value in their 1st- to 4th-valued byte, // those bytes are just at different addresses. const ptr_le: *const [4]u8 = @ptrCast(&example_le); const correct_be: BEBacking = @bitCast(example_be); //enabled by this proposal // workaround for status-quo: without the proposal, example_be isn't laid out in memory correctly //const correct_be: BEBacking = @endianCast(@as(BEBacking, @bitCast(example_be))); const ptr_be: *const [4]u8 = @ptrCast(&correct_be); try expectEqual(ptr_le[0], ptr_be[3]); try expectEqual(ptr_le[1], ptr_be[2]); try expectEqual(ptr_le[2], ptr_be[1]); try expectEqual(ptr_le[3], ptr_be[0]); } test "read full backing" { const be_backing: BEBacking = @bitCast(example_be); const le_backing: LEBacking = @endianCast(@as(BEBacking, be_backing)); //enabled by this proposal // workaround for status-quo: without the proposal, example_be isn't laid out in memory correctly //const le_backing: LEBacking = @endianCast(@endianCast(@as(BEBacking, @bitCast(be_backing)))); var another_le: ExampleLE = @bitCast(le_backing); try expectEqual(example_le, another_le); } /// workaround: replace all @endianCast-s with this function for running under status-quo. /// Note that this function ALWAYS SWAPS, /// whereas proposed @endianCast would be smart enough to know when to swap /// (would either no-op or lead to a type error if the endianness doesn't change) fn flexibleSizeByteSwap(x: anytype) @TypeOf(x) { if (@TypeOf(x) == comptime_int) return flexibleSizeByteSwap(@as(std.math.IntFittingRange(x, x), x)); const byte_aligned_size_bits = ((@typeInfo(@TypeOf(x)).Int.bits + 7)/8) * 8; const ByteAlignedSizedInt = std.meta.Int(.unsigned, byte_aligned_size_bits); const swapBuffer: ByteAlignedSizedInt = x; return @truncate(@byteSwap(swapBuffer)); } ``` This works under status-quo once you activate / swap in the code next to the 4 "workaround" comments, which also means replacing every `@endianCast` with a call to `flexibleSizeByteSwap` (which always swaps, so I placed it only where required running on a little-endian host). As stated, under status-quo `ExampleBE` has no endian-aware backing, so the global `example_le` will not be laid out in memory correctly; I've tried to find the most readable workarounds in the tests which still work under proposed semantics. Without this proposal there's no way for the compiler to tell you where you should or should not insert swaps (because we don' tell it via the type system). With this proposal, the compiler would tell you exactly where a swap is missing or done in error (because it knows via the type system).

zzyxyzz commented 5 months ago

@rohlem Yeah, specifying the endianness of the backing integer should solve this problem. My bad.

nektro commented 5 months ago

Regarding the repeated "only at I/O" boundary comment - the purpose of this proposal is exactly to make it easier to express endianness at these boundaries. To me the most logical place to document this would be in the type system, not just doc comments that you hope people get right. If you can think of a better alternative, or can think of some downside of implementing it this way, please share.

The source of this information is in the specification for the format/protocol and stored at the module level in Zig not the type system. eg a format/protocol isnt gonna have mixed endianness throughout the duration of those reads and is totally unnecessary to be exposed to the user, eg https://git.sr.ht/~nektro/magnolia-desktop/tree/1ddabe69/item/src/e/Image/qoi.zig#L142-144. this file is called through mag.e.Image.qoi.parse(allocator, path); and the user never needs to know or care that the data being read happens to be big endian.

One could imagine me adding a .write method later but similarly it would accept a generic Image struct and the endianness being written out to the respective stream is an implementation detail localized to that file.

packed struct themselves are similarly most useful at binary interface boundaries, and still in the language's type system to enable this expressiveness.

packed structs are indeed great but most of the usages of them in these comments have been wrong imo and act as a negative towards this proposal.

mochalins commented 5 months ago

Commenting that this would be extremely helpful for my use case, as originally proposed by @rohlem (packed struct backing integers and all). We use Zig for firmware in custom devices that must interface with external memory on-board. Some of these external memory interfaces (EMIFs) require LE, others require BE. Would be a huge QoL improvement to be able to define these interfaces directly as packed structs within the type system, without any need for endian conversions. I'm not sure if this is a usage you would consider wrong @nektro , but as of now using packed structs for EMIFs that matched our controller's endianness has turned out to be one of the few major advantages in practice that Zig has brought over C (11, perhaps 17/23 might have different tradeoffs if we could have used them).

blanham commented 3 months ago

Agree with @mochalins that this feature would be great for firmware and kernel devs (and likely emulator authors as well). I wanted to add that GCC added an attribute for this called scalar_storage_order[1] that implements functionality not too dissimilar to @rohlem 's proposal.

[1] https://gcc.gnu.org/onlinedocs/gcc-8.5.0/gcc/Common-Type-Attributes.html

clickingbuttons commented 1 month ago

Regardless of whether its done implicitly or explicitly (I'd much prefer explicit) this design leads to @byteSwaps on every field access. That's generally the same cost as an add or sub.

If your intent is to specify how to encode and decode your struct type, I think field tags are a better solution. Chances are you're going to have to handle struct alignment and other metadata there anyways.

const Header = packed struct {
    foo: u32,
    bar: u32,

    /// Use `@hasDecl` to discover this in your reader/writer impl and call `@byteSwap` appropriately.
    pub const endianness = .{
        .foo = .big,
        .bar = .little,
    };
};

because the value's byte-representation is only ever observed in external interfaces

It's also observed in std.mem.asBytes. Given this proposal I'd expect that function to NOT perform a byte swap.

rohlem commented 1 month ago

Regardless of whether its done implicitly or explicitly (I'd much prefer explicit) this design leads to @byteSwaps on every field access.

@clickingbuttons if I understand your comment correctly, you're suggesting that identifying the endianness of field values via types would introduce more conversions to their values. I don't think this is the case: Unless you use values in arithmetic operations where their endianness matters (so excluding bit-operations), the compiler can keep propagating values in foreign-endianness types. The only change of behavior can happen at boundaries where you want to use values in native-endian format, and the change is that the compiler points out where you forgot to byte-swap. Nowhere else should additional byte-swaps (have to) be introduced.

If your intent is to specify how to encode and decode your struct type, I think field tags are a better solution.

Introducing a separate mechanism means that every call site has to respect this additional mechanism, otherwise it silently misuses the values (leading to endianness bugs). By putting the information into the type system the compiler will point out these bugs to users. (As a side note, if you think you're better off without the feature, you should be able to not use it in your code base.)

because the value's byte-representation is only ever observed in external interfaces

It's also observed in std.mem.asBytes. Given this proposal I'd expect that function to NOT perform a byte swap.

Right, the last point isn't about byte swaps but bit usage within those bytes. Because non-byte integer values are (currently) padded with high-value 0 bits in Zig (so all u9 values have the same in-memory representation as their values as u16), whether those high bits are in the lowest- or highest-address byte on a given host architecture changes with its endianness. This is something the Zig language in status-quo already does, and I imagine would still do even when endian-tagged integer types were introduced according to this proposal. (Note that this happens for all types, so a non-integer packed struct with a size of 9 bits uses the same bits.)

If you care about these padding bit locations, even in status-quo you already have to round up to the next-larger integer type (u9 -> u16, or u24 if you want 3 bytes, or u32 if you want 4 bytes, etc.) and additionally left-shift the value (bit-value-upwards) by the added bit count. My main reason for suggesting keeping this behavior the same is that (afaik) currently all Zig types work this way, so to stay compatible with the representation of all other types, foreign-endian integers would have to behave the same way. Otherwise we would have to add special-case logic for foreign-endian integers, which I imagine would be detrimental to performance in @bitCast-s. Note that most ABI-oriented use cases will already use byte-aligned types (multiples of 8 bits), so they are not affected by this discrepancy.

clickingbuttons commented 1 month ago

Unless you use values in arithmetic operations

This is where you are mistaken. Every load and store MUST potentially @byteSwap. What do you expect the following code to print on a little endian machine?

const a = std.fmt.parseInt(i32be, "131072"); // on store, this value MUST be byte swapped or it'd be an `i32` instead of `i32be`
std.debug.print("{d}/n", .{ a }); // on load this value MUST be byte swapped or it will print the wrong value

rohlem commented 1 month ago

This is where you are mistaken. Every load and store MUST potentially @byteSwap.

@clickingbuttons The byteSwap does not occur on every access. The value is kept in whatever endinanness it was stored in memory. This is the same as in status-quo, without endian-tagged integer types. This proposal is for the type system to additionally keep track of this information. Using a foreign-endian value in contexts that need native-endian integers, such as native arithmetic operations, the compiler can point out the discrepancy and the user can handle this however appropriate - for example by @endianCast which leads to a @byteSwap (or by @bitCast if they actually wanted to reinterpret the value). I tried explaining that in the OP, but I guess it isn't as clear as I wanted it to be.

What do you expect the following code with to print on a little endian machine?

Your "on store" example could either be a compile error pointing out the endianness mismatch, or the implementation of parseInt could perform the @endianCast leading to a @byteSwap. Similarly in your "on load" example, std.debug.print is passed a value of type i32be. I would assume that we would introduce logic that swaps it to native endianness for displaying, but if you wanted you could also leave that unimplemented and instead panic, telling users only native-endian integer types are supported. Or you could implement it by @bitCast and display the endian-swapped value, as would happen without endian-tagged integer types in status-quo.

In status-quo, you can already remember all the necessary places to @byteSwap, or you can forget some. With this proposal, the compiler checks that you do not forget them. You need to mark them with either @endianCast to convert (= @byteSwap like in status-quo), or @bitCast to reinterpret (= no-op like in status-quo).

clickingbuttons commented 1 month ago

I understand you want a le/be tagged integer type for the purpose of the compiler ensuring you remember to cast from/to native integer types. Having implemented big endian network protocols on little endian machines, I sympathize.

My arguments are:

These types are only useful at IO boundaries. Converting from/to native types there ONCE is better than infecting the whole type system and converting on EVERY usage of the field besides as an opaque blob of bytes. This type of conversion doesn't need be or le types, it just needs std.mem.toNative.
Such functionality isn't much more useful than a container that wraps some bytes with a defined endianness and a getter/setter that calls std.mem.toNative. You can create such a container today. For the types to be truly useful for general programming (instead of just type poison babysitters for IO) I think this proposal should define what intrinsics, std.math, and std.fmt functions should do with these types.

rohlem commented 1 month ago

@clickingbuttons

These types are only useful at IO boundaries.

You're free to use them only at these boundaries. I don't anticipate using them very often either.

better than infecting the whole type system and converting on EVERY usage of the field

By now I hope you understand my point that only uses which need to perform conversion would perform it, so I'll stop reiterating it. Reading "I am mistaken" in your previous comment I expected your point to be a technical shortcoming, when now it reads more like "you shouldn't use this for fields you intend to repeatedly access", which I fully agree with you on.

For a use case where this would make sense, let's consider a program that listens to a lot of mixed-endian data, and then only has to use a handful of data points from it. Maybe something like message length always makes sense to decode for understanding the structure of the stream, while most of the fields can stay foreign-endian until needed. The relation might be something like reading a couple kiB from a MiB to GiB data stream. I assume you would agree that it would be suboptimal to convert all input to native endian once read/received. I additionally think it would be a good usage of the type system (whether builtin or userland wrappers) to remember endianness.

The reason for a language proposal then is that I think it's simple enough that it would be useful without being too much added complexity to the compiler. I expect 99% of code to keep using uX without endianness suffix and not even notice a difference.

Such functionality isn't much more useful than a container that wraps some bytes with a defined endianness and a getter/setter that calls std.mem.toNative.

The main advantage I see is:

having standardized vocabulary; compare with Zig's built-in :0-terminator in arrays and slices. You can do this in userspace, which leads to slightly worse ergonomics, and may lead to multiple versions used by different projects. The language providing one solution means we solve this use case once, and every project uses it, given that we can find a design that satisfies everyone (and so far I think we can.)
(better ergonomics = improved readability: @endianCast could be written in generic code and only byte-swap if the result location type requires it)
if, at some point, we want to add support for foreign-endian types in certain operations, we would be free to do that. For example, bit operations (&|^~), assignment and equivalence (=, ==) work regardless of endianness, so there's already no reason to disallow them. Maybe some backends find ways of merging the byte-swap into the larger arithmetic operation that makes it more efficient. I don't personally foresee writing such optimizations, but this proposal would open up the option for the future.

For the types to be truly useful for general programming (instead of just type poison babysitters for IO) I think this proposal should define what intrinsics, std.math, and std.fmt functions should do with these types.

I don't see it as critical to the fundamental proposal, and as a user I don't strongly care. (Meaning if compiler-supported endian-tagged integer types got into the language today, I would use them regardless of whether intrinsics and std support them or force me to convert beforehand.)

If you want my opinion though: Intrinsics should error and tell the user to @byteSwap or @bitCast, std functions should @endianCast = byte-swap. Users should be aware what foreign-endian types are, and that they have to be byte-swapped for arithmetic, so if they want to improve performance they should reduce their usage to the minimum (i.e. a single conversion on first usage).

We could also add a flag allow_foreign_endian to std.options that configures std behavior. Then if someone wants to enforce this, they set the flag to false and all foreign-endian arguments passed to std trigger compile errors.

zzyxyzz commented 1 month ago

What I don't quite like about this proposal is that it attaches endianness to integer types directly, even though it's not really a property of the integers themselves, but only of their storage layout, and as such only makes sense within packed structs. Everywhere else, this type annotation would either do nothing (in the best case) or introduce unnecessary byte swaps where the compiler is unable elide them (which should be pretty much everywhere except in-register operations).

In my opinion, endiannness should be more like an alignment annotation:

const Header = packed struct {
    a: u16 endian(.big),
    b: u16 endian(.big),
};

Then the compiler could automatically byteswap if necessary on reads and writes. Whether this is actually desirable is another question, though. On one hand, this is clearer and safer. On the other, this will probably cost extra byteswaps in most cases. Lazy conversions are only more efficient if you expect to touch a few fields, otherwise it's probably better to do the entire conversion at the IO boundary, as others have pointed out.

rohlem commented 1 month ago

the compiler could automatically byteswap if necessary on reads and writes. Whether this is actually desirable is another question, though.

@zzyxyzz By keeping the information as part of the type, it is more explicit and unnecessary conversions are easier to spot/avoid. That is why I would prefer it, and proposed it this way.

It's also more flexible in that every usage site can decide whether to use @endianCast (which byte-swaps if necessary) or @bitCast (which interprets the bits as a no-op without byte-swapping). I currently don't see a convenient way to model this when the information is only attached to the location.

[...] only makes sense within packed structs. Everywhere else, this type annotation would either do nothing (in the best case) or introduce unnecessary byte swaps

Say you want to cache an ID that is conceptually a foreign-endian number, which you intend to send/write multiple times, more often than using it as native-endian in local logic. (Perhaps the endianness can be useful to know when debugging, or in some ordering logic, safety checks, etc.) Maybe in your understanding that means the ID should always be wrapped in a packed struct. In my mind it's simpler to say the field stores the ID value as a foreign-endian integer.

I agree that the most common use case is for specifying interfaces using packed struct. I think ergonomics of these use cases would be improved by the proposal. I also think there can be legitimate use cases of retaining foreign-endian values for longer than that, which is why I don't think it makes sense to special-case the mechanism too much.

In my eyes (and obviously I'm biased) my proposal is at its core already very simple. Maybe by discussing we'll still find something even simpler, or come across some complexity / drawback that I've overlooked. The idea of making it a location attribute looks both more complex and less flexible to me however. (For example we'd also have to specially integrate it with pointers to integers. If used outside of packed structs, there's propagation to arrays and vectors. I think distinct types avoid any special casing here.)

zzyxyzz commented 1 month ago

@rohlem Just for clarification, how did you intend foreign-endianness integers to be represented internally? 1) Should they be kept in non-native order, with byteswaps before and after every arithmetic operation, but not for loads and stores, or 2) represented as native integers with ordinary arithmetic, but automatic byteswaps upon writes to arrays and packed structs?

rohlem commented 1 month ago

@zzyxyzz I think that information is detailed in the OP, but to clarify: Endian-tagged integers are represented in memory in the endianness they are tagged with. Today foreign-endian data is not tagged in the type system. The whole point is to add this information.

As stated several times now (including in OP), "automatic byte-swaps" are a non-goal. Users should use an @endianCast builtin (= @byteSwap if the endianness changes between source and destination), or use a @bitCast (opting into reinterpreting the data and disregarding endianness).

If maintainers decide that we should allow foreign-endian arithmetic (which I doubt as it can lead to unwanted performance degradation if the users don't realize they're using foreign-endian types) then those would be allowed I guess. Because the endian tag of the type is meant to signal the byte order in memory, that would then probably lead to your option 1).

Though keep in mind that the toolchain can always optimize everything under the as-if-rule. If it internally forces all non-packed struct fields to use native-endianness (your option 2) ), as long as the behavior is the same it's allowed to do that. I believe this is highly unlikely for 2 reasons though:

Such an optimization would go against the explicit intent of the programmer stated in the source code, which feels contrary to the zig zen.
Integer types currently have a defined memory layout. If we really really want the toolchain to be able to do this transformation, we would have to specially assign an undefined memory layout to foreign-endian integers, so their bytes are no longer legally inspectable via @bitCast / @ptrCast. That would seem highly unintuitive to me, and in my eyes complicates the language.

zzyxyzz commented 1 month ago

@rohlem Thanks for the explanation, this was not quite clear to me from the OP. Would the following be a correct description of the intended semantics?

// Native endianness: Little

const S = packed struct {
    lil: u16 = 0,
    big: u16be = 0,
};

fn foo() void {
    const a: u16be = 1
    const b: u16be = 2;
    const c: u16 = 3;
    var s: S;

    const p = a + b; // error
    const q = a + c; // error
    const r = a + 1; // error
    s.lil = a; // error
    s.big = a; // no error, simple store
    s.big = 1; // no error?
}

Basically, everything is an error except storing something in a location with the same type and endianness.

Note: edited const r = a + 1 example.

rohlem commented 1 month ago

@zzyxyzz Yes, that would be my personal preference. For consistency I think comptime_int should also gain _be and _le variants, which would lead to s.big = 1 also being an error and requiring an @endianCast, for consistency with runtime-known integer types.

zzyxyzz commented 1 month ago

@rohlem

For consistency I think comptime_int should also gain _be and _le variants, which would lead to s.big = 1 also being an error and requiring an @endianCast, for consistency with runtime-known integer types.

Then you would be unable to initialize fields and variables without an @endianCast either. And I don't think there's a practical case for tracking the endianness of comptime_int anyway.

Another option to consider is not aliasing the normal integers with the native endianness variant. (I.e., making u16le distinct from u16 on LE platforms and disallowing arithmetic on it as well). Otherwise you would introduce gratuitous incompatibility between platforms -- u16 and u16le would be interchangeable on LE, but the latter would fail to compile on BE, even if there is no reason for it. This would force you to either not use endianness annotations at all, or use them in a deliberate and portable manner.

BTW, I'm not yet in favor of this proposal, just thinking out loud :)

rohlem commented 1 month ago

Then you would be unable to initialize fields and variables without an @endianCast either. And I don't think there's a practical case for tracking the endianness of comptime_int anyway.

@zzyxyzz I don't see an issue with requiring an @endianCast to initialize foreign-endian integers, but I don't currently know of a use case for endian-aware comptime_int either. I do see regularity between comptime_int and sized integers as a worthwhile goal though. (I similarly already use UnsignedComptimeInt and SignedComptimeInt wrapper types in userland, though granted that's a quite esoteric situation.)

Another option to consider is not aliasing the normal integers with the native endianness variant.

Arguments I can think of against doing this:

adds a third option to the endianness field in builtin.Type.Int (or makes it optional)
passing u16, u16le, u16be as anytype argument would now create 3 instead of just 2 instantiations, even though the native-endian one would presumably generate mostly-identical machine code (see also the OP comment on type deduplication)

Otherwise you would introduce gratuitous incompatibility between platforms

Note that this is already the same thing that happens when f.e. using platform-specific APIs: Other build configurations' code paths aren't analyzed, so errors aren't spotted until you build for that configuration.

This would force you to either not use endianness annotations at all, or use them in a deliberate and portable manner.

The universally-correct approach should be to use @endianCast (~~or in rare instances @bitCast~~ wouldn't apply here I assume) on every change between specified- and unspecified-endian types, regardless of what the native endianness is, so that it works everywhere. (Note that @endianCast without changing endianness between source and destination type is intentionally allowed and treated as a no-op.) Every incompatibility pointed out by the introduced compile error can be resolved this way, so this process should be rather straightforward. (Of course, the alternative of re-designing code to use fewer of these @endianCast transitions is in contrast a more involved process.)

expikr commented 1 month ago

Would it make sense to have big/little endianness denoted by the capitalization?

U8/I8: Big Endian u8/i8: little endian

RossComputerGuy commented 1 month ago

@expikr Then how would you denote native endianess?

kanashimia commented 2 weeks ago

As an example Linux kernel uses aliases like __le32, __be32 alongside __u32 for readability in some parts of the code base. https://github.com/search?q=repo%3Atorvalds%2Flinux+%28__be32+OR+__le32%29 Just a type alias, but they are marked as __bitwise and there is an external tool that can analyse them https://en.wikipedia.org/wiki/Sparse#Examples

       -Wbitwise
              Warn about unsupported operations or type mismatches with
              restricted integer types.

              Sparse supports an extended attribute,
              __attribute__((bitwise)), which creates a new restricted
              integer type from a base integer type, distinct from the
              base integer type and from any other restricted integer
              type not declared in the same declaration or typedef.  For
              example, this allows programs to create typedefs for
              integer types with specific endianness.  With -Wbitwise,
              Sparse will warn on any use of a restricted type in
              arithmetic operations other than bitwise operations, and
              on any conversion of one restricted type into another,
              except via a cast that includes __attribute__((force)).

              __bitwise ends up being a "stronger integer separation",
              one that doesn't allow you to mix with non-bitwise
              integers, so now it's much harder to lose the type by
              mistake.

              __bitwise is for *unique types* that cannot be mixed with
              other types, and that you'd never want to just use as a
              random integer (the integer 0 is special, though, and gets
              silently accepted iirc - it's kind of like "NULL" for
              pointers). So "gfp_t" or the "safe endianness" types would
              be __bitwise: you can only operate on them by doing
              specific operations that know about *that* particular
              type.