ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Even more integers but stricter integer operators #7512

Open RogierBrussee opened 3 years ago

RogierBrussee commented 3 years ago

Zig already has 20 different basic integer types. It seems Zig inherited from C that an integer is characterised by signedness and size, and it adds a further attribute: "length is determined by C ABI, explicit, or pointer size". On these integers the main operators are

+, -, *, / (can overflow); %+, %-, %*, %/ (does not overflow); &, |, ^, <<, >>; ==, !=, <, <=, >, >=; !; %

lots of @functions including arithmetic with explicit overflow e.g.

@addWithOverflow(comptime T: type, a: T, b: T, result: *T) bool
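For reference, a minimal sketch of how that builtin is used, assuming the signature as it existed around the time of this issue (later Zig versions changed @addWithOverflow to return a tuple instead):

fn checkedAdd(a: u8, b: u8) ?u8 {
    var result: u8 = undefined;
    // returns true if the addition overflowed; result always holds the wrapped value
    if (@addWithOverflow(u8, a, b, &result)) return null;
    return result;
}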

This is clearly inherited from C, with the exception of the modular operators %+, %-, %*, which are Zig specific.
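For illustration, a small example of those wrapping operators as they already exist in today's Zig (nothing here is proposal-specific):

const assert = @import("std").debug.assert;

test "wrapping (modular) arithmetic in today's Zig" {
    var x: u8 = 255;
    x +%= 1; // wraps around to 0; plain + would be (safety-checked) undefined behaviour here
    assert(x == 0);
}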

I propose to separate (potentially overflowing) arithmetic from modular arithmetic and from bitwise operators of different sizes, and let each be defined on its own integer type:

i8/16/32/64/128 --> +, -, *, !, @divTrunc(), @divFloor(), @rem(), @mod(), ==, !=, <, <=, >, >= (signed comparison)
u8/16/32/64/128 --> +, -, *, /, !, %, ==, !=, <, <=, >, >= (unsigned comparison)
m8/16/32/64/128 --> %+, %-, %*, ==, !=, !

Update: % as an operation from, say, m32 -> m32 makes no sense for modular ints, but conversions like m32 --> m17 are well defined and harmless and can be expressed with @as(m17, n). Update: modular division is slightly subtle; see the open issues.

b8/16/32/64/128 --> &, |, ~, ^, <<, >>, @shra(), @rotl(), @rotr() (+ more bit ops), ==, !=, !

Mathematically this is natural: it makes a distinction between the integers Z, which are approximated by i64 (in the same sense that real numbers are approximated by f64), the integers modulo 2^64, Z/2^64Z, which can be exactly represented by the 64 bits of m64, and the 64-bit bit vectors {0,1}^64.

Some open issues: naming: m8/16/32..., b8/16/32/... is very minimalistic; explicit modular8/16/32/... and bits8/16/... can also be argued for.

Notation: Is there still a need for the notation %+, %-, %* when the type already makes explicit that modular arithmetic is asked for? This is less explicit, but no different from the overloading of +, -, ... between the different integer types and floating point.

Intermediate types and (in)equality: Intermediate length types work just fine. E.g. a type m3 would simply be represented by 8 or 32 bits with the corresponding arithmetic operators. The %+, %-, %* (see below for %/) work just fine using the corresponding two's complement arithmetic operators (because that is modular arithmetic!), as long as (in)equality is properly defined, i.e. e.g. in m5

a == b

is equivalent to

(a - b) == 0

which is equivalent to

@intCast(b8, a - b) & 0b11111 == 0

which is e.g. equivalent to

@intCast(b32, a) & 0b11111 == @intCast(b32, b) & 0b11111
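A sketch of that equality in today's Zig, using a u8 as the carrier for a hypothetical m5 value (the m5 type itself is only proposed):

// equal as m5 values iff the low 5 bits of the wrapped difference are zero,
// exactly the chain of equivalences above
fn m5Eql(a: u8, b: u8) bool {
    return (a -% b) & 0b11111 == 0;
    // equivalently: (a & 0b11111) == (b & 0b11111)
}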

Conversion: Conversions like i32 --> m32 and u32 --> m32 are a no-op at the bit level and respect +, -, *, !=, ==, so they are harmless. Even conversions like m64 --> m32 or even m32 --> m5 are harmless and can be a no-op if one does lazy normalisation and only normalises when (in)equality needs to be computed, so conversions i64 --> m5 are likewise harmless.

The converse conversion m32 --> i32 is sign extension, while m32 --> u32 is zero extension, so it may be non-trivial and in any case represents a choice, namely what one means by <. Thus it requires a cast.
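A small illustration in today's Zig of the choice being made here, written with the builtin spellings of that era (@bitCast with an explicit destination type):

const assert = @import("std").debug.assert;

test "sign extension vs zero extension of the same bit pattern" {
    const bits: u32 = 0xFFFF_FFFF;
    assert(@as(u64, bits) == 0xFFFF_FFFF); // read as unsigned: zero extension
    assert(@as(i64, @bitCast(i32, bits)) == -1); // read as signed: sign extension
}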

This would suggest that i32 --> b32 and vice versa also both require a cast. This separates the bit ops from the arithmetic operations and gives extra type safety to things like the flags argument of a system call.

Semantics of %/ vs @modDiv: Mathematically, the precise semantics of %/ when dividing by a power of 2 is iffy. It should overflow just like dividing by 0 overflows (i.e. is undefined behaviour). (However, e.g. if x: m64 and x == 0 mod @pow(2, 32), then x %/ @pow(2, 32) is well defined in m32.) Hence it is probably better to have @modDiv() (which uses the Euclidean algorithm and can overflow) and @modDivExactPow2().
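For comparison, present-day (non-modular) Zig already has an exact-division builtin with a contract that is similar in spirit, i.e. it is checked/undefined behaviour when the division leaves a remainder; a minimal sketch:

const assert = @import("std").debug.assert;

test "exact division by a power of two" {
    const x: u32 = 96;
    assert(@divExact(x, 32) == 3); // safety-checked UB if x were not a multiple of 32
}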

Bitops: The b bitfield type has << and >> shifts. The latter is a logical right shift; the arithmetic shift is provided as @shra(). (Arguably << and >> are better off, and less error prone, as @shl() and @shr() (or even more arguably @shiftup and @shiftdown), being just two of the more commonly used bit operations.)
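A small illustration of those operations as they exist in current Zig, where they apply directly to unsigned integers (the b types are only proposed; std.math.rotl is an existing library function):

const std = @import("std");
const assert = std.debug.assert;

test "rotate and shift on a plain u8" {
    const x: u8 = 0b1000_0001;
    assert(std.math.rotl(u8, x, 1) == 0b0000_0011); // rotate left by one
    assert(x >> 1 == 0b0100_0000); // >> on an unsigned integer is a logical shift
}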

Powers and xor: The above makes mixing arithmetic and bitwise operators unpleasant, because you have to cast. The compiler should have no trouble changing a: u32 = a % @pow(2, 12) (or a: u32 = @as(m12, a)) to bit operations under the hood. (However, IMO, the @pow(2, 32), while not terrible, is less than optimal. The obvious solution is to use ^ as the power operation on signed, unsigned, and modular integers and use @xor() (or #, or an infix @xor@ or :xor:, or just xor) for the xor operation. Zig uses 'and' and 'or' for logical and and or, and they are a lot more common than xor.)

Comparison ops @ltu(), @leu(), @lts(), @les() for modular ints: The Boolean function fn ltu(n: m32, m: m32) bool { return @intCast(u32, n) < @intCast(u32, m); } is of course perfectly well defined, but it has no good properties like a < b implies a + c < b + c, and undefined behaviour should not sneakily give it those, so intrinsics for unsigned and signed comparison are probably a good idea.

Rogier Brussee

ityonemo commented 3 years ago

I kind of love this proposal.

If this gets accepted, I would suggest that the "standard integer ops" should allow >> as a "fast, never checked, division for powers of two", and divide should NEVER optimize division by two at compile-time. Possibly also << as a "fast (but checked) multiply for powers of two", noting that in most architectures there is not a difference (but this distinction is important in say embedded systems).

Open questions: should bitwise << >> over/underflow with %>> and %<< being checked?

ityonemo commented 3 years ago

ps @RogierBrussee some of your asterisks are disappearing on account of github turning them into italicize directives.

andrewrk commented 3 years ago

This is proposing to reverse #159, which has been thoroughly considered already, so it's unlikely to be accepted.

SpexGuy commented 3 years ago

What is the range of m32 and b32? Signedness seems like it's orthogonal to wrapping/two's complement behavior, so you would need im32, um32, ib32, and ub32. This just seems silly to me.

This proposal is suggesting to add a significant set of new rules to the language. To merit that complexity, it needs to prevent footguns or grant significant new power. It's not clear to me that it does that. It also does great damage to ergonomics. Here is an example of a useful routine I wrote recently to increment a value within a group:

const Orientation = enum {
    Up,
    Right,
    Down,
    Left,

    FlippedUp,
    FlippedLeft,
    FlippedDown,
    FlippedRight,

    // before this proposal:
    pub fn clockwise(self: Orientation, amt: u2) Orientation {
        const int = @enumToInt(self);
        // increment bottom 2 bits, keep top bit.
        const rotated = ((int +% amt) & 3) | (int & 4);
        return @intToEnum(Orientation, rotated);
    }

    // after this proposal:
    pub fn clockwise(self: Orientation, amt: u2) Orientation {
        const int = @enumToInt(self);
        // increment bottom 2 bits, keep top bit.
        const rotated = @intCast(u3, (@intCast(b3, @intCast(m3, int) + amt) & 3)
            | (@intCast(b3, int) & 4));
        return @intToEnum(Orientation, rotated);
    }
};

It's not true in general that an integer is fully one of (adding, modulating, bitwising). No hardware in use these days imposes this restriction. I don't see any footguns that would be avoided by splitting these types.

RogierBrussee commented 3 years ago

There is no such thing as a signed or unsigned m32, because modular arithmetic is like computing on a clock: for any a, b: m32, adding 1's to a or subtracting 1's from a both get you to b. Hence there are by design no <, <=, >, >= operators defined on m32. In other words, the range of m32 is
{0 mod 2^32, ..., 2^32 - 1 mod 2^32} == {-2^31 mod 2^32, ..., -1 mod 2^32, 0 mod 2^32, 1 mod 2^32, ..., 2^31 - 1 mod 2^32}, because these sets of modular numbers are equal mod 2^32. Likewise, b32 is not ordered, so it equally does not have signedness (it just has a bit 31 which hardware may test easily), (and of course the elements are enumerated {0b00...00, 0b00...01, ..., 0b11...11}).

NOTE here I use the notation 2^32 == pow(2,32)

The Orientation example just shows that things get more painful if you insist on going back and forth with the binary representation of an integer. This is of course a useful technique at times, but this proposal insists that you make using the binary representation of a u32 or i32 integer explicit by doing a cast to a b32.

It is also true that hardware does not impose these restrictions: hardware just transforms one bit pattern into another, and in fact what the hardware calls add is modular add. Thus if you start from the hardware, an int is just a bit pattern. C famously started out that way, and also identified integers with addresses. It was then found to be useful to abstract to pointers. One could argue that b32 should be C's arbitrary bit patterns, and allow both the arithmetic operations and bitwise operations. Arithmetic is after all a bit operation as well. However, I thought it would be more useful to have a bit more type safety for things like this:

fn foo(index:u32, flags:b32) void

so YMMV, I can see arguments either way.

Update: Using %+, %- as notation for the bit operations of two's complement arithmetic should give you much the same level of type safety, as modular arithmetic and modular types should be rare.

Now indeed the rotation example gets a little more painful, but precisely because the wind rose naturally uses modular arithmetic mod 4, one can certainly do better than just casting up and down. Here is how I would write your example using this proposal:

fn clockwise(self: Orientation, amt: m2) Orientation {
    const int = @enumToInt(self);
    const direction: m2 = int;
    const flippant = int / 4;
    const rotated = @intCast(u3, direction + amt) + flippant * 4;
    return @intToEnum(Orientation, rotated);
}

tecanec commented 3 years ago

Separating out an int-like type without any operators besides == and != could make sense as a container for generic data, such as that which is managed by a std.mem.Allocator. Like others, I do disagree with separating bit arrays from integers, however. There are just so many places where applying binary operations to integers makes sense and is useful.

RogierBrussee commented 3 years ago

Here are some practical advantages of this proposal (see also the smaller updates to the original proposal).

For having no bit operations and no modular arithmetic on u8, i8, ...:

consider the following expressions

  1. n: u32; _ = n & (n - 1) // Pick out least significant bit
  2. n: u32; _ = n << 3 + 1 // 8*n + 1
  3. n: i32; _ = n >> 2 // n / 4
  4. n: i32; _ = n & (-1 +% 1 << 5) // n % 32
  5. n1, n2, n3, n4: u32; _ = 3*%n1 +% 4*%n2 - 5*%n3 +% 6*%n4 // modular arithmetic

Each of these expressions is subtly buggy in a different way:

  1. has undefined behaviour for n == 0 (unlike the C version, which is n & (n -% 1)).
  2. is actually equivalent to _ = 16*n.
  3. There is no such thing as n/4 but, for good reason, only @divTrunc(n, 4) and @divFloor(n, 4), and n >> 2 == @divFloor(n, 4).
  4. is actually equivalent to _ = n and does not make explicit whether @mod or @rem is wanted.
  5. does not do modular subtraction, so it may overflow and therefore has undefined behaviour (unlike the fully modular version).

Even Kernighan and Ritchie warn against doing this in C, because the precedence rules make it error prone. See also https://wiki.sei.cmu.edu/confluence/display/c/INT14C.+Avoid+performing+bitwise+and+arithmetic+operations+on+the+same+data

Note that in examples 2 to 4 the bit manipulations are completely unnecessary: the compiler can change

8*n + 1, @divFloor(n, 4), @mod(n, 32)

to bit operations as an optimisation just fine.
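To make the precedence pitfall in expression 2 concrete, here is a check that runs in today's Zig:

const assert = @import("std").debug.assert;

test "expression 2 parses as n << (3 + 1)" {
    const n: u32 = 1;
    // + binds tighter than <<, so this computes 16 * n rather than 8 * n + 1
    assert((n << 3 + 1) == 16 * n);
}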

For having modular types m8, m16,...

For having a bitfield datatype:

It makes the need to mingle arithmetic and bitwise operations very rare, and possibly makes them easier to optimise for modern architectures where these operations map to hardware bit operations. And if you really, really have to mingle, you can still do it using casting.

const FooFlags = enum(b5){Flag0, Flag1, Flag2, Flag3, Flag4, LowFlags = Flag0 | Flag1}

is a natural way to define an enum that is supposed to be used as flags.
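For contrast, a hedged sketch of how such a flag set is commonly modelled in present-day Zig; the explicit u5 backing integer is syntax from more recent Zig versions and the names are illustrative:

const FooFlags = packed struct(u5) {
    flag0: bool = false,
    flag1: bool = false,
    flag2: bool = false,
    flag3: bool = false,
    flag4: bool = false,
};

// the analogue of LowFlags = Flag0 | Flag1
const low_flags = FooFlags{ .flag0 = true, .flag1 = true };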

Rogier

kyle-github commented 3 years ago

@RogierBrussee I do like the bit types. I mentioned something like this in other issues. I was thinking more along the lines of bag-o-bits types where you do not want to be able to do arithmetic operations on them. For instance, taking bytes out of a utf8 stream. Those are raw data, not unsigned 8-bit integers. Or dealing with user address pointers while in kernel space (you definitely do not want to allow dereferencing).

RogierBrussee commented 3 years ago

@kyle-github Thanks for your reaction.

I think utf8 bytes as b8 is reasonable, especially because you should not be comparing them with < (because they are not glyphs). But you would still want to know that those bytes encode text in utf8 encoding, with perhaps something like an @encoding attribute:

const utf8 = b8 @encoding(.Utf8);
const latin1 = b8 @encoding(.Iso8859_1);

with @TypeOf("ß") == [3]const utf8

and

@TypeOf("ß" @encoding(.Iso8859_1)) == [2]const latin1

But that is a whole other can of worms.

kyle-github commented 3 years ago

Part of the discussion about the utf8 data pointed out that there are times when you want to have data that is not utf8 compliant.

My thought with something like that is to have the concept of a string_byte. I.e. each byte might not make sense on its own (think about one byte in a multibyte sequence). Only some operations make sense on such data: equality comparison (so you can find the end of a string for instance), extraction of bit fields and/or masking (to figure out if this is the terminal byte in a sequence for instance). As you note, comparison like less than and greater than make little sense here.

Today you can sort-a/kind-a fake this by creating a struct with a single field with a u8 in it. However, then you lose all operators for integer values including the ones we actually want. This turns string handling code into a mess and would likely cause people to immediately stop using it. A gain of safety and a complete loss of users :-/

There are multiple ways to solve this. I would prefer some way to take the basic integer types and remove operators from them:

const string_byte = @MakeBitsType(u8, .{ .allowed_operators = .{ "==", "!=" } });

This is not well thought out, so please ignore the syntax.

There is a strong desire in the Zig community to not allow any kind of operator overloading, so the above does not do that: it only provides an allowed list of existing operators. You can only take away, not add.

String bytes fall out as a possible built-in type because you really, really want people to use them correctly.

RogierBrussee commented 3 years ago

@kyle-github In many ways your string byte is what the C char is supposed to be (they are of implementation-defined signedness, so you cannot portably compare them) but isn't, because of historical baggage. In fact, isn't what you define exactly

const utf8byte = enum(u8){ };   // string_byte of a Utf8 string
const latin1byte = enum(u8){ }; // string_byte of an Iso8859_1 string
const byte = enum(u8){ };       // string byte of a completely arbitrary byte string
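A sketch of that enum(u8) trick in today's Zig: a non-exhaustive enum gives a distinct byte type with only == and != plus explicit conversions (Utf8Byte and isAscii are illustrative names; @enumToInt is the builtin spelling from the time of this thread):

const Utf8Byte = enum(u8) { _ };

fn isAscii(b: Utf8Byte) bool {
    // no <, +, or & on Utf8Byte itself; anything numeric needs an explicit conversion
    return @enumToInt(b) < 0x80;
}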

But if you have string literals (and I don't think anybody wants to get rid of those) you need to have something in the language to define what kind of character encoding is used in a string literal.

RogierBrussee commented 3 years ago

Your @MakeBitsType gave me an idea for a concise, programmatic way to describe what I propose.

Define:

const Order = enum { Signed, Unsigned, Unordered };

const Operators = enum {
    @"==", @"!=", @"<", ..
    @"+", @"+%", ..
    @"&", @"|", @"^", // or xor, so one can use ^ for power ???
    @"<<", @">>",
};

fn @IntegralType(n_bits: comptime_int, order: Order, ops: []Operators) type

Then define

// No well defined order, hence no + or - because there is no well defined overflow, and no ">>",
// but modular arithmetic (aka 2's complement arithmetic) works just fine;
// for the rest, all the promiscuity you come to expect from a C type.
const c_char = @IntegralType(8, .Unordered, .{@"==", @"!=", @"+%", ..., @"&", @"|", @"^", @"<<"});

// Has well defined signedness, and all the promiscuity you come to expect from a C type that is a thin abstraction of
// an n_bit length machine register of the integer ALU. This is what u8, u16, u32, ... is in Zig now.
const c_u8 = @IntegralType(8, .Unsigned, .{@"==", @"!=", @"<", @"<=", @">", @">=", @"+", @"+%", ..., @"&", @"|", @"^", @"<<", @">>"});
const c_u16 = @IntegralType(16, .Unsigned, .{@"==", @"!=", @"<", @"<=", @">", @">=", @"+", @"+%", ..., @"&", @"|", @"^", @"<<", @">>"});
...

// Same but signed. This is what i8, i32, ... is now.
const c_i8 = @IntegralType(8, .Signed, .{@"==", @"!=", @"<", @"<=", @">", @">=", @"+", @"+%", ..., @"<<", @">>"});
...

// ABI dependent.
const c_short = c_i16;
const c_ushort = c_u16;
...

// Has well defined order (unsignedness). Implements an abstract unsigned integer < 2^n_bit, so it
// hides the modular arithmetic (aka 2's complement arithmetic) of the underlying hardware
// behind proper integer arithmetic, with (debug checked) undefined behaviour on overflow, which is just the mathematical
// fact of life that e.g. addition is only a partially defined function on a bounded set of integers.
// No bit operations (which are a layering violation). This is what I propose to redefine u8, u16, ... to, and together with
// their signed counterparts these should continue to be the go-to integer types.

const u8 = @IntegralType(8, .Unsigned, .{@"==", @"!=", @"<", @"<=", @">", @">=", @"+", @"-", @"*", @"/"}); // add @"^" for @pow if not used for xor.
const u16 = @IntegralType(16, .Unsigned, .{@"==", @"!=", @"<", @"<=", @">", @">=", @"+", @"-", @"*", @"/"}); // add @"^" for @pow if not used for xor.
const u32 = ...
...

// Same but abstracting signed integers -2^(n_bit-1) <= n < 2^(n_bit-1); no @"/", use @divTrunc and @divFloor explicitly.
const i8 = @IntegralType(8, .Signed, .{@"==", @"!=", @"<", @"<=", @">", @">=", @"+", @"-", @"*"}); // add @"^" for @pow if not used for xor.
const i16 = ...
...

// Implementation of a bitset {0,1}^n_bit, the go-to type for flags.
// It should come with other typical bit operations like
// @rotl(), @rotr() (rotate left and right), @shra() (shift right arithmetic), @setbit(), ...
// Perhaps use a keyword "xor" for "^".

const b8 = @IntegralType(8, .Unordered, .{@"==", @"!=", @"&", @"|", @"^", @"<<", @">>"}); // with perhaps "xor" for "^"
const b16 = ...
...

// Implementation of the integers modulo 2^n_bit
// (i.e. the abstract ring (Z/2^n_bit Z, +, -, *) of congruence classes mod 2^n_bit).
// This is what the hardware provides under the name of 2's complement arithmetic, and as far as arithmetic goes
// (very much including the absence of undefined behaviour) it is what a C unsigned integer provides. However,
// a C unsigned integer also provides comparison, which is not compatible with modular arithmetic and takes great care,
// so if provided these should be Boolean functions @ltu(), @lts() (less than unsigned and signed),
// and @leu() and @les() (less or equal than unsigned and signed). This is the go-to type if modular arithmetic is useful,
// in particular if the absence of undefined behaviour is essential or the circular nature is natural.
// It is not faster than u8, u16, ... in non-debug mode (if anything, it prevents UB optimisations if you try to use them as "faster" unsigned integers).

const m8 = @IntegralType(8, .Unordered, .{@"==", @"!=", @"+%", @"-%", @"*%"}); // add @"^%" for @pow if not used for xor.
const m16 = ...
...

Rogier.

kyle-github commented 3 years ago

@RogierBrussee, I think on the integers you captured where I was going, though with much more mathematical rigor! I was less interested in the integers themselves than in other bag-o-bits types like addresses, where you can have strange sizes (e.g. 48 bits on some processors) or strange bit pattern requirements (e.g. x86-64's requirement that all the higher bits be all ones or all zeros). I would probably also have an option for how underflow/overflow is dealt with, because on DSPs you often get operations for saturating arithmetic.

The things I think are important are:

With carefully chosen settings here, you can make 99% of existing Zig code work without change. That would be important for any adoption and to keep things simple.

For strings, the two problems I was trying to solve were: 1) formatted printing, 2) debugging output. Right now, you do not know if a []u8 represents a string or not. There is no type information to rely on, so any comptime introspection will fail to distinguish something the user meant to be printable from a raw byte blob. My thought here was to have something like string_byte as a type that indicates "on this platform, this byte is part of a printable array or slice." Note the "on this platform".
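A sketch of what that comptime check could look like in today's Zig; StringByte and isPrintableSlice are hypothetical names, and the point is only that a distinct element type would make the distinction visible to reflection:

const StringByte = enum(u8) { _ }; // hypothetical "printable byte" element type

fn isPrintableSlice(comptime T: type) bool {
    return switch (@typeInfo(T)) {
        .Pointer => |ptr| ptr.size == .Slice and ptr.child == StringByte,
        else => false,
    };
}

// e.g. a formatter could then choose text output for []const StringByte
// while leaving []const u8 as a raw byte blob.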

The reason for having less of a restriction on such bytes was that embedded platforms often do not use utf8 or Unicode in any form simply because of memory constraints. Strings are still printable, but the encoding type is more or less implicit due to the platform itself. Sometimes the encoding is due to hardware and not even software (character displays). In cases like that you may need to add byte data that is not strictly printable on its own (perhaps to change colors on the output).

Questions:

Is there a benefit to having all these options for common code? It is not clear that picking a single set of behavior for unsigned and signed integers is not sufficient for the 99% common case. E.g. people that want saturating arithmetic can use special functions on existing integers (perhaps using intrinsics that map to specific CPU functions on DSP platforms). E.g. people that are writing operating systems can wrap variously sized integers in structs and construct other types that way and use struct-specific functions to do things like masking, bit field extraction etc. to handle addresses from different protection domains.

Zig aims to be simple and explicit. These ideas (which I personally like a lot!) are definitely more explicit, but I am not sure they are simple. If they can be hidden from most programmers by making common definitions then I think this will conform to the Zen of Zig. If programmers are forced to build their own types all the time, then they make the common case too complicated.

Is there utility for a string byte type? Here I think the answer is a more clear, "yes." If you have type information about whether something is printable or not, you can use that in comptime code to determine when to print a byte array/slice. You can drop that into debugging info and thus the debugger can guess more accurately when something should be printable.

I understand why so many in the Zig community dislike unconstrained operator overloading, but handling variations on integral types is one area where it would be extremely handy to be able to say, "for this type, here is the operation you do to add." Using something like that you could, in pure Zig, define all the integral types, define floating point types, handle saturating vs. modular arithmetic, have real fractional types, handle addresses in a generic way, handle BCD types (still used in mainframes) etc. Note that Zig does have polymorphic operators today: + for i8 vs u8 vs f32 vs pointers for instance.

kyle-github commented 3 years ago

There is a good discussion in #159. @SpexGuy has a nice example of where UB can come into the picture.

RogierBrussee commented 3 years ago

I agree that []u8 is suboptimal as a string type, because you cannot tell the difference between a byte and a printable byte. Even []c_char would be better, I think, except there is no such thing as c_char in Zig (I guess because of an insistence that an integer type has a definite signedness). However, that still does not tell you which character encoding is used. I think it is perfectly sane to make Utf8 the default encoding but to be able to specify a particular encoding, which is what I suggested doing with some sort of @encoding() attribute (which flags the whole string!) or by encoding the encoding in distinct types like

const utf8byte = enum(u8){};
const latin1byte = enum(u8){};
const utf16word = enum(u16){};

Or any other way to make distinct integral types of definite size. I don't know if mentioning C++ is considered civil behaviour here but in C++ they introduced distinct char, char8_t, char16_t, char32_t, wchar_t types because of different character encodings over the years and FWIW, I think they were right on this one.

Questions:

  1. I think there is benefit in having clearly distinguishable integral types because they provide typesafety for distinct use cases. In fact the Linux kernel seems to have ways to make distinguishable integral types using "__bitwise" that is an empty #define for GCC but picked up by the sparse compiler to check if integer types for which one does want to use integer operations are used in a typesafe manner. I can also see the benefit in "promiscuous" types like c_u32 or c_uint but mainly because, for historical reasons, C has defined them that way, and I can see the translate C effort (or copy pasting from C code!) to become more difficult if you don't have types that behave the same way. Note that one tends to think of C's "int" and "unsigned int" as the numerical version of mathematical signed and unsigned integers (because that is how they are mostly used), but it is more correct to think of them as integral types with a (powerful) hodgepodge of operations that have error prone precedence rules (and the arithmetic on "unsigned int" is modular!). That made sense when C started out as a portable assembler (in the very earliest incarnations, even pointers were a "long int") but would you make that decision today?

  2. You should not define your own types: I think u, i, b, m, c_u, c_i, c_short, ..c_ulonglong and a new c_char should all be built in types. I used the @IntegralType mainly as a way to explain what I propose in code. I personally think it is easier to explain what u32, i32 (in the proposed sense), m32, and b32 are and then, what c_u32 and c_i32 (i.e. u32 and i32 in the current sense) are as the promiscuous versions that use the underlying bit representation than vice versa, and there are a lot fewer precedence rules to worry about if you use the more typesafe types. I cannot really judge, but I doubt this would make the compiler really more difficult.

But perhaps the way to go is to just rename the current u --> c_u, ..., i --> c_i, defined as the "promiscuous" types they are now, have something like @IntegralType take a type after all (i.e. fn @IntegralType(base: type, ops: Operations)) and "select" operations from the base type to define u, b, m, c_ushort, ..., and have that imported from "std", thereby making the language smaller. However, conversions between the types are not so simple, and I think just having a few more builtin types is easier. There should still be value in making new integral types. Maybe (thinking aloud) @IntegralType should really be enum with specified operations (using the c_ versions as "current Zig").

// A 32 bit flag: 0, A, B, ~A, ~B, A|B, ~A|B, A|~B, ~A|~B, ~(A|B), ~(~A|~B), ~0
const Flags32 = enum(c_u32, .{@"==", @"!=", @"|", @"~"}){ A, B };

// An 8 bit type with 256 different elements that can only be tested for (in)equality.
const Opaque8 = enum(c_u8, .{@"==", @"!="}){ };

// Kernel addresses that can be compared and are representable as an integer >= 2^16 and < 2^48.
const KernelAddress = enum(c_i64, .{@"==", @"!=", @"<", @"<=", @">", @">="}){ @pow(2,16) .. @pow(2,48) };

  3. Agreed, and it is not just for +, - but also (and perhaps more problematically) <, <=, >, >=. For what I propose, it is mainly about using +, - or +%, -%, *% for the modular m integers. As a mathematician I am happy to use +, -, but I can see why people want those honking warning signs that "wrapping" is part of the game, and it sure makes things easier to explain:
    u32 and i32 for +, -, <, <=, >, >= (and perhaps ^ for @pow);
    m32 for +%, -%, *% (and perhaps ^% for @pow);
    b32 for &, |, ~, <<, >> and ^ (or perhaps xor).

agathazeren commented 3 years ago

While I like the proposed @IntType, there are a number of knobs to turn on integers other than size and operations, and I think that enumerating the complete set of useful integers is not going to be successful. So I think the best way to do it is to provide a much larger number of knobs on integers, including size, signedness, and operations with their own knobs. Then the standard library or language can define the most useful types, similar to what we have today, while still allowing people to define the specific types required for their use case, such as mXX, ones' complement integers, etc.

I have sketched out a possible version of the idea here: https://paste.sr.ht/~powerofzero/6f2b77583e8132dd541fc2f44510762b2b59f2db

Excerpt:

const u8 = @Type(TypeInfo.Int(.{
    .size = .Bits(8),
    .interpretation = .Unsigned,
    .operations = .{
        .equality = .Mem,
        .comparison = .Mem,
        .arithmetic = .{ .overflow = .Undefined, .div_by_zero = .Undefined },
        .modulo_arithmetic = .Mem,
        .bitwise = .Complete, // I think this should be null, for u8, but that is orthogonal
    },
}));
RogierBrussee commented 3 years ago

@asa-z Thanks a lot for making these ideas concrete, and pointing out that Zig already supports much of it!

Because "integer" seems to be a lot of confusion after years of using C-like language which ingrained the notion of "integer" as a collection of bits in a CPU register usually but certainly not always encoding an integral number through two complement. I think that to avoid that confusion it is very important to get terminology right.
Therefore can I propose:

So I think a bit more verbosity is needed (it is not like defining new integral types would be common thing to do) and using mathy words to clearly distinguish the mathematical/conceptual from the "C-like language" notions. Perhaps something like the below (just a sketch)

const Size = enum {
    BitSize: comptime_int,
    BitSizeOfPointer, // Size of a pointer
    BitSizeOfByte, // Smallest usable unit
    BitSizeOfWord, // Smallest loadable unit
    BitSizeOfCChar,
    BitSizeOfCShort,
    BitSizeOfCInt,
    BitSizeOfCLong,
    BitSizeOfCLongLong,
    _,
};

const CharacterEncoding = enum {
    UTF8,
    UTF16,
    UTF32,
    Ascii,
    Iso8859_1,
    Iso8859_2,
    // a few dozen more encodings
    _,
};

const Interpretation = enum {
    SignedIntegralNumber, 
    UnsignedIntegralNumber,
    CongruenceClassMod2Pow: comptime_int,
    CongruenceClassModulo: comptime_int
    BitVector:comptime_int,
    PolynomialOverGF2Deg: comptime_int,
    GaloisFieldGF2Pow: comptime_int,
    CodePoint: CharacterEncoding,
    _,
};

const Representation = enum {
    TwosComplement,
    TwosComplementLowerBits: comptime_int,
    TwosComplementRemainder: comptime_int,
    BitsInNativeOrder,
    BitsInLEOrder,
    BitsInBEOrder,
    OnesComplement,
    NANBoxedDouble,
    BinaryCodedDecimal,
};

const Operations = struct {
    // No need for separate fields for each operation, because almost all
    // of the time you will want to enable categories (??)
    equality: ?Equality = null,
    comparison: ?Comparison = null,
    integral_arithmetic_operations: ?ArithmeticOperations = null,
    modular_arithmetic_operations: ?ArithmeticOperations = null,
    bitwise_operations: ?BitwiseOperations = null,
    bitwise_definedness: ?BitwiseDefinedness = null,

    const Equality = enum {
        EqualityOfLastBits: comptime_int,
        EqualityOfRemainderNumber: comptime_int,
        EqualityOfRemainderPolynomial: comptime_int,
        _,
    };

    const Comparison = enum {
        SignedComparison,
        UnsignedComparison,
        Lexicographic,
        _,
    };

    const Overflow = enum {
        Undefined,
        WrapSilently,
        CannotHappen, // should perhaps be equal to WrapSilently even though subtly different ????
        Saturate,
        TrapsError,
        ReturnError, // return an error union
        ReturnValue: comptime_int,
    };

    const DivByNonDivisible = enum {
        Undefined,
        TrapError,
        ReturnError, // return an error union
        ReturnValue: comptime_int,
    };

    const OperationsType = enum {
        SignedNumber,
        UnsignedNumber, // minus is iffy!
        OptionalSignedNumber,
        OptionalUnsignedNumber,
        Modular2Pow,
        ModularResidue,
        GF2Polynomial,
        GaloisField,
    };

    const ArithmeticOperations = struct {
        type: OperationsType,
        overflow: ?Overflow,
        div_by_non_divisible: ?DivByNonDivisible,
    };

    const BitwiseOperations = struct {
        overshift: ?Overflow,
    };

    const BitwiseDefinedness = enum {
        PerBit,   // `undefined`ness is tracked on a per-bit level
        PerValue, // if any bit is `undefined` then the whole thing is `undefined`
    };

};

And with this we have the status quo, promiscuous, C uint8_t-like u8, which I would like to call c_u8:

const c_u8 = @Type(TypeInfo.Int(.{
    .size = .Bits(8),
    .interpretation = null, // no proper interpretation!
    .representation = .TwosComplement,
    .operations = .{
        .equality = .EqualityOfLastBits(8),
        .comparison = .UnsignedComparison,
        .integral_arithmetic_operations = .{
            .type = .UnsignedNumber,
            .overflow = .Undefined, // note: not the C behaviour, which is to silently wrap!!!!
            .div_by_non_divisible = .Undefined,
        },
        .modular_arithmetic_operations = .{
            .type = .Modular2Pow,
            .overflow = .CannotHappen, // note: the C behaviour, aka silently wrap!!!
            .div_by_non_divisible = .Undefined,
        },
        .bitwise_operations = .{ .overshift = .Undefined },
        .bitwise_definedness = .PerValue, // but if used bitwise, arguably per bit!!! Hence one could argue this has to be null
    },
}));

So my proposal boils down to

const u8 = @Type(TypeInfo.Int(.{
    .size = .Bits(8),
    .interpretation = .UnsignedIntegralNumber,
    .representation = .TwosComplement,
    .operations = .{
        .equality = .EqualityOfLastBits(8),
        .comparison = .UnsignedComparison,
        .integral_arithmetic_operations = .{
            .type = .UnsignedNumber,
            .overflow = .Undefined, // note: not the C behaviour!!!!
            .div_by_non_divisible = .Undefined,
        },
        .modular_arithmetic_operations = null,
        .bitwise_operations = null,
        .bitwise_definedness = .PerValue,
    },
}));

const m8 = @Type(TypeInfo.Int(.{
    .size = .Bits(8),
    .interpretation = .CongruenceClassMod2Pow(8),
    .representation = .TwosComplement,
    .operations = .{
        .equality = .EqualityOfLastBits(8),
        .comparison = null,
        .integral_arithmetic_operations = null,
        .modular_arithmetic_operations = .{
            .type = .Modular2Pow,
            .overflow = .CannotHappen,
            .div_by_non_divisible = .Undefined,
        },
        .bitwise_operations = null,
        .bitwise_definedness = .PerValue,
    },
}));

const b8 = @Type(TypeInfo.Int(.{
    .size = .Bits(8),
    .interpretation = .BitVector(8),
    .representation = .BitsInNativeOrder,
    .operations = .{
        .equality = .EqualityOfLastBits(8),
        .comparison = null,
        .integral_arithmetic_operations = null,
        .modular_arithmetic_operations = null,
        .bitwise_operations = .{ .overshift = .Undefined },
        .bitwise_definedness = .PerBit,
    },
}));

And just for kicks let us define UTF and Ascii code units:

const UTF8cu = @Type(TypeInfo.Int(.{
    .size = .Bits(8),
    .interpretation = .CodeUnit(UTF8),
    .representation = .BitsInNativeOrder,
    .operations = .{
        .equality = .EqualityOfLastBits(8),
        .comparison = null, // you cannot properly compare codepoints; you need the whole []UTF8cu
                            // to get glyphs (which for UTF8 needs no byte order) to compare and sort.
        .integral_arithmetic_operations = null,
        .modular_arithmetic_operations = null,
        .bitwise_operations = null,
        .bitwise_definedness = .PerValue,
    },
}));

const UTF16LEcu = @Type(TypeInfo.Int(.{
    .size = .Bits(16),
    .interpretation = .CodeUnit(UTF16),
    .representation = .BitsInLEOrder,
    .operations = .{
        .equality = .EqualityOfLastBits(16), // works, whatever the byte order
        .comparison = null, // you cannot properly compare codepoints: you need the whole []UTF16cp
                            // to get glyphs (which for UTF16 does need byte order) to compare and sort.
        .integral_arithmetic_operations = null,
        .modular_arithmetic_operations = null,
        .bitwise_operations = null,
        .bitwise_definedness = .PerValue,
    },
}));

const Asciicu = @Type(TypeInfo.Int(.{
    .size = .Bits(8),
    .interpretation = .CodeUnit(Ascii),
    .representation = .TwosComplementLowerBits(7),
    .operations = .{
        .equality = .EqualityOfLastBits(7),
        .comparison = .UnsignedComparison, // you can compare Ascii code units no problem (if you keep bit 8 zero!)
        .integral_arithmetic_operations = null, // you can meaningfully subtract Ascii and it is no big deal to add them,
                                                // but let's stay 7 bit and do it modular.
        .modular_arithmetic_operations = .{
            .type = .Modular2Pow,
            .overflow = .CannotHappen,
            .div_by_non_divisible = .Undefined,
        },
        .bitwise_operations = null,
        .bitwise_definedness = .PerValue,
    },
}));

The UTFX examples show that more thought is needed about how these "in C you use an integer" classes can be (bit)cast to each other, and in particular to usize. Problems like byte order do not magically disappear, although explicitness helps avoid confusion.

Rogier.

data-man commented 3 years ago

Between operator overloading and the invention of square wheels, I choose operator overloading.

agathazeren commented 3 years ago

So, I've been thinking, and I think we need to find a much simpler definition, even if it has less flexibility. (Though you should be able to make anything in userspace on top of binary numbers.)

Here's what I came up with:

const Int = struct {
    size: Size,
    endian: Endian = .Native,
    edge: Edge, // perhaps should be named `safety`?
    interpretation: Interpretation,

    const Edge = enum {
        Undefined, // safety-checked UB
        ErrorUnion, // return error union
        Panic, // panic, even in ReleaseFast
        _,
    }

    const Interpretation = union(enum) {
        Arithmetic: struct {
            signedness: enum {
                Signed, Unsigned, Positive, SignedOnesComplement, _
            },
            kind: enum {
                Modulo, Bounded
            },
        },
        Binary: struct {
            bitwise_defined: bool,
            shift: enum {
                Arithmetic, Binary
            },
        },
        Token, // Or Cases, or Enum, for things like enums and code units
    }

};

However, there is one big problem: this does not allow you to natively represent C-style integers. I have two solutions to this, but I am not particularly fond of either:

Solution 1: Add a C Interpretation.


const Int = struct {
    size: Size,
    endian: Endian = .Native,
    edge: Edge, // perhaps should be named `safety`?
    interpretation: Interpretation,

    const Edge = enum {
        Undefined, // safety-checked UB
        ErrorUnion, // return error union
        Panic, // panic, even in ReleaseFast
        _,
    }

    const Interpretation = union(enum) {
        Arithmetic: struct {
            signedness: enum {
                Signed, Unsigned, Positive, SignedOnesComplement, _
            },
            kind: enum {
                Modulo, Bounded
            },
        },
        Binary: struct {
            bitwise_defined: bool,
            shift: enum {
                Arithmetic, Binary
            },
        },
        Token, // Or Cases, or Enum, for things like enums and code units
        C: struct {
            signedness: enum {
                Signed, Unsigned
            },
        }
    }

};

Solution 2: Make the interpretations not mutually exclusive

const Int = struct {
    size: Size,
    endian: Endian = .Native,
    edge: Edge, // perhaps should be named `safety`?
    arithmetic: ?Arithmetic,
    binary: ?Binary,
    token: ?void, // or void?

    const Edge = enum {
        Undefined, // safety-checked UB
        ErrorUnion, // return error union
        Panic, // panic, even in ReleaseFast
        _,
    }

    const Arithmetic = struct {
            signedness: enum {
                Signed, Unsigned, Positive, SignedOnesComplement, _
            },
            kind: enum {
                Modulo, Bounded
            },
    };
    const Binary = struct {
            bitwise_defined: bool,
            shift: enum {
                Arithmetic, Binary
            },
    };
};