ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Pointer Reform #770

Closed andrewrk closed 6 years ago

andrewrk commented 6 years ago

EDIT

& only used for address-of, no longer designates a pointer type. Necessary because of #588

All of them support pointer indexing and slicing except ^. Only [*] supports pointer arithmetic. All of them implicitly cast to [*]. []null and [N]null implicitly cast to [*]null.

new array syntax

var array: 4*i32 = undefined;

Now it is clear whether you should do &array or &array[0]: don't use &array. If you want a [N]T, i.e. a pointer with comptime-known length, use array[0..]. Do this if the function needs to access more than one element. Otherwise, &array[0] gives ^T, which triggers a compile error if the array has length 0, and only that one element can be accessed through the pointer.
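For readers coming from C, the distinction drawn here between a pointer to the whole array and a pointer to a single element already exists there, just with less visible syntax. The following is only a C analogy, not Zig, and every function name in it is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of what a pointer-to-whole-array refers to (roughly array[0..] above). */
static size_t bytes_behind_array_pointer(void) {
    int32_t array[4] = {1, 2, 3, 4};
    int32_t (*whole)[4] = &array;   /* pointer to all four elements */
    return sizeof *whole;
}

/* Size of what a pointer-to-one-element refers to (roughly &array[0] above). */
static size_t bytes_behind_element_pointer(void) {
    int32_t array[4] = {1, 2, 3, 4};
    int32_t *first = &array[0];     /* pointer to exactly one element */
    return sizeof *first;
}

/* Both expressions yield the same address; only the pointed-to type differs. */
static int same_address(void) {
    int32_t array[4] = {1, 2, 3, 4};
    return (void *)&array == (void *)&array[0];
}

/* Hypothetical self-check tying the three helpers together. */
static void check_array_pointer_kinds(void) {
    assert(bytes_behind_array_pointer() == 4 * sizeof(int32_t));
    assert(bytes_behind_element_pointer() == sizeof(int32_t));
    assert(same_address());
}
```

The address is identical in both cases, so the type is the only place where "one element" versus "N elements" can be recorded, which is what the proposed pointer kinds make explicit.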

This paves the way for #733. See also #386 and #568.

Ilariel commented 6 years ago

If possible I would like to have some other symbol/keyword for:

^ pointer to exactly 1 thing.

This is because ^ isn't ergonomic on all keyboard layouts. On Windows with some international/non-English keyboard layouts you have to type it twice to get ^^ and then remove the extra one every time you write it. See this superuser question. Sure, it is used in bitwise xor, but you tend to have more pointers in your program than bitwise xors.

andrewrk commented 6 years ago

That's good to know. Do you have a suggestion for what other symbol to use?

Ilariel commented 6 years ago

To be honest, I have only a few possibly reasonable ideas, as most of the commonly used or reasonable sigils have already been used for something in Zig.

  1. $ -symbol is widely used in many programming languages. However, this would be a "foreign" sigil to learn as it is not used for this purpose in other languages.
  2. ref/ptr -keyword, keyword bloat, takes a common variable name. However might be friendly for the reader (ref T)/(ptr T)
  3. Change the current & operator to keyword/builtin (@)address_of and then use & as a sigil.

Hejsil commented 6 years ago

Is the new array syntax as flexible as multiplication? Can I do these?

var array1: i32*4 = undefined;
var array2: 4*4*i32 = undefined;
var array3: 4*(4*i32) = undefined;

Or is it restricted to <comptime_int>*<type>? That is, would I have to write array2 like this:

var array2: (4*4)*i32 = undefined;

andrewrk commented 6 years ago

@Hejsil great questions. I think that new array syntax is no good because of this. But we have to make something different than [N]T to distinguish from pointers.

thejoshwolfe commented 6 years ago

4*i32 is no good, because the * operator elsewhere doesn't change the type of something. Even ** turns arrays with a child type into another array with the same child type; it only changes the size. We need a way to turn a scalar into an array, which is not like any infix operator in the language.

Perhaps array[4]i32 where array is a keyword? Having a [4]i32 somewhere in there fits nicely with [4]i32 being a pointer to such.

I expect people will mistakenly declare their structs with [4]i32 instead of whatever the actual array syntax ends up being. Then they'll initialize the pointer to undefined thinking they initialized the elements to undefined, and then begin assigning into the elements, which will cause undefined behavior at runtime. Seems like a footgun.

raulgrell commented 6 years ago

I agree with @Ilariel; my keyboard is one of these (Portuguese). If you press the ^ key once followed by a vowel you get, for example, â. With a consonant you get, for example, ^w. Tapping twice and erasing can break your typing flow, and it's been surprisingly hard to build muscle memory for it. Other characters like this include the tilde ~ and the backtick/grave accent. The caret/circumflex accent ^ requires shift to be pressed and is on the same key as the tilde. The backtick/grave is on the same key as the very similar looking acute accent. I'm not sure about other keyboard layouts, but these are very uncomfortable characters on the Portuguese one.

I also liked @Ilariel's 2nd and 3rd suggestions. I don't dislike the idea of a ref/ptr keyword, but I find &?&&?T more readable than ref ? ref ref ? T for types. I think I'd prefer to keep the & as a reference type and instead use a builtin or an operator like # or $ for address-of.

As for the array syntax, would N[]T be possible? An array is a pointer to a block of memory with runtime known length just like status quo slices, so it's conceptually consistent at least.

const a: 3[]u32 = 3[]u32{1, 2, 3};
const a = 3[]u32{1, 2, 3};

const b = (2*2)[]u32{1, 2, 3, 4};

@thejoshwolfe made a good point about * not changing something's type, and ** changing something's size. But changing an array of one size into an array of another size is also changing the type of the array. Why not use ** itself?

const a: 3**u32 = 3**u32{1, 2, 3};
const a = 3**u32{1, 2, 3};

const b = (2*2)**u32{1, 2, 3, 4};

const c = a ++ b;

It communicates that you're creating a value type which is the result of putting N units of that value type together. If you allow ** and ++ to operate on scalars, the following could create an anonymous tuple.

const T = 4**u32 ++ bool 

const t : T = 4**u32{1,2,3,4} ++ true;
const t = 4**u32{1,2,3,4} ++ true;

thejoshwolfe commented 6 years ago

We can't use ** for making arrays out of scalars, because types are comptime values.

const State = 256**u8; // formerly known as [256]u8
const States = 4 ** State; // is this [1024]u8 or [4][256]u8?

raulgrell commented 6 years ago

I thought both the ++ and ** operators were already only available with comptime values. In the example you gave, perhaps:

const State = 256**u8; // formerly known as [256]u8
const States = 4**State; // This would be [4][256]u8, ie (4 **(256**u8))
const States = (4 * State.len) ** State.child_type; // this is [1024]u8

Though I can see how this isn't ideal.

EDIT

Just consolidating my 2 cents on the pointer-to discussion after a bit of thought.

In C you have & as the address-of operator, and pointers are declared with a *. In C++ you have std::addressof, and references are declared with a &.

In Zig, the 'pointer to exactly one thing' is closer to a C++ reference than a C pointer, so it would make sense to stay close to their interface, and use a builtin @addressOf function.

tgschultz commented 6 years ago

So that it is documented and can be discussed here, we came up with a possible solution on IRC:

Advantages:

Disadvantages:

So if we went this route we'd need a new dereferencing operator. Personally I favor that anyway since prefix * is already a bit weird.

var z = x**y;  //x * (*y)
var a = *s.m.v; //dereferences "v", not s
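Both readings noted in the comments above match how C treats prefix *, so they can be checked with a small C sketch. This is C rather than Zig, and the struct and function names are invented for the example:

```c
#include <assert.h>

struct inner { int *v; };
struct outer { struct inner m; };

/* "x**y" is tokenized as x * (*y): multiply x by the value y points to. */
static int times_pointee(int x, const int *y) {
    return x**y;
}

/* "*s.m.v" binds as *(s.m.v): member access happens first,
   so it is v that gets dereferenced, not s. */
static int deref_innermost(struct outer s) {
    return *s.m.v;
}

/* Hypothetical self-check for the two helpers. */
static void check_prefix_star(void) {
    int six = 6;
    struct outer s = { { &six } };
    assert(times_pointee(7, &six) == 42);
    assert(deref_innermost(s) == 6);
}
```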

One thought I had, and I realize this is a bit strange but hear me out, is postfix .. My reasoning is that . already sort of dereferences implicitly when used with structs.

var s = MyStruct{.v = 10};
var sp = &s;
var v = sp.v;

so all we'd be doing is extending this property to non-structs, really.

var s = i32(25);
var sp = &s;
var v = sp.;  //dot dereferences sp
//...
var p = Point{.x = 0, .y = 10};
var pp = &&p;
var x = p..x;  //currently (*p).x, 

Advantages:

Disadvantages:

So then we'd need new range operators too. -> is available unless it is resurrected for return types.

Other options for deref:

thejoshwolfe commented 6 years ago

> [*N] pointer to a block of memory with comptime known length

Why not *[N]? That already means a pointer to a block of memory with comptime known length according to the rest of the proposal, and it doesn't have the ambiguity with *N.

> [*N]null pointer to a block of memory with comptime known length, and a null/0 at ptr[N]

Instead of this concept, introduce [N]null which is an array of comptime known length with a null/0 at arr[N]. Then we just do *[N]null for a pointer to it.

> Other options for deref:

Zig's grammar depends on knowing if we are at the end of an expression or in the middle of an operator. This means we can't have postfix operators that are identical to infix operators. Here's an example of the ambiguity using ^ as proposed for postfix pointer deref:

// this is ambiguous
const a = b^(1);

// b is a pointer to a function (or double pointer to a function),
// which is being called and given the parameter 1.
const a = (b^)(1);

// b is some integer being xor'ed with 1.
const a = b ^ 1;

We absolutely cannot have ambiguity between infix and postfix operators. This means ^, *, and > can't be used as postfix operators.

There is no problem with ambiguity of infix and prefix operators though, such as with * for deref and multiplication, - for negative and subtraction, & for address-of and bitwise and, etc. The ability to distinguish between prefix and infix operators is what we get for having the above limitation with postfix and infix. And since it's so important to have - and ( as infix and postfix operators, the tradeoff to allow prefix/infix instead of postfix/infix is a no-brainer.

> One thought I had, and I realize this is a bit strange but hear me out, is postfix ..

This actually does not suffer too horribly from the above ambiguity concern, because . isn't really an infix operator. After a ., you have to have an identifier, and an identifier can't ever be a postfix or infix operator. That being said, it still looks pretty horrible.

a.0 doesn't seem as bad, because at least there's something after the dot. This eliminates the problems with .. as you pointed out. But a.0 is bad because it begs the question: can I do a.1? It looks like the 0 is actually a zero, when it's really more of a keyword. 0 doesn't function like that in any other syntactic context, so this seems pretty bad too.

> Personally I favor that anyway since prefix * is already a bit weird.

I'm not too concerned about prefix vs postfix, since this language (and pretty much every other language) has operator precedences that will get weird sometimes like this. Instead of moving operators around, I would propose requiring parentheses sometimes, if the confusion is bad enough. See #114.

The biggest problem I have with * meaning "pointer to" and * meaning "deref pointer" is that those are opposite directions. C has this problem, and I think it's one of the reasons children find pointers hard to learn in C (that and lack of respectable compile errors). If * means "pointer to", then we do need a new "pointer deref" operator, but it could be prefix or postfix. We could use ^ for prefix deref, or even >. There's a lot of space for new prefix operators, but less so with postfix/infix operators.
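The "opposite directions" point can be seen in two adjacent lines of C. This is a minimal illustration with a made-up function name, not a statement about how Zig should spell either operation:

```c
#include <assert.h>

/* The same '*' token is used in both directions. */
static int add_one_through_pointer(int value) {
    int *p = &value;   /* in a declaration, '*' reads "p is a pointer to int" */
    *p += 1;           /* in an expression, '*' reads "the int that p points to" */
    return value;
}

/* Hypothetical self-check. */
static void check_star_directions(void) {
    assert(add_one_through_pointer(41) == 42);
}
```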

tjpalmer commented 6 years ago

Ignoring the syntax, I am really happy with the idea of separating pointer-to-1 from pointer-to-unknown-quantity and so on. (Sad that caret doesn't work easily.)

raulgrell commented 6 years ago

> a.0 doesn't seem as bad, because at least there's something after the dot. This eliminates the problems with .. as you pointed out. But a.0 is bad because it begs the question: can I do a.1? It looks like the 0 is actually a zero, when it's really more of a keyword. 0 doesn't function like that in any other syntactic context, so this seems pretty bad too.

We could have a postfix .&. Borrowing @tgschultz's example:

var s = i32(25);
var sp = &s;
var v = sp.&;  // .& dereferences sp

var p = Tuple {.x = s, .y = sp, .z = &sp};
var pp = &&p;
var x: i32 = pp.&.x;  // currently (*pp).x
var y: i32 = pp.&.y.&; // currently *(*pp).y
var z: i32 = pp.&.z.&.&; // currently **(*pp).z

It doesn't imply indexing, no ambiguities with range syntax, analogous to fields in structs: get value at field vs get value at address.

thejoshwolfe commented 6 years ago

.& looks pretty weird, and it would be the only postfix operator in zig with no variability. The other postfix operators are x{...}, x(...), x[...], x.y, which all take some form of variability as an effective parameter. But I admit I appreciate the symmetry between &x and x.& for the two different ref/deref directions.

I don't think the postfix chaining argument, e.g. ptr.&.&[0].&(0), is very strong, because I think double pointer dereferencing is pretty rare. Even better than array_pointer.&[0] would be array_pointer[0] that includes an implicit .&. This is already how struct_pointer.member and function_pointer(0) work. We don't want double implicit .&.&, because as claimed above, the double pointer deref use case is pretty rare, and I think that would be too confusing.

I think .& is the best thing proposed so far for a new pointer deref operator. I'll post an updated proposal in a moment...

thejoshwolfe commented 6 years ago

(I'm writing this all up in a GitHub issue, but this is intended to go into the documentation somewhere.)

Note: In the following discussion, there is sometimes a distinction between the length of an array and the len field of a thing. An array of length n is defined to be n elements arranged consecutively in memory. The len field of a thing is defined to be whatever is convenient and meaningful in each context it appears.

This proposal does not introduce any new tokens, which means for example that [*]null is equivalent to [ * ] null.

Types

Here T represents any type, which is the "child type". N represents any expression evaluating to a comptime integer. attrs is the place in the syntax where the pointer attributes go, such as const, volatile, align(A), any combination of those, or nothing. The syntax of the attrs is not discussed here, but it's important to note where they go in each syntactic construct.

Dereferencing

For a pointer p of any pointer type, p.& dereferences the pointer.

This operator is implied in the following contexts:

These implicit dereferences do not apply to an expression that is the result of applying one of these implicit dereference rules. For example, p.id is never equivalent to p.&.&.id.

Pointer Arithmetic

For a pointer p of a pointer type with pointer arithmetic enabled, the following operators are allowed, where i is any expression of type usize or isize.

The following infix operators are allowed, but are not precisely defined here. Informally, these operations are defined similar to C, where i is multiplied by the size of p's child type, and then added to or subtracted from the integer value of p. These operators are not commutative; the pointer operand has to be on the left. For example, i + p is a type error.
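Since the paragraph above defers to C for the informal meaning, here is what that scaling rule looks like in C itself. The helper names are invented, and the conversion to uintptr_t is implementation-defined, though it behaves as shown on mainstream platforms. One difference from the proposal: C also accepts i + p, whereas the proposal makes that a type error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte distance between p and p + i, measured on the integer addresses. */
static size_t byte_offset_of_step(const int32_t *p, size_t i) {
    const int32_t *q = p + i;   /* pointer operand on the left */
    return (size_t)((uintptr_t)q - (uintptr_t)p);
}

/* Returns 1 if stepping by i elements moves the address by i * sizeof(element). */
static int scaling_rule_holds(void) {
    int32_t items[4] = {10, 20, 30, 40};
    const int32_t *p = items;
    return byte_offset_of_step(p, 3) == 3 * sizeof(int32_t)
        && *(p + 3) == 40       /* p + 3 addresses the fourth element */
        && (p + 3) - p == 3;    /* pointer subtraction counts elements, not bytes */
}

/* Hypothetical self-check. */
static void check_pointer_scaling(void) {
    assert(scaling_rule_holds());
}
```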

Pointer subtraction is also allowed in some cases. Given p1: [*]attrs1 T or p1: [*]null attrs1 T and p2: [*]attrs2 T or p2: [*]null attrs2 T, the following operators are sometimes allowed:

Implicit Casting

TODO

kyle-github commented 6 years ago

@thejoshwolfe, you had slightly different syntax a couple of days ago. At the risk of bike shedding, it seems like the * before the type being pointed to makes more sense given the way the rest of the syntax works. I.e. in the rest of Zig:

So if T is []i32, then a pointer to that should be *[]i32. Putting the * inside the [] seems inconsistent. Type modifiers are then all right associative (I think, I might be reversing that...).

If you want to make pointers to single objects a supported kind of thing in the language, perhaps make them act like transparent references:

var x: i32;
var y: &i32 = &x; //y and x refer to the same location in memory/register.
                          // y aliases x.

x = 14;
assert(y == 14)...

y= 42;
assert(x == 42)...

C++ has moved to this separation of references and pointers (even though we all know that under the hood a reference is syntactic sugar around a pointer!) and it makes a lot of code cleaner. Think about functions that take or return references. This way, there is no dereference at all for an alias/reference.

Then you can use * for pointers on which arithmetic is possible.

On a more frivolous note, here are some other ideas for the various kinds of arrays.

The first one is identical to a C99 dynamic array, the second to a normal C array and the third can be used for C strings.

So then you get:

[10]&i32 - an array of ten references to i32.
[#]&i32 - an array of references to i32, size known at run time.
[?]u8 - a C-style string.
*i32 - a pointer, with arithmetic, to i32 values.

var x:i32;
var y:&i32 = &x;
y = x; // meaningless since x and y are aliased.
x = 42; // now y == 42 as well.
var a = x + 1; // perfectly legal, a is now 43, no pointer action here.

var p: [?]u8 = func_that_returns_a_c_string(blah); 
var q: *u8 = *p[42]; // not sure about this...
q = p +1; // valid

Apart from the frivolity, I really like the idea of having a pointer to one object (otherwise known as a reference or alias in other languages) and a pointer on which arithmetic can be done. This is really nice!

raulgrell commented 6 years ago

@thejoshwolfe the proposal looks great, though I almost thought the .& was too ugly to propose!

While we're discussing different kinds of arrays, what do you think of an enum array? It has a length equal to the member count of the enum and can only be indexed with an enum value.

const Axis = enum { X, Y, Z};
const vec3 = [Axis]f32 {0.0, 0.0, 0.0};
vec3[Axis.X] == 0.0;

You can get close to this with status quo Zig by specifying the tag type and casting

const Value = enum(u2) {  Zero, One, Two };
const vals = [3]i32{3, 4, 5};
vals[u2(Value.Zero)] == 3;

But then if you change the number of elements, the backing type of the enum or override the values, you need to change a lot of code. And you could still access it with arbitrary integers, so if at any point the index into the array was hardcoded, it would have to be found.

An enum array basically becomes a comptime-checked map!
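For comparison, the closest C idiom is an array sized by a trailing sentinel enumerator. It is sketched here with invented names, and it shows the weakness described above: the bounds check can only happen at run time, because C lets any integer be used as the index.

```c
#include <assert.h>

/* AXIS_COUNT is a sentinel that tracks the number of real members. */
enum axis { AXIS_X, AXIS_Y, AXIS_Z, AXIS_COUNT };

struct vec3 { float at[AXIS_COUNT]; };

static float vec3_get(const struct vec3 *v, enum axis a) {
    /* Runtime check only; C cannot reject a bad index at compile time. */
    assert((int)a >= 0 && (int)a < AXIS_COUNT);
    return v->at[a];
}

static void vec3_set(struct vec3 *v, enum axis a, float value) {
    assert((int)a >= 0 && (int)a < AXIS_COUNT);
    v->at[a] = value;
}

/* Hypothetical self-check. */
static void check_enum_array(void) {
    struct vec3 v = { {0.0f, 0.0f, 0.0f} };
    vec3_set(&v, AXIS_Y, 2.5f);
    assert(vec3_get(&v, AXIS_Y) == 2.5f);
    assert(vec3_get(&v, AXIS_X) == 0.0f);
}
```

Adding a member before AXIS_COUNT resizes the array automatically, but nothing stops a caller from writing v.at[7] directly, which is the gap a language-level enum array would close.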

It could be approximately implemented in userland with something like this if we had a memberIndex built-in or something:

fn EnumArray(comptime T: type, comptime U: type) type {
    return struct {
        data: [@memberCount(T)]U,

        const Self = this;

        // @memberIndex is the hypothetical built-in mentioned above.
        fn get(self: &const Self, tag: T) U {
            return self.data[@memberIndex(tag)];
        }

        fn set(self: &Self, tag: T, value: U) void {
            self.data[@memberIndex(tag)] = value;
        }
    };
}

var value_map = EnumArray(Value, i32) {
    .data = []i32{3, 4, 5},
};

value_map.get(Value.Zero) == 3;
value_map.set(Value.Two, 10);

thejoshwolfe commented 6 years ago

> While we're discussing different kinds of arrays,

I broke that out into its own issue: #793

jido commented 6 years ago

> [N]null T an array of T of length N + 1 with a null/0 at index N. The array has a pseudo field len equal to N. Subscripting at index i: usize is bounded (language-level assertion) by i < N + 1.

Does the null/0 have to be at index N in that case? C strings are stored in a fixed-length array, but the string length can vary; it is not necessarily equal to the array size. The same applies to null-terminated C arrays.
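The situation described in this question is easy to reproduce in C: the buffer has a fixed capacity, while the terminator sits wherever the current string ends. The buffer size and function names below are arbitrary choices for the illustration.

```c
#include <assert.h>
#include <string.h>

enum { BUF_SIZE = 16 };   /* fixed capacity of the array */

/* Copies text into a fixed-size buffer and reports the string length,
   which depends on where the first 0 byte ends up, not on BUF_SIZE. */
static size_t stored_length(const char *text) {
    char buf[BUF_SIZE] = {0};           /* all bytes start as 0 */
    strncpy(buf, text, BUF_SIZE - 1);   /* leave the last byte as terminator */
    return strlen(buf);
}

/* Hypothetical self-check. */
static void check_c_string_lengths(void) {
    assert(stored_length("zig") == 3);   /* terminator at index 3, not 15 */
    assert(stored_length("") == 0);      /* terminator at index 0 */
    assert(stored_length("0123456789abcdefXYZ") == BUF_SIZE - 1);
}
```

Only in the last case, where the input fills the buffer, does the terminator land at the final index, which is the layout the quoted [N]null definition assumes.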

KingOfThePirates commented 6 years ago

I usually wouldn't comment when I don't have competence in the area, in this case pointers, but I feel compelled to share my abstract thoughts. I hope one of these brainstorming ideas could either be or lead to useful ideas:

*variable_name to set. /variable_name to dereference

1*variable_name could work as a singular pointer
*variable_name: T is a C pointer
const &variable_name is a C++ reference pointer
*variable_name = Pointer.deferAlloc() for runtime (in concept, it doesn't have to be a class with a method, but I felt it didn't fit in the same realm as single-character symbols next to the asterisk)
@variable_name deletes a pointer and also sets it to null, and maybe does more pointer management
**variable_name gets the address

T is also where you could put your brackets if you need to declare the type as an array

Speaking of arrays, something like [@] or []@ could mean a C string. @ seems to be the easiest symbol for null after looking at some wikipedia pages.

andrewrk commented 6 years ago

I just pushed 96164ce61377b36bcaf0c4087ca9b1ab822b9457 which disables indexing for single-item pointers and enables pointer arithmetic for unknown length pointers.