ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

https://ziglang.org

MIT License

Pointer Reform #770

Closed andrewrk closed 6 years ago

andrewrk commented 6 years ago

EDIT

Latest iteration of the proposal: https://github.com/ziglang/zig/issues/770#issuecomment-368111127
Progress: https://github.com/ziglang/zig/issues/770#issuecomment-394069958

& only used for address-of, no longer designates a pointer type. Necessary because of #588

^ pointer to exactly 1 thing.
[*] pointer to a block of memory of unknown length
[*]null pointer to block of memory, null-terminated (or 0 terminated for integers). #265
[] pointer to a block of memory with runtime known length. status quo slices.
[]null pointer to a block of memory with runtime known length, with a null/0 at ptr[len]
[N] pointer to a block of memory with comptime known length
[N]null pointer to a block of memory with comptime known length, and a null/0 at ptr[N]

All of them support pointer indexing and slicing except ^. Only [*] supports pointer arithmetic. All of them implicitly cast to [*]. []null and [N]null implicitly cast to [*]null.

&ptr[x] and &foo always gives a ^.
ptr[x..y] with comptime known x and y gives a [N].
array[x..] gives a [N].

new array syntax

var array: 4*i32 = undefined;

Now it is clear whether you should do &array or &array[0]. Don't use &array. If you want a [N]T, i.e. a pointer with comptime known length, use array[0..]. If the function wants to access more than one element, you'll do this. Otherwise, &array[0] will give ^T, which would trigger a compile error if the array was length 0, and only this element can be accessed via this pointer.

This paves the way for #733 See also #386 See also #568

Ilariel commented 6 years ago

If possible I would like to have some other symbol/keyword for:

^ pointer to exactly 1 thing.

This is because ^ isn't ergonomic on all keyboard layouts. On Windows with some international/non-English keyboard layouts you have to type it twice to get ^^ and then you have to remove the extra every time you write it. See this superuser question. Sure, it is used in bitwise xor, but you tend to have more pointers in your program than bitwise xors.

andrewrk commented 6 years ago

That's good to know. Do you have a suggestion for what other symbol to use?

Ilariel commented 6 years ago

To be honest I have only a few possibly reasonable ideas, as most commonly used or reasonable sigils have been used for something in Zig already.

$ -symbol is widely used in many programming languages. However this would be a "foreign" sigil to learn as it is not used for this purpose in other languages.
ref/ptr -keyword, keyword bloat, takes a common variable name. However might be friendly for the reader (ref T)/(ptr T)
Change the current & operator to keyword/builtin (@)address_of and then use & as a sigil.

Hejsil commented 6 years ago

Is the new array syntax flexible like multiplication? Can I do these?

var array1: i32*4 = undefined;
var array2: 4*4*i32 = undefined;
var array3: 4*(4*i32) = undefined;

Or is it restricted to <comptime_int>*<type>? Aka, I would have to write array2 like this:

var array2: (4*4)*i32 = undefined;

andrewrk commented 6 years ago

@Hejsil great questions. I think that new array syntax is no good because of this. But we have to make something different than [N]T to distinguish from pointers.

thejoshwolfe commented 6 years ago

4*i32 is no good, because the * operator elsewhere doesn't change the type of something. Even ** turns arrays with a child type into another array with the same child type; it only changes the size. We need a way to turn a scalar into an array, which is not like any infix operator in the language.

Perhaps array[4]i32 where array is a keyword? Having a [4]i32 somewhere in there fits nicely with [4]i32 being a pointer to such.

I expect people will mistakenly declare their structs with [4]i32 instead of whatever the actual array syntax ends up being. Then they'll initialize the pointer to undefined thinking they initialized the elements to undefined, and then begin assigning into the elements, which will cause undefined behavior at runtime. Seems like a footgun.

raulgrell commented 6 years ago

I agree with @Ilariel, my keyboard is one of these (Portuguese) - If you press the ^ key once, followed by a vowel you get, for example, â. But with a consonant, you get for example, ^w. But tapping twice and erasing can break your typing flow - It's been surprisingly hard to build muscle memory for it. Other characters like this include the tilde ~ and the backtick/grave accent. The caret/circumflex accent ^ requires shift to be pressed and is on the same key as the tilde. The backtick/grave is on the same key as the very similar looking acute accent. I'm not sure about other keyboard layouts but these are very uncomfortable characters for the Portuguese.

I also liked @Ilariel's 2nd and 3rd suggestions. I don't dislike the idea of a ref/ptr keyword, but I find &?&&?T more readable than ref ? ref ref ? T for types. I think I'd prefer to keep the & as a reference type and instead use a builtin or an operator like # or $ for address-of.

As for the array syntax, would N[]T be possible? An array is a pointer to a block of memory with runtime known length just like status quo slices, so it's conceptually consistent at least.

const a: 3[]u32 = 3[]u32{1, 2, 3};
const a = 3[]u32{1, 2, 3};

const b = (2*2)[]u32{1, 2, 3, 4};

@thejoshwolfe made a good point about * not changing something's type, and ** changing something's size. But changing an array of one size into an array of another size is also changing the type of the array. Why not use ** itself?

const a: 3**u32 = 3**u32{1, 2, 3};
const a = 3**u32{1, 2, 3};

const b = (2*2)**u32{1, 2, 3, 4};

const c = a ++ b;

It communicates that you're creating a value type which is the result of putting N units of that value type together. If you allow ** and ++ to operate on scalars, the following could create an anonymous tuple.

const T = 4**u32 ++ bool 

const t : T = 4**u32{1,2,3,4} ++ true;
const t = 4**u32{1,2,3,4} ++ true;

thejoshwolfe commented 6 years ago

We can't use ** for making arrays out of scalars, because types are comptime values.

const State = 256**u8; // formerly known as [256]u8
const States = 4 ** State; // is this [1024]u8 or [4][256]u8?

raulgrell commented 6 years ago

I thought both the ++ and ** operators were already only available with comptime values. In the example you gave, perhaps:

const State = 256**u8; // formerly known as [256]u8
const States = 4**State; // This would be [4][256]u8, ie (4 **(256**u8))
const States = (4 * State.len) ** State.child_type  // this is [1024]u8

Though I can see how this isn't ideal.

EDIT

Just consolidating my 2 cents on the pointer-to discussion after a bit of thought.

In C you have & as the address-of operator, and pointers are declared with a *. In C++ you have std::addressof, and references are declared with a &.

In Zig, the 'pointer to exactly one thing' is closer to a C++ reference than a C pointer, so it would make sense to stay close to their interface, and use a builtin @addressOf function.

tgschultz commented 6 years ago

So it is documented and can be discussed here, we came up with a possible solution in IRC:

* pointer to exactly 1 thing.
[*] pointer to a block of memory of unknown length
[*]null pointer to block of memory, null-terminated (or 0 terminated for integers). #265
[] pointer to a block of memory with runtime known length. status quo slices.
[]null pointer to a block of memory with runtime known length, with a null/0 at ptr[len]
[*N] pointer to a block of memory with comptime known length
[*N]null pointer to a block of memory with comptime known length, and a null/0 at ptr[N]
[N] block of memory (an array).

Advantages:

* is a familiar symbol for "pointer".
* doesn't have the keyboard issues ^ does.
*,[*],[*N] are consistent.
all blocks use [ ]
array syntax doesn't change

Disadvantages:

Slice syntax doesn't contain * even though it points to data 🤷‍♂️
* in [*N] is ambiguous with * as a dereference operator

So if we went this route we'd need a new dereferencing operator. Personally I favor that anyway since prefix * is already a bit weird.

var z = x**y;  //x * (*y)
var a = *s.m.v; //dereferences "v", not s

One thought I had, and I realize this is a bit strange but hear me out, is postfix .. My reasoning is that . already sort of dereferences implicitly when used with structs.

var s = MyStruct{.v = 10};
var sp = &s;
var v = sp.v;

so all we'd be doing is extending this property to non-structs, really.

var s = i32(25);
var sp = &s;
var v = sp.;  //dot dereferences sp
//...
var p = Point{.x = 0, .y = 10};
var pp = &&p;
var x = pp..x;  //currently (*pp).x

Advantages:

No new symbols
Consistent with current usage of symbol

Disadvantages:

Subtle, might be easy to miss
Unfamiliar to users of other languages
Confusing when used with range operators .. and ...

So then we'd need new range operators too. -> is available unless it is resurrected for return types.

Other options for deref:

Pre or postfix ^ which has noted issues with some keyboards and would be ambiguous with xor in the postfix case.
Pre or postfix $, which is not unprecedented but is admittedly kind of ugly. introduces a new symbol. could confuse people used to scripting languages.
Postfix [0], which is very C and unambiguous, but does make the pointer look like a block even though it isn't.
Postfix .0, which looks really strange and kind of implies some kind of indexing.
Postfix >, unambiguous as far as I can tell, easy to spot. Everyone seems to hate it though.

thejoshwolfe commented 6 years ago

[*N] pointer to a block of memory with comptime known length

Why not *[N]? That already means a pointer to a block of memory with comptime known length according to the rest of the proposal, and it doesn't have the ambiguity with *N.

[*N]null pointer to a block of memory with comptime known length, and a null/0 at ptr[N]

Instead of this concept, introduce [N]null which is an array of comptime known length with a null/0 at arr[N]. Then we just do *[N]null for a pointer to it.

Other options for deref:

Zig's grammar depends on knowing if we are at the end of an expression or in the middle of an operator. This means we can't have postfix operators that are identical to infix operators. Here's an example of the ambiguity using ^ as proposed for postfix pointer deref:

// this is ambiguous
const a = b^(1);

// b is a pointer to a function (or double pointer to a function),
// which is being called and given the parameter 1.
const a = (b^)(1);

// b is some integer being xor'ed with 1.
const a = b ^ 1;

We absolutely cannot have ambiguity between infix and postfix operators. This means ^, *, and > can't be used as postfix operators.

There is no problem with ambiguity of infix and prefix operators though, such as with * for deref and multiplication, - for negative and subtraction, & for address of and bitwise and, etc. The ability to distinguish between prefix and infix operators is what we get for having the above limitation with postfix and infix. And since it's so important to have - and ( as infix and postfix operators, the tradeoff to allow prefix/infix instead of postfix/infix is a no-brainer.

One thought I had, and I realize this is a bit strange but hear me out, is postfix ..

This actually does not suffer too horribly from the above ambiguity concern, because . isn't really an infix operator. After a ., you have to have an identifier, and an identifier can't ever be a postfix or infix operator. That being said, it still looks pretty horrible.

a.0 doesn't seem as bad, because at least there's something after the dot. This eliminates the problems with .. as you pointed out. But a.0 is bad because it begs the question: can I do a.1? It looks like the 0 is actually a zero, when it's really more of a keyword. 0 doesn't function like that in any other syntactic context, so this seems pretty bad too.

Personally I favor that anyway since prefix * is already a bit weird.

I'm not too concerned about prefix vs postfix, since this language (and pretty much every other language) has operator precedences that will get weird sometimes like this. Instead of moving operators around, I would propose requiring parentheses sometimes, if the confusion is bad enough. See #114.

The biggest problem I have with * meaning "pointer to" and * meaning "deref pointer" is that those are opposite directions. C has this problem, and I think it's one of the reasons children find pointers hard to learn in C (that and lack of respectable compile errors). If * means "pointer to", then we do need a new "pointer deref" operator, but it could be prefix or postfix. We could use ^ for prefix deref, or even >. There's a lot of space for new prefix operators, but less so with postfix/infix operators.

tjpalmer commented 6 years ago

Ignoring the syntax, I am really happy with the idea of separating pointer-to-1 from pointer-to-unknown-quantity and so on. (Sad that caret doesn't work easily.)

raulgrell commented 6 years ago

a.0 doesn't seem as bad, because at least there's something after the dot. This eliminates the problems with .. as you pointed out. But a.0 is bad because it begs the question: can I do a.1? It looks like the 0 is actually a zero, when it's really more of a keyword. 0 doesn't function like that in any other syntactic context, so this seems pretty bad too.

We could have a postfix .&. Borrowing @tgschultz's example:

var s = i32(25);
var sp = &s;
var v = sp.&;  // .& dereferences sp

var p = Tuple {.x = s, .y = sp, .z = &sp};
var pp = &&p;
var x: i32 = pp.&.x;  //currently (*pp).x, 
var y: i32 = pp.&.y.&; // currently *(*pp).y
var z: i32 = pp.&.z.&.&; // currently **(*pp).z

It doesn't imply indexing, no ambiguities with range syntax, analogous to fields in structs: get value at field vs get value at address.

thejoshwolfe commented 6 years ago

.& looks pretty weird, and it would be the only postfix operator in zig with no variability. The other postfix operators are x{...}, x(...), x[...], x.y, which all take some form of variability as an effective parameter. But I admit I appreciate the symmetry between &x and x.& for the two different ref/deref directions.

I don't think the postfix chaining argument, e.g. ptr.&.&[0].&(0), is very strong, because I think double pointer dereferencing is pretty rare. Even better than array_pointer.&[0] would be array_point[0] that includes an implicit .&. This is already how struct_pointer.member and function_pointer(0) work. We don't want double implicit .&.&, because as claimed above, the double pointer deref usecase is pretty rare, and I think that would be too confusing.

I think .& is the best thing proposed so far for a new pointer deref operator. I'll post an updated proposal in a moment...

thejoshwolfe commented 6 years ago

(I'm writing this all up in a GitHub issue, but this is intended to go into the documentation somewhere.)

Note: In the following discussion, there is sometimes a distinction between the length of an array and the len field of a thing. An array of length n is defined to be n elements arranged consecutively in memory. The len field of a thing is defined to be whatever is convenient and meaningful in each context it appears.

This proposal does not introduce any new tokens, which means for example that [*]null is equivalent to [ * ] null.

Types

Here T represents any type, which is the "child type". N represents any expression evaluating to a comptime integer. attrs is the place in the syntax where the pointer attributes go, such as const, volatile, align(A), any combination of those, or nothing. The syntax of the attrs is not discussed here, but it's important to note where they go in each syntactic construct.

*attrs T is a pointer to exactly 1 object of type T. No pointer arithmetic.
**attrs T is equivalent to * *attrs T. This rule is only necessary because ** is a token.
[*]attrs T is a pointer to an array of T of unknown length. Pointer arithmetic enabled.
[*]null attrs T is a pointer to array of T of unknown length. There is guaranteed (language-level assertion) to be a null or 0 element in the array somewhere which denotes the last element of the array. "strings" in C APIs are this type. Pointer arithmetic enabled.
[*x]attrs T, where x is any expression, is a syntax error. There is a grammar rule for the start of an expression that a [ followed by a * will always be followed by ] and denote the "arithmetic pointer prefix" as defined above. (A [ followed by a ** will be a syntax error no matter what.)
[]attrs T is a struct with members ptr: [*]attrs T, len: usize, where ptr is a pointer to an array of length len. Subscripting at index i: usize effectively subscripts ptr, and is bounded (language-level assertion) by i < len.
[]null attrs T is a struct with members ptr: [*]null attrs T, len: usize, where ptr is a pointer to an array of length len + 1, and where ptr[len] is guaranteed (language-level assertion) to be null/0. Subscripting at index i: usize effectively subscripts ptr, and is bounded (language-level assertion) by i < len + 1.
[N]T an array of type T of length N. The array has a pseudo field len equal to N. Subscripting at index i: usize is bounded (language-level assertion) by i < N.
[N]null T an array of T of length N + 1 with a null/0 at index N. The array has a pseudo field len equal to N. Subscripting at index i: usize is bounded (language-level assertion) by i < N + 1.
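
A minimal sketch of the three null-terminated forms listed above, assuming a recent Zig compiler. The spelling here is an assumption relative to this proposal: the language as later released writes the terminator as a sentinel value ([*:0]T, [:0]T, [N:0]T) rather than with a null keyword, but the len and indexing rules match the ones described. All variable names are illustrative.

```zig
const std = @import("std");

pub fn main() void {
    // A string literal is a pointer to an array of comptime-known
    // length 5 with a guaranteed 0 stored at index 5.
    const lit = "hello";

    // Counterpart of "[]null": runtime length plus a 0 at ptr[len].
    const slice: [:0]const u8 = lit;
    std.debug.assert(slice.len == 5);
    std.debug.assert(slice[slice.len] == 0); // indexing the terminator is in bounds

    // Counterpart of "[*]null": unknown length, found by scanning for the 0.
    const c_str: [*:0]const u8 = slice.ptr;
    std.debug.assert(std.mem.len(c_str) == 5);

    // Counterpart of "[N]null": len is 3, storage is 3 + 1 elements.
    const arr = [3:0]u8{ 'a', 'b', 'c' };
    std.debug.assert(arr.len == 3);
    std.debug.assert(arr[3] == 0);
}
```

In each case len excludes the terminator while subscripting is bounded by len + 1, which is the distinction the note at the top of this comment draws between the length of an array and its len field.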

Dereferencing

For a pointer p of any pointer type, p.& dereferences the pointer.

This operator is implied in the following contexts:

For an expression p of any pointer type, p.id is equivalent to p.&.id. Status quo.
For an expression p of any pointer type, p() is equivalent to p.&(). Status quo.
For an expression p of a pointer type with no pointer arithmetic, p[i] is effectively p.&[i]. This allows a pointer of type *attrs [N]T to behave the same as a slice of type []attrs T (and the null variants respectively) with respect to subscripting and the len field.

These implicit dereferences do not apply to an expression that is the result of applying one of these implicit dereference rules. For example, p.id is never equivalent to p.&.&.id.
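
The dereferencing rules above can be sketched as follows, assuming a recent Zig compiler. One assumption to flag: the explicit operator is written x.* here, which is the spelling recorded in the progress checklist later in this thread, not the p.& spelling used in this comment. Point and the variable names are hypothetical.

```zig
const std = @import("std");

// Hypothetical struct used only for this illustration.
const Point = struct { x: i32, y: i32 };

pub fn main() void {
    var p = Point{ .x = 0, .y = 10 };
    const pp: *Point = &p;

    // Explicit dereference copies the pointed-to value.
    const copy: Point = pp.*;
    std.debug.assert(copy.y == 10);

    // Field access dereferences implicitly: pp.x means pp.*.x.
    pp.x = 5;
    std.debug.assert(p.x == 5);

    // A pointer to an array supports subscripting and len, like a slice.
    var array = [4]i32{ 1, 2, 3, 4 };
    const ap: *[4]i32 = &array;
    ap[1] = 20;
    std.debug.assert(ap.len == 4);
    std.debug.assert(array[1] == 20);

    // Only one level is implicit: a double pointer needs an explicit deref.
    const ppp: *const *Point = &pp;
    std.debug.assert(ppp.*.x == 5);
}
```

Writing ppp.x in the last step would be rejected, which matches the rule that an implicit dereference is never applied twice.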

Pointer Arithmetic

For a pointer p of a pointer type with pointer arithmetic enabled, the following operators are allowed, where i is any expression of type usize or isize.

p[i] is equivalent to (p + i).&.

The following infix operators are allowed, but are not precisely defined here. Informally, these operations are defined similar to C, where i is multiplied by the size of p's child type, and then added to or subtracted from the integer value of p. These operators are not commutative; the pointer operand has to be on the left. For example, i + p is a type error.

p + i
p - i
p +% i
p -% i

Pointer subtraction is also allowed in some cases. Given p1: [*]attrs1 T or p1: [*]null attrs1 T and p2: [*]attrs2 T or p2: [*]null attrs2 T, the following operators are sometimes allowed:

p2 - p1 is of type isize with a value such that @ptrToInt(p1 + (p2 - p1)) == @ptrToInt(p2). This is a type error if @typeOf(p1).alignment < @sizeOf(T) or @typeOf(p2).alignment < @sizeOf(T). This is a runtime assertion failure if p2 - p1 would be outside the range of values for type isize.
p2 -% p1 is of type isize with a value such that @ptrToInt(p1 +% (p2 -% p1)) == @ptrToInt(p2). This has the same alignment rule as p2 - p1. There is no runtime assertion.
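
A small sketch of the p + i, p - i and p[i] rules above, assuming a recent Zig compiler. Because the progress checklist later in this thread records that x.* was disabled for [*]T, the proposal's (p + i).& is written (p + i)[0] here. Pointer subtraction between two pointers is left out, since its availability is not confirmed anywhere in this thread.

```zig
const std = @import("std");

pub fn main() void {
    var buf = [5]i32{ 10, 20, 30, 40, 50 };

    // A pointer to the array coerces to an unknown-length pointer.
    // From here on, staying in bounds is the programmer's job.
    const p: [*]i32 = &buf;

    // p[i] reads the element i items past p, the same as (p + i)[0].
    std.debug.assert(p[2] == 30);
    std.debug.assert((p + 2)[0] == 30);

    // Offsets are scaled by the size of the child type, as in C.
    const q = p + 4;
    std.debug.assert(q[0] == 50);
    std.debug.assert((q - 3)[0] == 20);

    // Writes go through the same addressing.
    (p + 1)[0] = 21;
    std.debug.assert(buf[1] == 21);
}
```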

Implicit Casting

TODO

kyle-github commented 6 years ago

@thejoshwolfe, you had slightly different syntax a couple of days ago. At the risk of bike shedding, it seems like the * before the type being pointed to makes more sense given the way the rest of the syntax works. I.e. in the rest of Zig:

T some type T.
[]T some run-time bounds-checked array of T.
*T a pointer to some type T.

So if T is []i32, then a pointer to that should be *[]i32. Putting the * inside the [] seems inconsistent. Type modifiers are then all right associative (I think, I might be reversing that...).

If you want to make pointers to single objects a supported kind of thing in the language, perhaps make them act like transparent references:

var x: i32;
var y: &i32 = &x; //y and x refer to the same location in memory/register.
                          // y aliases x.

x = 14;
assert(y == 14)...

y= 42;
assert(x == 42)...

C++ has moved to this separation of references and pointers (even though we all know that under the hood a reference is syntactic sugar around a pointer!) and it makes a lot of code cleaner. Think about functions that take or return references. This way, there is no dereference at all for an alias/reference.

Then you can use * for pointers on which arithmetic is possible.

On a more frivolous note. Here are some other ideas for the various kinds of arrays.

[#]T - an array of T, run-time checked with a hidden fixed bound. Is undefined at declaration.
[42]T - where 42 is some integer number, an array of T with a bound known at compile time. Is defined at declaration. No hidden bound stored since it is known at compile time.
[?]T - is a zero-terminated array. No hidden bound. The "zero" value depends on the type T. The length is calculated at runtime. Should this be the total memory occupied or one less like C? I used ? because that is already used to mean "might be null". Season to taste.

The first one is identical to a C99 dynamic array, the second to a normal C array and the third can be used for C strings.

So then you get:

[10]&i32 - an array of ten references to i32.
[#]&i32 - an array of references to i32, size known at run time.
[?]u8 - a C-style string.
*i32 - a pointer, with arithmetic, to i32 values.

var x:i32;
var y:&i32 = &x;
y = x; // meaningless since x and y are aliased.
x = 42; // now y == 42 as well.
var a = x + 1; // perfectly legal, a is now 43, no pointer action here.

var p: [?]u8 = func_that_returns_a_c_string(blah); 
var q: *u8 = *p[42]; // not sure about this...
q = p +1; // valid

Apart from the frivolity, I really like the idea of having a pointer to one object (otherwise known as a reference or alias in other languages) and a pointer on which arithmetic can be done. This is really nice!

raulgrell commented 6 years ago

@thejoshwolfe the proposal looks great, though I almost thought the .& was too ugly to propose!

While we're discussing different kinds of arrays, what do you think of an enum array? It has a length equal to the member count of the enum and can only be indexed with an enum value.

const Axis = enum { X, Y, Z};
const vec3 = [Axis]f32 {0.0, 0.0, 0.0};
vec3[Axis.X] == 0.0;

You can get close to this with status quo Zig by specifying the tag type and casting

const Value = enum(u2) {  Zero, One, Two };
const vals = [3]i32{3, 4, 5};
vals[u2(Value.Zero)] == 3;

But then if you change the number of elements, the backing type of the enum or override the values, you need to change a lot of code. And you could still access it with arbitrary integers, so if at any point the index into the array was hardcoded, it would have to be found.

An enum array basically becomes a comptime-checked map!

It could be approximately implemented in userland with something like this if we had a memberIndex built-in or something:

fn EnumArray(comptime T: type, comptime U: type) type {
    return struct {
        data: [@memberCount(T)]U,

        const Self = this;

        // @memberIndex is the hypothetical built-in mentioned above.
        fn get(self: &const Self, tag: T) U {
            return self.data[@memberIndex(tag)];
        }

        fn set(self: &Self, tag: T, value: U) void {
            self.data[@memberIndex(tag)] = value;
        }
    };
}

// var rather than const, because set() below mutates it.
var value_map = EnumArray(Value, i32) {
    .data = []i32{3, 4, 5},
};

value_map.get(Value.Zero) == 3;
value_map.set(Value.Two, 10);

thejoshwolfe commented 6 years ago

While we're discussing different kinds of arrays,

I broke that out into its own issue: #793

jido commented 6 years ago

[N]null T an array of T of length N + 1 with a null/0 at index N. The array has a pseudo field len equal to N. Subscripting at index i: usize is bounded (language-level assertion) by i < N + 1.

Does the null/0 have to be at index N in that case? C strings are stored in fixed length array but the string length can vary, it is not necessarily equal to the array size. The same applies to null-terminated C arrays.
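
The situation in this question can be made concrete with a short sketch, assuming a recent Zig compiler and its std.mem.sliceTo helper. It shows a fixed 16-byte buffer whose capacity differs from the length of the string stored in it. The buffer size and contents are made up for illustration.

```zig
const std = @import("std");

pub fn main() void {
    // 16 bytes of zero-filled storage holding the 2-byte string "hi".
    var buf = [_]u8{0} ** 16;
    buf[0] = 'h';
    buf[1] = 'i';

    // The array's length is its capacity...
    std.debug.assert(buf.len == 16);

    // ...while the string's length is found by scanning for the first 0.
    const text = std.mem.sliceTo(&buf, 0);
    std.debug.assert(text.len == 2);
    std.debug.assert(std.mem.eql(u8, text, "hi"));
}
```

Here the terminator sits at index 2, not at the last index, so a type that promises a 0 at exactly index N describes a different guarantee than "a C string somewhere inside this buffer".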

KingOfThePirates commented 6 years ago

I usually wouldn't comment when I don't have competence in the area, in this case pointers, but I feel compelled to share my abstract thoughts. I hope one of these brainstorming ideas could either be or lead to useful ideas:

*variable_name to set. /variable_name to dereference

1*variable_name could work as a singular pointer
*variable_name: T is a C pointer
const &variable_name is a C++ reference pointer
*variable_name = Pointer.deferAlloc() for runtime (in concept, it doesn't have to be a class with a method - but I felt it didn't fit in the same realm as single character symbols next to the asterisk)
@variable_name deletes a pointer and also sets it to null, and maybe do more pointer management
**variable_name get the address

T is also where you could put your brackets if you need to declare the type as an array

Speaking of arrays, something like [@] or []@ could mean a C string. @ seems to be the easiest symbol for null after looking at some wikipedia pages.

andrewrk commented 6 years ago

[x] change deref syntax to x.*
[x] change pointer syntax from & to *
[x] add syntax for [*] pointers
[x] disable indexing for single-item pointers
[x] add pointer arithmetic for unknown length pointers
[x] enable indexing on *[N]T. e.g. array_ptr[1] and array_ptr.len
[x] disable slicing for single-item pointers
[x] disable field access for unknown length pointers
[x] disable x.* deref for [*]T (instead must use x[0])
[x] disable implicit cast from T to [*]const T
[x] update @typeInfo for pointers
[x] add implicit casting
- [x] *[N]T to []T
- [x] *[N]T to [*]T
[x] look over ir.cpp at ir_types_match_with_implicit_cast and make sure no implicit casts are done with the wrong pointer length.
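
A minimal sketch of a few checklist items, assuming a recent Zig compiler: x.* dereference, no indexing on single-item pointers, indexing and len on *[N]T, and the implicit casts from *[N]T to []T and [*]T. The helper functions sum and first are hypothetical names introduced for this example.

```zig
const std = @import("std");

// Hypothetical helper taking a slice (pointer plus runtime length).
fn sum(items: []const i32) i32 {
    var total: i32 = 0;
    for (items) |item| total += item;
    return total;
}

// Hypothetical helper taking an unknown-length pointer.
fn first(items: [*]const i32) i32 {
    return items[0];
}

pub fn main() void {
    var array = [3]i32{ 3, 4, 5 };

    // Single-item pointer: dereference with .*, no indexing.
    const one: *i32 = &array[0];
    one.* += 1; // array is now { 4, 4, 5 }
    // one[0] would be a compile error here.

    // Pointer to array: indexing and len are available.
    const whole: *[3]i32 = &array;
    std.debug.assert(whole.len == 3);
    std.debug.assert(whole[1] == 4);

    // *[N]T coerces implicitly to []T and to [*]T.
    std.debug.assert(sum(whole) == 13);
    std.debug.assert(first(whole) == 4);
}
```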

andrewrk commented 6 years ago

I just pushed 96164ce61377b36bcaf0c4087ca9b1ab822b9457 which disables indexing for single-item pointers and enables pointer arithmetic for unknown length pointers.
