Closed andrewrk closed 6 years ago
If possible I would like to have some other symbol/keyword for:
^ pointer to exactly 1 thing.
This is because ^ isn't ergonomic on all keyboard layouts. On Windows with some international/non-English keyboard layouts you have to type it twice to get ^^ and then remove the extra one every time you write it. See this superuser question. Sure, it is used for bitwise xor, but you tend to have more pointers in your program than bitwise xors.
That's good to know. Do you have a suggestion for what other symbol to use?
To be honest, I have only a few possibly reasonable ideas, as most commonly used or reasonable sigils have already been used for something in Zig.
Is the new array syntax flexible like multiplication? Can I do these?
var array1: i32*4 = undefined;
var array2: 4*4*i32 = undefined;
var array3: 4*(4*i32) = undefined;
Or is it restricted to <comptime_int>*<type>? That is, would I have to write array2 like this:
var array2: (4*4)*i32 = undefined;
@Hejsil great questions. I think that new array syntax is no good because of this. But we have to make something different than [N]T to distinguish from pointers.
4*i32 is no good, because the * operator elsewhere doesn't change the type of something. Even ** turns arrays with a child type into another array with the same child type; it only changes the size. We need a way to turn a scalar into an array, which is not like any infix operator in the language.
Perhaps array[4]i32 where array is a keyword? Having a [4]i32 somewhere in there fits nicely with [4]i32 being a pointer to such.
I expect people will mistakenly declare their structs with [4]i32 instead of whatever the actual array syntax ends up being. Then they'll initialize the pointer to undefined thinking they initialized the elements to undefined, and then begin assigning into the elements, which will cause undefined behavior at runtime. Seems like a footgun.
I agree with @Ilariel; my keyboard is one of these (Portuguese). If you press the ^ key once, followed by a vowel, you get, for example, â. With a consonant you get, for example, ^w. But tapping twice and erasing can break your typing flow. It's been surprisingly hard to build muscle memory for it. Other characters like this include the tilde ~ and the backtick/grave accent. The caret/circumflex accent ^ requires shift to be pressed and is on the same key as the tilde. The backtick/grave is on the same key as the very similar looking acute accent. I'm not sure about other keyboard layouts, but these are very uncomfortable characters on Portuguese keyboards.
I also liked @Ilariel's 2nd and 3rd suggestions. I don't dislike the idea of a ref/ptr keyword, but I find &?&&?T more readable than ref ? ref ref ? T for types. I think I'd prefer to keep the & as a reference type and instead use a builtin or an operator like # or $ for address-of.
As for the array syntax, would N[]T be possible? An array is a pointer to a block of memory with runtime known length just like status quo slices, so it's conceptually consistent at least.
const a: 3[]u32 = 3[]u32{1, 2, 3};
const a = 3[]u32{1, 2, 3};
const b = (2*2)[]u32{1, 2, 3, 4};
@thejoshwolfe made a good point about * not changing something's type, and ** changing something's size. But changing an array of one size into an array of another size is also changing the type of the array. Why not use ** itself?
const a: 3**u32 = 3**u32{1, 2, 3};
const a = 3**u32{1, 2, 3};
const b = (2*2)**u32{1, 2, 3, 4};
const c = a ++ b;
It communicates that you're creating a value type which is the result of putting N units of that value type together. If you allow ** and ++ to operate on scalars, the following could create an anonymous tuple.
const T = 4**u32 ++ bool;
const t : T = 4**u32{1,2,3,4} ++ true;
const t = 4**u32{1,2,3,4} ++ true;
We can't use ** for making arrays out of scalars, because types are comptime values.
const State = 256**u8; // formerly known as [256]u8
const States = 4 ** State; // is this [1024]u8 or [4][256]u8?
I thought both the ++ and ** operators were already only available with comptime values. In the example you gave, perhaps:
const State = 256**u8; // formerly known as [256]u8
const States = 4**State; // This would be [4][256]u8, ie (4 **(256**u8))
const States = (4 * State.len) ** State.child_type; // this is [1024]u8
Though I can see how this isn't ideal.
EDIT
Just consolidating my 2 cents on the pointer-to discussion after a bit of thought.
In C you have & as the address-of operator, and pointers are declared with a *.
In C++ you have std::addressof, and references are declared with a &.
In Zig, the 'pointer to exactly one thing' is closer to a C++ reference than a C pointer, so it would make sense to stay close to their interface, and use a builtin @addressOf function.
So that it is documented and can be discussed here, we came up with a possible solution in IRC:
* - pointer to exactly 1 thing.
[*] - pointer to a block of memory of unknown length
[*]null - pointer to block of memory, null-terminated (or 0 terminated for integers). #265
[] - pointer to a block of memory with runtime known length. status quo slices.
[]null - pointer to a block of memory with runtime known length, with a null/0 at ptr[len]
[*N] - pointer to a block of memory with comptime known length
[*N]null - pointer to a block of memory with comptime known length, and a null/0 at ptr[N]
[N] - block of memory (an array).
Advantages:
* is a familiar symbol for "pointer".
* doesn't have the keyboard issues ^ does.
*, [*], [*N] are consistent.
[ ]
Disadvantages:
* even though it points to data 🤷♂️
* in [*N] is ambiguous with * as a dereference operator
So if we went this route we'd need a new dereferencing operator. Personally I favor that anyway since prefix * is already a bit weird.
var z = x**y; //x * (*y)
var a = *s.m.v; //dereferences "v", not s
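Both readings in the two lines above have direct C counterparts, which may help when weighing how odd prefix * really is. The sketch below is C rather than Zig, and the struct and function names are made up for illustration. One difference to keep in mind: C has no ** token, so x**y is simply two * tokens in a row, whereas in Zig ** is a token of its own.

```c
/* Hypothetical types, only here to give s.m.v something to refer to. */
struct inner { int *v; };
struct outer { struct inner m; };

/* x**y is read as x * (*y): multiply x by the value y points to. */
int multiply_through_pointer(int x, const int *y) {
    return x**y;
}

/* *s.m.v is read as *(s.m.v): member access binds tighter than prefix *,
   so the pointer being dereferenced is v, not s. */
int read_through_member(struct outer s) {
    return *s.m.v;
}
```

With y pointing at 4, multiply_through_pointer(3, &y) evaluates 3 * 4, and read_through_member returns whatever the innermost v points at.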
One thought I had, and I realize this is a bit strange but hear me out, is a postfix dot (.). My reasoning is that . already sort of dereferences implicitly when used with structs.
var s = MyStruct{.v = 10};
var sp = &s;
var v = sp.v; // . reaches through the pointer to the field v
so all we'd be doing is extending this property to non-structs, really.
var s = i32(25);
var sp = &s;
var v = sp.; //dot dereferences sp
//...
var p = Point{.x = 0, .y = 10};
var pp = &p;
var x = pp..x; // currently (*pp).x
Advantages:
Disadvantages:
Postfix . collides with the range operators .. and ..., so we'd need new range operators too. -> is available unless it is resurrected for return types.
Other options for deref:
^ - which has noted issues with some keyboards and would be ambiguous with xor in the postfix case.
$ - which is not unprecedented but is admittedly kind of ugly. Introduces a new symbol. Could confuse people used to scripting languages.
[0] - which is very C and unambiguous, but does make the pointer look like a block even though it isn't.
.0 - which looks really strange and kind of implies some kind of indexing.
> - unambiguous as far as I can tell, easy to spot. Everyone seems to hate it though.
[*N] pointer to a block of memory with comptime known length
Why not *[N]? That already means a pointer to a block of memory with comptime known length according to the rest of the proposal, and it doesn't have the ambiguity with *N.
[*N]null pointer to a block of memory with comptime known length, and a null/0 at ptr[N]
Instead of this concept, introduce [N]null which is an array of comptime known length with a null/0 at arr[N]. Then we just do *[N]null for a pointer to it.
Other options for deref:
Zig's grammar depends on knowing if we are at the end of an expression or in the middle of an operator. This means we can't have postfix operators that are identical to infix operators. Here's an example of the ambiguity using ^ as proposed for postfix pointer deref:
// this is ambiguous
const a = b^(1);
// b is a pointer to a function (or double pointer to a function),
// which is being called and given the parameter 1.
const a = (b^)(1);
// b is some integer being xor'ed with 1.
const a = b ^ 1;
We absolutely cannot have ambiguity between infix and postfix operators. This means ^, *, and > can't be used as postfix operators.
There is no problem with ambiguity of infix and prefix operators though, such as with * for deref and multiplication, - for negative and subtraction, & for address of and bitwise and, etc. The ability to distinguish between prefix and infix operators is what we get for having the above limitation with postfix and infix. And since it's so important to have - and ( as infix and postfix operators, the tradeoff to allow prefix/infix instead of postfix/infix is a no-brainer.
One thought I had, and I realize this is a bit strange but hear me out, is postfix .
This actually does not suffer too horribly from the above ambiguity concern, because . isn't really an infix operator. After a ., you have to have an identifier, and an identifier can't ever be a postfix or infix operator. That being said, it still looks pretty horrible.
a.0 doesn't seem as bad, because at least there's something after the dot. This eliminates the problems with .. as you pointed out. But a.0 is bad because it begs the question: can I do a.1? It looks like the 0 is actually a zero, when it's really more of a keyword. 0 doesn't function like that in any other syntactic context, so this seems pretty bad too.
Personally I favor that anyway since prefix * is already a bit weird.
I'm not too concerned about prefix vs postfix, since this language (and pretty much every other language) has operator precedences that will get weird sometimes like this. Instead of moving operators around, I would propose requiring parentheses sometimes, if the confusion is bad enough. See #114.
The biggest problem I have with * meaning "pointer to" and * meaning "deref pointer" is that those are opposite directions. C has this problem, and I think it's one of the reasons children find pointers hard to learn in C (that and lack of respectable compile errors). If * means "pointer to", then we do need a new "pointer deref" operator, but it could be prefix or postfix. We could use ^ for prefix deref, or even >. There's a lot of space for new prefix operators, but less so with postfix/infix operators.
Ignoring the syntax, I am really happy with the idea of separating pointer-to-1 from pointer-to-unknown-quantity and so on. (Sad that caret doesn't work easily.)
a.0 doesn't seem as bad, because at least there's something after the dot. This eliminates the problems with .. as you pointed out. But a.0 is bad because it begs the question: can I do a.1? It looks like the 0 is actually a zero, when it's really more of a keyword. 0 doesn't function like that in any other syntactic context, so this seems pretty bad too.
We could have a postfix .& operator. Borrowing @tgschultz's example:
var s = i32(25);
var sp = &s;
var v = sp.&; // .& dereferences sp
var p = Tuple {.x = s, .y = sp, .z = &sp};
var pp = &p;
var x: i32 = pp.&.x; // currently (*pp).x
var y: i32 = pp.&.y.&; // currently *(*pp).y
var z: i32 = pp.&.z.&.&; // currently **(*pp).z
It doesn't imply indexing, no ambiguities with range syntax, analogous to fields in structs: get value at field vs get value at address.
.& looks pretty weird, and it would be the only postfix operator in zig with no variability. The other postfix operators are x{...}, x(...), x[...], x.y, which all take some form of variability as an effective parameter. But I admit I appreciate the symmetry between &x and x.& for the two different ref/deref directions.
I don't think the postfix chaining argument, e.g. ptr.&.&[0].&(0), is very strong, because I think double pointer dereferencing is pretty rare. Even better than array_pointer.&[0] would be array_pointer[0] that includes an implicit .&. This is already how struct_pointer.member and function_pointer(0) work. We don't want double implicit .&.&, because as claimed above, the double pointer deref usecase is pretty rare, and I think that would be too confusing.
I think .& is the best thing proposed so far for a new pointer deref operator. I'll post an updated proposal in a moment...
(I'm writing this all up in a GitHub issue, but this is intended to go into the documentation somewhere.)
Note: In the following discussion, there is sometimes a distinction between the length of an array and the len field of a thing. An array of length n is defined to be n elements arranged consecutively in memory. The len field of a thing is defined to be whatever is convenient and meaningful in each context it appears.
This proposal does not introduce any new tokens, which means for example that [*]null is equivalent to [ * ] null.
Here T represents any type, which is the "child type". N represents any expression evaluating to a comptime integer. attrs is the place in the syntax where the pointer attributes go, such as const, volatile, align(A), any combination of those, or nothing. The syntax of the attrs is not discussed here, but it's important to note where they go in each syntactic construct.
*attrs T is a pointer to exactly 1 object of type T. No pointer arithmetic.
**attrs T is equivalent to * *attrs T. This rule is only necessary because ** is a token.
[*]attrs T is a pointer to an array of T of unknown length. Pointer arithmetic enabled.
[*]null attrs T is a pointer to an array of T of unknown length. There is guaranteed (language-level assertion) to be a null or 0 element in the array somewhere which denotes the last element of the array. "strings" in C APIs are this type. Pointer arithmetic enabled.
[*x]attrs T, where x is any expression, is a syntax error. There is a grammar rule for the start of an expression that a [ followed by a * will always be followed by ] and denote the "arithmetic pointer prefix" as defined above. (A [ followed by a ** will be a syntax error no matter what.)
[]attrs T is a struct with members ptr: [*]attrs T, len: usize, where ptr is a pointer to an array of length len. Subscripting at index i: usize effectively subscripts ptr, and is bounded (language-level assertion) by i < len.
[]null attrs T is a struct with members ptr: [*]null attrs T, len: usize, where ptr is a pointer to an array of length len + 1, and where ptr[len] is guaranteed (language-level assertion) to be null/0. Subscripting at index i: usize effectively subscripts ptr, and is bounded (language-level assertion) by i < len + 1.
[N]T is an array of type T of length N. The array has a pseudo field len equal to N. Subscripting at index i: usize is bounded (language-level assertion) by i < N.
[N]null T is an array of T of length N + 1 with a null/0 at index N. The array has a pseudo field len equal to N. Subscripting at index i: usize is bounded (language-level assertion) by i < N + 1.
For a pointer p of any pointer type, p.& dereferences the pointer.
This operator is implied in the following contexts:
p of any pointer type, p.id is equivalent to p.&.id. Status quo.
p of any pointer type, p() is equivalent to p.&(). Status quo.
p of a pointer type with no pointer arithmetic, p[i] is effectively p.&[i]. This allows a pointer of type *attrs [N]T to behave the same as a slice of type []attrs T (and the null variants respectively) with respect to subscripting and the len field.
These implicit dereferences do not apply to an expression that is the result of applying one of these implicit dereference rules. For example, p.id is never equivalent to p.&.&.id.
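To make the slice entries in the list above concrete, here is a rough C stand-in for []attrs T and []null attrs T with T fixed to int. This is only a sketch: the struct and function names are invented, and a plain assert stands in for what the proposal calls a language-level assertion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of the proposed slice: a pointer plus a length. */
struct int_slice {
    int *ptr;
    size_t len;
};

/* Plain slice: subscripting is bounded by i < len. */
int slice_get(struct int_slice s, size_t i) {
    assert(i < s.len);
    return s.ptr[i];
}

/* null variant: ptr[len] holds the 0 terminator, so the bound is i < len + 1. */
int slice_null_get(struct int_slice s, size_t i) {
    assert(i < s.len + 1);
    return s.ptr[i];
}
```

For a buffer {1, 2, 3, 0} wrapped with len 3, slice_null_get at index 3 reads the terminator, while slice_get at the same index trips the assertion.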
For a pointer p of a pointer type with pointer arithmetic enabled, the following operators are allowed, where i is any expression of type usize or isize.
p[i] is equivalent to (p + i).&.
The following infix operators are allowed, but are not precisely defined here. Informally, these operations are defined similar to C, where i is multiplied by the size of p's child type, and then added to or subtracted from the integer value of p. These operators are not commutative; the pointer operand has to be on the left. For example, i + p is a type error.
p + i
p - i
p +% i
p -% i
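Since the operators are described as "similar to C", a small C sketch of the scaling rule may be useful for comparison. The function names are made up, and the pointers are assumed to stay inside one array, as C requires. Note one difference from the proposal: C accepts i + p as well as p + i.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte distance covered by p + i: i scaled by the size of the child type. */
ptrdiff_t byte_offset(const int32_t *p, ptrdiff_t i) {
    return (const char *)(p + i) - (const char *)p;
}

/* p[i] written out as a dereference of (p + i), matching the rule above. */
int32_t index_via_add(const int32_t *p, size_t i) {
    return *(p + i);
}
```

For an int32_t array, byte_offset(a, 3) comes out as 3 * sizeof(int32_t), and index_via_add(a, 2) reads the same element as a[2].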
Pointer subtraction is also allowed in some cases. Given p1: [*]attrs1 T or p1: [*]null attrs1 T and p2: [*]attrs2 T or p2: [*]null attrs2 T, the following operators are sometimes allowed:
p2 - p1 is of type isize with a value such that @ptrToInt(p1 + (p2 - p1)) == @ptrToInt(p2). This is a type error if @typeOf(p1).alignment < @sizeOf(T) or @typeOf(p2).alignment < @sizeOf(T). This is a runtime assertion failure if p2 - p1 would be outside the range of values for type isize.
p2 -% p1 is of type isize with a value such that @ptrToInt(p1 +% (p2 -% p1)) == @ptrToInt(p2). This has the same alignment rule as p2 - p1. There is no runtime assertion.
TODO
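The defining identity for p2 - p1 is the same one C pointer subtraction satisfies for two pointers into the same array. A minimal C sketch, with invented function names and assuming both pointers come from one array:

```c
#include <stddef.h>

/* Number of elements (not bytes) from p1 to p2. */
ptrdiff_t element_distance(const int *p1, const int *p2) {
    return p2 - p1;
}

/* Moves p by n elements; used to check p1 + (p2 - p1) == p2. */
const int *advance(const int *p, ptrdiff_t n) {
    return p + n;
}
```

Advancing p1 by element_distance(p1, p2) lands exactly on p2, which is the property the proposal states in terms of @ptrToInt.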
@thejoshwolfe, you had slightly different syntax a couple of days ago. At the risk of bike shedding, it seems like the * before the type being pointed to makes more sense given the way the rest of the syntax works. I.e. in the rest of Zig:
T - some type T.
[]T - some run-time bounds-checked array of T.
*T - a pointer to some type T.
So if T is []i32, then a pointer to that should be *[]i32. Putting the * inside the [] seems inconsistent. Type modifiers are then all right associative (I think, I might be reversing that...).
If you want to make pointers to single objects a supported kind of thing in the language, perhaps make them act like transparent references:
var x: i32;
var y: &i32 = &x; //y and x refer to the same location in memory/register.
// y aliases x.
x = 14;
assert(y == 14)...
y= 42;
assert(x == 42)...
C++ has moved to this separation of references and pointers (even though we all know that under the hood a reference is syntactic sugar around a pointer!) and it makes a lot of code cleaner. Think about functions that take or return references. This way, there is no dereference at all for an alias/reference.
Then you can use * for pointers on which arithmetic is possible.
On a more frivolous note, here are some other ideas for the various kinds of arrays.
? because that is already used to mean "might be null". Season to taste.
The first one is identical to a C99 dynamic array, the second to a normal C array and the third can be used for C strings.
So then you get:
[10]&i32 - an array of ten references to i32.
[#]&i32 - an array of references to i32, size known at run time.
[?]u8 - a C-style string.
*i32 - a pointer, with arithmetic, to i32 values.
var x:i32;
var y:&i32 = &x;
y = x; // meaningless since x and y are aliased.
x = 42; // now y == 42 as well.
var a = x + 1; // perfectly legal, a is now 43, no pointer action here.
var p: [?]u8 = func_that_returns_a_c_string(blah);
var q: *u8 = *p[42]; // not sure about this...
q = p +1; // valid
Apart from the frivolity, I really like the idea of having a pointer to one object (otherwise known as a reference or alias in other languages) and a pointer on which arithmetic can be done. This is really nice!
@thejoshwolfe the proposal looks great, though I almost thought the .& was too ugly to propose!
While we're discussing different kinds of arrays, what do you think of an enum array? It has a length equal to the member count of the enum and can only be indexed with an enum value.
const Axis = enum { X, Y, Z};
const vec3 = [Axis]f32 {0.0, 0.0, 0.0};
vec3[Axis.X] == 0.0;
You can get close to this with status quo Zig by specifying the tag type and casting
const Value = enum(u2) { Zero, One, Two };
const vals = [3]i32{3, 4, 5};
vals[u2(Value.Zero)] == 3;
But then if you change the number of elements, the backing type of the enum or override the values, you need to change a lot of code. And you could still access it with arbitrary integers, so if at any point the index into the array was hardcoded, it would have to be found.
An enum array basically becomes a comptime-checked map!
It could be approximately implemented in userland with something like this if we had a memberIndex built-in or something:
fn EnumArray(comptime T: type, comptime U: type) type {
    return struct {
        data: [@memberCount(T)]U,

        const Self = this;

        // @memberIndex is the hypothetical builtin mentioned above.
        fn get(self: &const Self, tag: T) U {
            return self.data[@memberIndex(tag)];
        }

        fn set(self: &Self, tag: T, value: U) void {
            self.data[@memberIndex(tag)] = value;
        }
    };
}

// var rather than const, because set() modifies the data.
var value_map = EnumArray(Value, i32) {
    .data = []i32{3, 4, 5},
};
value_map.get(Value.Zero) == 3;
value_map.set(Value.Two, 10);
While we're discussing different kinds of arrays,
I broke that out into its own issue: #793
[N]null T an array of T of length N + 1 with a null/0 at index N. The array has a pseudo field len equal to N. Subscripting at index i: usize is bounded (language-level assertion) by i < N + 1.
Does the null/0 have to be at index N in that case? C strings are stored in fixed length array but the string length can vary, it is not necessarily equal to the array size. The same applies to null-terminated C arrays.
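The situation the question describes can be sketched in C as follows. The buffer size and function name are made up for illustration; the point is only that the terminator's position depends on the stored text, not on the array's declared size.

```c
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 16

/* Copies src into a fixed BUF_SIZE-byte buffer (truncating if needed)
   and returns the index at which the 0 terminator ends up. */
size_t terminator_index(char buf[BUF_SIZE], const char *src) {
    strncpy(buf, src, BUF_SIZE - 1);
    buf[BUF_SIZE - 1] = '\0';
    return strlen(buf);
}
```

Storing "hi" puts the terminator at index 2 of a 16-byte array, well short of the last index, which is the case a fixed "null at index N" rule would not describe.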
I usually wouldn't comment when I don't have competence in the area, in this case pointers, but I feel compelled to share my abstract thoughts. I hope one of these brainstorming ideas could either be or lead to useful ideas:
*variable_name to set. /variable_name to dereference
1*variable_name could work as a singular pointer
*variable_name: T is a C pointer
const &variable_name is a C++ reference pointer
*variable_name = Pointer.deferAlloc() for runtime (in concept, it doesn't have to be a class with a method, but I felt it didn't fit in the same realm as single character symbols next to the asterisk)
@variable_name deletes a pointer and also sets it to null, and maybe does more pointer management
**variable_name gets the address
T is also where you could put your brackets if you need to declare the type as an array
Speaking of arrays, something like [@] or []@ could mean a C string. @ seems to be the easiest symbol for null after looking at some Wikipedia pages.
x.*
& to *
[*] pointers
x.* deref for [*]T (instead must use x[0])
T to [*]const T
@typeInfo for pointers
*[N]T to []T
*[N]T to [*]T
ir_types_match_with_implicit_cast and make sure no implicit casts are done with the wrong pointer length.
I just pushed 96164ce61377b36bcaf0c4087ca9b1ab822b9457 which disables indexing for single-item pointers and enables pointer arithmetic for unknown length pointers.
EDIT
Progress: https://github.com/ziglang/zig/issues/770#issuecomment-394069958
& - only used for address-of, no longer designates a pointer type. Necessary because of #588
^ - pointer to exactly 1 thing.
[*] - pointer to a block of memory of unknown length
[*]null - pointer to block of memory, null-terminated (or 0 terminated for integers). #265
[] - pointer to a block of memory with runtime known length. status quo slices.
[]null - pointer to a block of memory with runtime known length, with a null/0 at ptr[len]
[N] - pointer to a block of memory with comptime known length
[N]null - pointer to a block of memory with comptime known length, and a null/0 at ptr[N]
All of them support pointer indexing and slicing except ^. Only [*] supports pointer arithmetic. All of them implicitly cast to [*]. []null and [N]null implicitly cast to [*]null. &ptr[x] and &foo always give a ^. ptr[x..y] with comptime known x and y gives a [N]. array[x..] gives a [N].
new array syntax
Now it is clear whether you should do &array or &array[0]. Don't use &array. If you want a [N]T, e.g. a pointer with comptime known length, use array[0..]. If the function wants to access more than one element, you'll do this. Otherwise, &array[0] will give ^T, which would trigger a compile error if the array was length 0, and only this element can be accessed via this pointer.
This paves the way for #733
See also #386
See also #568