enum-backed address spaces

andrewrk commented 4 weeks ago

Problem Statement

The address zero (0) is sometimes mapped. This is why allowzero exists. However, it is also the case that other parts of the address range in any given space are unmapped. In such cases, those nonzero unmapped values should be candidates for being the null value, and they should be available for packing data into pointers in a type-safe manner.

As an example, on amdgcn it would be ideal for an optional pointer to have the same size as a non-optional pointer while using the value 0xFFFFFFFF for null.

Furthermore, it would be ideal for pointers to take up only the correct number of bits in a packed struct and allow bit packing when used as peers of align(0) fields in auto-layout structs.

Proposal

This proposal depends on new enum syntax for marking ranges of integer values illegal.

The x86_64-linux address space would be defined like this:

pub const Generic = enum(u64) {
    unreachable = 0x0000000000000...0x00007fefffffffff,
    unreachable = 0x1000000000000...0xffffffffffffffff,
    _,
};

On x86_64-freestanding it might instead be defined like this, since the pages at the beginning are mapped, but the hardware is still limited to 48 bits:

pub const Generic = enum(u64) {
    unreachable = 0x1000000000000...0xffffffffffffffff,
    _,
};

allowzero is no longer needed because it is communicated by the valid range of the enum.

Making value ranges unreachable leaves the language free to pack data into those unused integer values when constructing types such as optionals or error unions. It also means that @ptrFromInt gains an additional safety check, ensuring the value is in range. Notice that 0xaaaaaaaaaaaaaaaa is outside the valid pointer range on this very common triple.
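
As a rough illustration of that extra check, the two unreachable ranges from the x86_64-linux definition above can be written as an ordinary predicate in status-quo Zig. The name `isValidGenericAddress` is a hypothetical helper, not part of the proposal or of std, and the sketch assumes Zig 0.11 or newer.

```zig
const std = @import("std");

/// Hypothetical helper: true if `addr` lies outside both `unreachable`
/// ranges of the proposed x86_64-linux `Generic` address space.
fn isValidGenericAddress(addr: u64) bool {
    const in_low_gap = addr <= 0x00007fefffffffff;
    const in_high_gap = addr >= 0x1000000000000;
    return !in_low_gap and !in_high_gap;
}

test "the 0xAA fill pattern is rejected" {
    try std.testing.expect(!isValidGenericAddress(0xaaaaaaaaaaaaaaaa));
    try std.testing.expect(!isValidGenericAddress(0));
    try std.testing.expect(isValidGenericAddress(0x00007ff000000000));
}
```

Under the proposal this comparison would be generated by the compiler from the enum definition rather than written by hand.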

usize would be redefined as the tag type of the default address space. Pointers carry address space data, so by indexing into a slice in a given address space, the result location type of the element index (i.e. ptr[i]) would be the tag type of the respective address space.

This is almost sufficient to address the problem statement, however, we need well-defined memory layout for pointers, including null pointers. So, an additional part of this proposal is recognizing the tag null in an address space enum:

/// x86-64 example
pub const Generic = enum(u64) {
    null        = 0x0000000000000,
    unreachable = 0x0000000000001...0x00007fefffffffff,
    unreachable = 0x1000000000000...0xffffffffffffffff,
    _,
};

/// amdgcn example
pub const Local = enum(u32) {
    null = 0xffffffff,
    _,
};
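
For comparison, here is a minimal sketch of what has to be written by hand today to get a 4-byte "optional address" with 0xffffffff as the null value. `OptLocalAddr`, `none`, `init`, and `unwrap` are hypothetical names chosen for illustration; the sketch assumes Zig 0.11 or newer and models the address as a plain u32 rather than a real pointer.

```zig
const std = @import("std");

/// Hand-rolled stand-in for an optional pointer in the proposed amdgcn
/// `Local` address space. `none` plays the role of the proposed `null` tag.
const OptLocalAddr = enum(u32) {
    none = 0xffff_ffff,
    _,

    fn init(addr: ?u32) OptLocalAddr {
        const raw = addr orelse return .none;
        // 0xffffffff is reserved for "no address" and cannot also be a payload.
        std.debug.assert(raw != 0xffff_ffff);
        return @enumFromInt(raw);
    }

    fn unwrap(self: OptLocalAddr) ?u32 {
        return if (self == .none) null else @intFromEnum(self);
    }
};

test "address zero stays usable, null costs no extra bytes" {
    try std.testing.expect(@sizeOf(OptLocalAddr) == 4);
    try std.testing.expect(OptLocalAddr.init(0).unwrap().? == 0);
    try std.testing.expect(OptLocalAddr.init(null).unwrap() == null);
}
```

With the proposed `null` tag, `?*addrspace(Local) T` would presumably get this representation without any wrapper type.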

This also opens the door to automated bit-packing for auto-layout structs when pointers along with sibling fields use align(0):

struct {
    /// Note the alignment here is a property of the field, not the pointer
    ptr: *anyopaque align(0),
    flag_a: bool align(0),
    flag_b: bool align(0),
}

In this case, using the above x86_64-linux address space definition, it would be legal, but not required, for a Zig compiler to lower the struct with a memory layout that uses 8 bytes, packing the booleans into the unused integer value ranges. It also gives the compiler an opportunity to ensure that the 0xAA bit pattern is unambiguously detectable as an invalid state by safety checks.
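
One layout a compiler could pick can be spelled out manually in status-quo Zig with a packed struct. This is only a sketch of a possible lowering, not the layout the proposal mandates: `PtrAndFlags` and its field names are hypothetical, and it assumes the flags go into bits 48 and 49, which are always zero for addresses in the x86_64-linux range above.

```zig
const std = @import("std");

/// Hand-written equivalent of one layout the compiler would be allowed to choose.
const PtrAndFlags = packed struct(u64) {
    addr: u48, // low 48 bits: the address itself
    flag_a: bool, // bit 48
    flag_b: bool, // bit 49
    unused: u14 = 0, // remaining high bits
};

test "pointer plus two flags fits in 8 bytes" {
    const v = PtrAndFlags{ .addr = 0x7ff0_0000_1000, .flag_a = true, .flag_b = false };
    try std.testing.expect(@sizeOf(PtrAndFlags) == 8);
    try std.testing.expect(@as(u64, @bitCast(v)) == 0x0001_7ff0_0000_1000);
}
```

The difference under the proposal is that the field would keep its pointer type, so no manual conversion between u48 and a pointer would be needed.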

Each target would have a default pointer address space. Naming it explicitly in pointer syntax would be equivalent to omitting it; i.e., for x86_64-linux, *addrspace(Generic) T == *T.

Implementation Details

std.builtin.AddressSpace would change from an enum to something like this:

pub const AddressSpace = switch (target.cpu.arch) {
    .x86_64 => switch (target.os.tag) {
        .linux => struct {
            pub const Generic = enum(u64) {
                null        = 0x0000000000000,
                unreachable = 0x0000000000001...0x00007fefffffffff,
                unreachable = 0x1000000000000...0xffffffffffffffff,
                _,
            };
        },
        // ...
    },
    .amdgcn => struct {
        pub const Generic = enum(u64) {
            null = 0,
            _,
        };
        pub const Local = enum(u32) {
            null = 0xffffffff,
            _,
        };
        /// All address spaces mapped
        pub const Region = enum(u32) {
            _,
        };
        // ...
    },
    // ...
};

A Zig compiler would have hard-coded awareness of the address space names within this namespace and how to map them to e.g., an LLVM address space number.
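
A minimal sketch of what such a hard-coded mapping might look like for amdgcn, using the address space numbers documented for LLVM's AMDGPU backend. The `AmdgcnSpace` enum and `llvmAddressSpace` function are hypothetical names for illustration and do not reflect the actual compiler source.

```zig
const std = @import("std");

/// Hypothetical list of the names the compiler would recognize for amdgcn.
const AmdgcnSpace = enum { generic, global, region, local, constant, private };

/// Hypothetical lookup from a recognized name to its LLVM address space number.
fn llvmAddressSpace(space: AmdgcnSpace) u32 {
    return switch (space) {
        .generic => 0,
        .global => 1,
        .region => 2,
        .local => 3,
        .constant => 4,
        .private => 5,
    };
}

test "names map to fixed backend numbers" {
    try std.testing.expect(llvmAddressSpace(.generic) == 0);
    try std.testing.expect(llvmAddressSpace(.local) == 3);
}
```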

The address spaces would be user overridable in the root source file. This would be especially useful for a freestanding target.

cdurkin commented 4 weeks ago

As an embedded C engineer, I'm just as concerned about null pointer bugs as I am about bad pointers pointing to invalid memory locations, so if this allows pointer safety checks against the target memory map then I think it's a great idea 👍

rohlem commented 4 weeks ago

I like the non-intrusive optimization and debugging potential of this general idea. What's the value of declaring the practically-48-bit pointer as enum(u64) rather than enum(u48) and having the compiler auto-extend the value behind the scenes? I'm guessing it would ultimately be connected to the platform (currently C) ABI? (We could introduce extern address spaces to distinguish them from program-internal ones. Would custom address spaces be on the table? Somewhat similar idea to distinct primitive types.)

What would the decision graph look like for using align(0) vs not using it? Ideally I'd want the compiler to choose for me in most cases (outside of extern / ABI compliance). Authors (of Zig programs and of Zig libraries) may want to let their users benchmark either option. Currently I'm imagining that code with align(x) after every field would become a bit noisy.

andrewrk commented 4 weeks ago

> What's the value of declaring the practically-48-bit pointer as enum(u64) rather than enum(u48) and having the compiler auto-extend the value behind the scenes?

That could be a reasonable thing to do. Using u64 would make it a little bit closer to status quo. I think people might be surprised if @bitSizeOf(usize) == 48, but then again, that is accurate for that target.

> What would the decision graph look like for using align(0) vs not using it?

In many cases you need an aligned pointer in order to do anything. Consider all the functions in a given codebase that accept a *T. You would not be able to pass a sub-aligned pointer as such a parameter.

It's the same reason you would use 2 bools vs a bitmask. Choose between tighter storage, or fewer instructions to load and store the value. In a sense, it's the same decision as choosing how much compression to use when storing data.

Types have default alignment so that pointers to them can be used interchangeably and so that loads and stores generally correspond to a single machine instruction.
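
The trade-off can be seen in status-quo Zig by comparing two bools in an auto-layout struct against the same two bools in a packed struct. This is a small sketch assuming Zig 0.11 or newer; `Loose` and `Tight` are hypothetical names.

```zig
const std = @import("std");

const Loose = struct { a: bool, b: bool };
const Tight = packed struct { a: bool, b: bool };

test "storage size versus addressability" {
    // One byte per bool versus two bits in a single byte.
    try std.testing.expect(@sizeOf(Loose) == 2);
    try std.testing.expect(@bitSizeOf(Tight) == 2);
    try std.testing.expect(@sizeOf(Tight) == 1);

    var loose = Loose{ .a = true, .b = false };
    var tight = Tight{ .a = true, .b = false };

    // A field of the auto-layout struct yields a plain *bool.
    const p: *bool = &loose.b;
    p.* = true;

    // A bit-packed field yields a pointer type that carries a bit offset,
    // so it is a different type from *bool.
    try std.testing.expect(@TypeOf(&tight.b) != *bool);
    tight.b = true;

    try std.testing.expect(loose.b and tight.b);
}
```

A function taking `*bool` accepts `&loose.b` but not `&tight.b`, which is the interchangeability cost described above.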

ikskuh commented 4 weeks ago

This looks like a nice solution. One question: who decides, and how, what the default address space for data pointers and function pointers is? On Harvard architectures, these two kinds of objects reside in different address spaces, and introducing a common/shared space is a suboptimal solution.

alexrp commented 4 weeks ago

I think you're onto something here. I really like the idea of making allowzero a property of the address space, and the default address space a property of the target by default (with the option of being user-provided). I have thought before that the current design of allowzero isn't very user-friendly for freestanding developers because the vast majority of their pointers have to be adorned with it.

One major concern I have here is hardware pointer tagging features like Arm's Top Byte Ignore, Intel's Linear Address Masking, AMD's Upper Address Ignore, etc. When in use, these features make it so that, in general, you can no longer assume that the upper bits of user-space pointers are unimportant and discardable.

The trouble with these is that they're enabled dynamically at run time, by a syscall such as arch_prctl(). You could easily imagine an application innocently calling into, say, a JavaScript engine that opportunistically makes use of hardware pointer tagging when available, and things silently breaking on the Zig side because the pointer tag gets discarded.

Are we confident that we can implement sufficient safety checks for a user to be made aware that they need to define a custom generic address space when using hardware pointer tagging?
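
To make the failure mode concrete, here is a small sketch using plain integers. It assumes a tag stored in the top byte (as with Top Byte Ignore) and the 48-bit upper limit from the proposed x86_64 definitions; `addr_mask` and `fitsIn48Bits` are hypothetical names.

```zig
const std = @import("std");

/// Highest value the proposed 48-bit address space definitions allow.
const addr_mask: u64 = 0x0000_ffff_ffff_ffff;

/// Hypothetical stand-in for the range check the proposal adds.
fn fitsIn48Bits(addr: u64) bool {
    return addr <= addr_mask;
}

test "a top-byte tag pushes the value out of range" {
    const plain: u64 = 0x0000_7ff0_0000_1000;
    const tag: u64 = 0x2a;
    const tagged = plain | (tag << 56);

    try std.testing.expect(fitsIn48Bits(plain));
    // With a safety check, the tagged value is reported as invalid.
    try std.testing.expect(!fitsIn48Bits(tagged));

    // Without one, storing only the low 48 bits keeps the address
    // but silently drops the tag.
    const truncated = tagged & addr_mask;
    try std.testing.expect(truncated == plain);
    try std.testing.expect((truncated >> 56) == 0);
}
```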

> This also opens the door to automated bit-packing for auto-layout structs when pointers along with sibling fields use align(0):

I'm currently very confused about whether, in general, align(0) is supposed to mean "default ABI alignment, as if align wasn't used" or "no alignment whatsoever, the compiler can go crazy". Sema doesn't seem to agree with itself on this in all cases. We also have no tests and no docs for this.

> This is almost sufficient to address the problem statement, however, we need well-defined memory layout for pointers, including null pointers. So, an additional part of this proposal is recognizing the tag null in an address space enum:

Is the implication that if a null tag is not defined, the address space just doesn't have a notion of null pointers, and the compiler must be able to deal with this? (Presumably by using fat pointers for ?*T.)
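
Status-quo Zig already behaves this way for types with no spare value, which gives a feel for the cost. The sketch below uses a non-exhaustive enum as a stand-in for the proposed amdgcn Region space; `RegionAddr` is a hypothetical name, and the size comparison reflects how current compilers lay out optionals rather than anything the proposal specifies.

```zig
const std = @import("std");

/// Stand-in for the proposed amdgcn `Region` space: every u32 is a valid address.
const RegionAddr = enum(u32) { _ };

test "no spare value means the optional grows" {
    try std.testing.expect(@sizeOf(RegionAddr) == 4);
    // The optional needs a separate "is it set" flag, so it is larger.
    try std.testing.expect(@sizeOf(?RegionAddr) > @sizeOf(RegionAddr));
    // Status-quo pointers reserve address zero, so their optional is free.
    try std.testing.expect(@sizeOf(?*u8) == @sizeOf(*u8));
}
```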

mlugg commented 4 weeks ago

There are some interesting discussions going on, but I would also like to add a quick bikeshed. Rather than the unreachable = ... syntax you've proposed, I think it makes more sense to specify this in reverse, i.e. define which ranges are valid. This can be done with IMO a much more elegant syntax, by associating a range of values with _:

pub const Generic = enum(u64) {
    _ = 0x00007ff0_00000000...0x0000ffff_ffffffff,
};

And you can specify multiple valid ranges by writing _ multiple times. For a random fictional architecture:

pub const Something = enum(u64) {
    null = 0xffffffff_ffffffff,
    _ = 0x00000000_00000000...0x00000000_ffffffff,
    _ = 0x10000000_00000000...0x10000000_ffffffff,
};

EDIT: to be clear, _ with no given value range retains its existing meaning of "all remaining backing values".
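
The single valid range in the first example is the complement of the two unreachable ranges in the original x86_64-linux definition. A quick sketch in status-quo Zig checks that the two spellings agree at the boundaries; `outsideUnreachable` and `insideValid` are hypothetical helper names.

```zig
const std = @import("std");

/// Original spelling: a value is allowed if it is in neither unreachable range.
fn outsideUnreachable(addr: u64) bool {
    const in_low_gap = addr <= 0x00007fefffffffff;
    const in_high_gap = addr >= 0x1000000000000;
    return !in_low_gap and !in_high_gap;
}

/// Inverted spelling: a value is allowed if it is in the one valid range.
fn insideValid(addr: u64) bool {
    return addr >= 0x00007ff0_00000000 and addr <= 0x0000ffff_ffffffff;
}

test "both spellings agree at the boundaries" {
    const samples = [_]u64{
        0,
        0x00007fef_ffffffff,
        0x00007ff0_00000000,
        0x0000ffff_ffffffff,
        0x00010000_00000000,
        0xffffffff_ffffffff,
    };
    for (samples) |addr| {
        try std.testing.expect(outsideUnreachable(addr) == insideValid(addr));
    }
}
```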

Snektron commented 3 weeks ago

> This looks like a nice solution. One question: who decides, and how, what the default address space for data pointers and function pointers is? On Harvard architectures, these two kinds of objects reside in different address spaces, and introducing a common/shared space is a suboptimal solution.

Semantic analysis already has a notion of "default address space in a particular context". The namespace returned by the switches in the original proposal could be required to provide a set of common declarations, which the compiler can then use at a particular location. For example:

pub const AddressSpace = switch (target.cpu.arch) {
    .x86_64 => switch (target.os.tag) {
        .linux => struct {
            // Default used for variables
            pub const Data = Generic;
            // Default used for constants
            pub const Constant = Generic;
            // Default used for functions
            pub const Code = Generic;

            // Architecture specific...
            pub const Generic = enum(u64) { ... };
        },
        // ...
    },
    .amdgcn => struct {
        // Variables are instance-local by default.
        pub const Data = Private;
        pub const Constant = ...;
        // We can provide a nicer error message than "expected type 'builtin.AddressSpace', found '@TypeOf(.enum_literal)'"
        pub const Code = @compileError("this architecture doesn't support function pointers");

        // Architecture specific...
        pub const Flat = enum(u40) { ... };
        pub const Private = ...;
        // ...
    },
    .avr => struct {
        pub const Data = Ram;
        pub const Constant = Flash;
        pub const Code = Flash;
        // ...
    },
    // ...
};

I think *T would then be the same as *addrspace(.Data) T since that's usually what's intended, but I'm not 100% sure if that's correct. Perhaps it makes sense to explicitly set a Default?

Snektron commented 3 weeks ago

> Rather than the unreachable = ... syntax you've proposed, I think it makes more sense to specify this in reverse, i.e. define which ranges are valid. This can be done with IMO a much more elegant syntax, by associating a range of values with _:

I think this is a decent proposal in itself. I wonder if there is some more general synergy here with ranged ints. For example:

const X = enum {
  a = 0x00 ... 0xFF,
};

switch (x) {
  .a => |a| ..., // `a` is a ranged int 0x00 ... 0xFF
}
