parse inline assembly syntax according to a set of dialects; integrate inline assembly more closely with the zig language

andrewrk commented 2 years ago

Currently we have this situation:

stage1: Inline assembly is a comptime-known string that can be built with expressions such as ++.
stage2: Inline assembly must be a string literal. This is in preparation for this proposal, and here it is.

Here's one example of what inline assembly looks like today, for x86_64:

argc_argv_ptr = asm volatile (
    \\ xor %%rbp, %%rbp
    : [argc] "={rsp}" (-> [*]usize),
);

This proposal is to introduce the concept of dialects. As a first pass, the set of dialects would be exactly the std.Target.Cpu.Arch enum. But it's likely that some dialects would be shared by multiple architectures. For example, x86 and x86_64 would probably share the x86 dialect. So we will have a separate enum for dialects.
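To illustrate the split between architectures and dialects, here is a minimal sketch. Everything in it is an assumption for illustration: `Arch` is a trimmed, hypothetical stand-in for `std.Target.Cpu.Arch`, and the `Dialect` enum and its `fromArch` function are not part of the proposal or the compiler.

```zig
const std = @import("std");

// Hypothetical stand-in for std.Target.Cpu.Arch, trimmed to a few entries
// so the sketch stays self-contained.
const Arch = enum { x86, x86_64, aarch64, riscv64, wasm32 };

// Hypothetical dialect enum: several architectures may map to one dialect.
const Dialect = enum {
    x86,
    aarch64,
    riscv,
    wasm,

    // Picks the default dialect for an architecture.
    fn fromArch(arch: Arch) Dialect {
        return switch (arch) {
            .x86, .x86_64 => .x86,
            .aarch64 => .aarch64,
            .riscv64 => .riscv,
            .wasm32 => .wasm,
        };
    }
};

test "x86 and x86_64 share the x86 dialect" {
    try std.testing.expectEqual(Dialect.x86, Dialect.fromArch(.x86));
    try std.testing.expectEqual(Dialect.x86, Dialect.fromArch(.x86_64));
}
```

Keeping the mapping in one exhaustive `switch` means that adding an architecture to the enum forces a decision about which dialect it uses.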

A dialect is specified as an identifier token (it must be an identifier) directly after the asm keyword, before the volatile keyword if any, and it tells how to parse the assembly syntax:

const argc_argv_ptr: [*]usize = asm x86 volatile {
    xor rbp, rbp  // zig-style comments for all dialects
    break rsp // we can make up our own syntax too for integration with zig language
};

I made some other changes here for fun but that's outside the scope of this proposal; this proposal is pointing out that we change the ( ) to braces and inside there is not a string literal but syntax that is more closely integrated with the zig language.

The tokenizer is shared between Zig syntax and all dialects. One tokenizer to rule them all.

The dialect tells the parser how to parse what is inside the braces. You can imagine how x86 is parsed in a drastically different manner than WebAssembly or SPIR-V.

Rather than the burden of parsing inline assembly falling on the backend, it falls on the frontend, where it is properly cached and it is easier to report errors. This also provides a way to unify inline assembly across multiple backends; for example right now we send inline assembly straight to LLVM with the LLVM backend, but we have our own bespoke parser in the x86_64 backend. This is a design flaw because we need to have consistent inline assembly syntax between the two backends; we need to parse it in a prior phase of the pipeline and then lower it to x86_64 MIR, or LLVM inline assembly.

ghost commented 2 years ago

While I do like the idea of getting all the syntax checks in the frontend, I can’t help but be nervous at the vast range of assembly syntaxes out there. I’m thinking of bash using ) for switch cases, or vimscript using " for comments; there will be some assembly out there which doesn’t play nice with the parser, no matter what we do. This of course gets exponentially hairier when we want some custom Zig syntax in there as well. If these issues can be solved universally, I’m all on board; though I would like to request that we have a clean solution for multiple return values, and we don’t need IN, OUT, INOUT, LATEOUT, INLATEOUT to appease the optimiser (ideally we don’t declare output registers at all). Prior art is #5241.

ghost commented 2 years ago

Oh, actually, here’s another thing: we run into problems if we use anything other than local registerlikes as operands. We could say global symbols are accessed as labels.

But that’s the limit of Zig integration I’m comfortable with. Anything else collides with the syntax of a not-too-obscure asm. Which does beg the question of how to handle outputs: we can’t break val, since that conflicts with AVR; we can’t add a sigil or something to specify “this is Zig, not asm”, since every sigil is taken; really the only solution I can think of is the post expression from #5241, which may cause problems if there are symbols in scope that clash with register names/condition codes. I really don’t know.

ifreund commented 2 years ago

> The tokenizer is shared between Zig syntax and all dialects. One tokenizer to rule them all.

This is the one thing about this proposal that I'm iffy on. The best path forward I see while keeping the tokenizer shared is to never allow asm dialects to add new Zig tokens. Dialects would just have to make do with the tokens we use for the actual zig language, which I think should be sufficient for us to come up with decent syntax for any future asm variant. This would force greater divergence between zig asm dialects and the corresponding "standard" asm syntax if the "standard" syntax does not map well to existing Zig tokens.

Alternatively, we could skip over the inline asm specific source during "main" zig tokenization, for example by requiring that all asm dialects disallow unmatched curly braces and skipping over bytes until the closing curly brace of the asm block is found. Then during parsing we could tokenize the contents of the inline asm block depending on the dialect, potentially allowing for syntax much closer to "standard" for the target and therefore more familiar to people new to Zig but not new to assembly. I think there are other benefits of keeping the syntax used for inline asm more "standard" as well.

The only disadvantage I see to that approach is that tokenization would no longer be line independent. However, I don't think there is any strong technical advantage to having line independent tokenization. The strongest benefit I see is that it makes Zig code subjectively easier for me and other humans to mentally parse. Keeping the Zig grammar simple in general helps with efficient/high quality tooling such as syntax highlighting, but I have yet to find any case where this specific property of line independent tokenization makes a difference.
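The skip-ahead alternative described above can be sketched as follows. This is a minimal sketch, not compiler code: `findAsmBlockEnd` is a hypothetical helper, the `hexagon` dialect name in the test is made up, and braces inside comments or string literals are ignored for brevity.

```zig
const std = @import("std");

/// Given the index of the `{` that opens an asm block, returns the index of
/// the matching `}`, or null if the block is never closed.
/// Assumes the dialect forbids unmatched braces inside the block.
fn findAsmBlockEnd(source: []const u8, open_index: usize) ?usize {
    std.debug.assert(source[open_index] == '{');
    var depth: usize = 0;
    var i: usize = open_index;
    while (i < source.len) : (i += 1) {
        switch (source[i]) {
            '{' => depth += 1,
            '}' => {
                // depth is at least 1 here because the scan starts on a '{'
                // and returns as soon as the count drops back to zero.
                depth -= 1;
                if (depth == 0) return i;
            },
            else => {},
        }
    }
    return null;
}

test "nested braces are skipped as part of the block" {
    const src = "asm hexagon { { r0 = r1 } }; const x = 1;";
    const open = std.mem.indexOfScalar(u8, src, '{').?;
    const close = findAsmBlockEnd(src, open).?;
    try std.testing.expectEqualStrings("{ { r0 = r1 } }", src[open .. close + 1]);
}
```

The main tokenizer would emit the byte range between the two braces as a single opaque token, and a dialect-specific tokenizer would run over that range later. Because the closing brace can be many lines away, this is exactly where line independence is lost.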

ghost commented 2 years ago

> Alternatively, we could skip over the inline asm specific source during "main" zig tokenization, for example by requiring that all asm dialects disallow unmatched curly braces and skipping over bytes until the closing curly brace of the asm block is found.

Then this could be accomplished cleanly with the current string literal solution, with the additional advantage of maintaining line independence. I think this is strictly dominated by status quo and therefore not worth considering as an option.

ifreund commented 2 years ago

> > Alternatively, we could skip over the inline asm specific source during "main" zig tokenization, for example by requiring that all asm dialects disallow unmatched curly braces and skipping over bytes until the closing curly brace of the asm block is found.
>
> Then this could be accomplished cleanly with the current string literal solution, with the additional advantage of maintaining line independence. I think this is strictly dominated by status quo and therefore not worth considering as an option.

Status quo is asm being parsed in the backends after semantic analysis. What I'm proposing would keep parsing and tokenization of inline asm in the front end, but tokenizing inline asm differently per dialect instead of the same way as normal zig code.

ghost commented 2 years ago

Regardless of how this is implemented, it would introduce pretty tight coupling between the parser and an uncertain number of past and future assembly languages. I understand the benefits of course, but still, is this really in scope for the project? Zig is still intended to be a reasonably simple and reasonably portable language, right?

Vexu commented 2 years ago

What if we tried to create our own generic assembly syntax? Something like:

AsmExpr <- KEYWORD_asm KEYWORD_volatile? LBRACE (Data / Directive / Instruction)+ RBRACE

Data <- (IDENTIFIER COLON)? SIGIL IDENTIFIER Expr SEMICOLON

Directive <- DOT IDENTIFIER (DirectiveOperand (COMMA DirectiveOperand)*)?
DirectiveOperand <- IDENTIFIER / STRING_LITERAL / INTEGER

Instruction <- (IDENTIFIER COLON)? (Result EQUAL)? OpCode (Operand (COMMA Operand)*)?
Result <- ?
OpCode <- IDENTIFIER (DOT IDENTIFIER)*
Operand <- ?

It likely wouldn't be a perfect match for every target assembly and might require a bit of tweaking, but it would keep the tokenizer and parser simple and target agnostic.
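As a small illustration of how target agnostic such a grammar can stay, here is a sketch of a recognizer for the `OpCode <- IDENTIFIER (DOT IDENTIFIER)*` rule alone. It works on raw bytes rather than on tokens, which is a simplification, and `isOpCode` and its helpers are hypothetical names, not anything from the compiler.

```zig
const std = @import("std");

fn isIdentStart(c: u8) bool {
    return (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or c == '_';
}

fn isIdentChar(c: u8) bool {
    return isIdentStart(c) or (c >= '0' and c <= '9');
}

/// Returns true if `text` matches OpCode <- IDENTIFIER (DOT IDENTIFIER)*.
fn isOpCode(text: []const u8) bool {
    var i: usize = 0;
    while (true) {
        // IDENTIFIER: one start character followed by any identifier characters.
        if (i >= text.len or !isIdentStart(text[i])) return false;
        i += 1;
        while (i < text.len and isIdentChar(text[i])) : (i += 1) {}
        // Either the input ends here, or a DOT must introduce another IDENTIFIER.
        if (i == text.len) return true;
        if (text[i] != '.') return false;
        i += 1;
    }
}

test "dotted opcodes are accepted" {
    try std.testing.expect(isOpCode("xor"));
    try std.testing.expect(isOpCode("mov.u32"));
    try std.testing.expect(!isOpCode("mov."));
}
```

The same rule covers a plain x86 mnemonic and a PTX-style `mov.u32` without any per-target branches.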

natanalt commented 2 years ago

I feel like a generic assembly syntax would either make a lot of people unhappy because of differences from "standard" assembly syntaxes, or require the grammar to get quite complex and end up including a lot of cases that would just be unused by many backends.

I think that a solution could be to just generally treat inline assembly code as an arbitrary set of tokens between curly braces, to be properly parsed by each backend individually, as was suggested above. Sure, at this point we're sort of close to just using string literals for code, but those would have to be parsed anyway, so why not just use a shared tokenizer across the compiler?

gwenzek commented 2 years ago

One advantage of raw strings is that they really expand what you can do in userland.

I was able to implement a PTX backend (Nvidia GPU "assembly") by using inline assembly to emit PTX snippets directly. The assembly is first translated to LLVM inline assembly, then directly to PTX. This allowed me to expose PTX intrinsics in a library: https://github.com/gwenzek/cudaz/blob/68638782c52035c572af61db346082732ffb7014/CS344/src/kernel_utils.zig#L103

I'm pretty sure PTX wouldn't have been on the radar of a built-in assembly parser, and I would have needed much more complex modifications to Zig.

So string-based assembly should stay an option.

metroidchild commented 1 year ago

The main thing I would ask for is to rename "dialect" to "family" and make each subset the actual dialect. This way linters can more easily catch syntax errors up front, and adding new dialects would become less of a hassle.

Additionally I believe a special "raw" family should exist to signify we still want to use string literals, for the cases where a specific ISA or dialect doesn't have support yet.

> I was able to implement a PTX backend (Nvidia GPU "assembly") by using inline assembly to emit PTX snippets directly. The assembly is first translated to LLVM inline assembly, then directly to PTX. This allowed me to expose PTX intrinsics in a library: https://github.com/gwenzek/cudaz/blob/68638782c52035c572af61db346082732ffb7014/CS344/src/kernel_utils.zig#L103
>
> I'm pretty sure PTX wouldn't have been on the radar of a built-in assembly parser, and I would have needed much more complex modifications to Zig.
>
> So string-based assembly should stay an option.

I believe this is already partially addressed in #9514, where PTX would belong to the LLVM IR family, making the creation of its dialect much easier.

Whatever the case, taking your linked code as an example:

var ctaid = asm volatile ("mov.u32 \t%[r], %ctaid.z;"
    : [r] "=r" (-> utid),
);
return ctaid;

Would initially only slightly change to something like:

// not sure what a more reasonable syntax for this would be
var ctaid = asm raw volatile {
    "mov.u32 \t%[r], %ctaid.z;"
        : [r] "=r" (-> utid),
};
return ctaid;

And if the PTX dialect is ever added to the LLVM family, it might look like:

// reuse label syntax for generic register mangling?
var ctaid: utid = asm llvm.ptx volatile {
    mov.u32 :r, ctaid.z
    break :r
};
return ctaid;
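
The `family.dialect` specifier used in the last snippet (`llvm.ptx`) could be split by the frontend roughly as below. This is only a sketch under assumptions: `AsmSpecifier` and `parseSpecifier` are hypothetical names, and the rule that a bare name such as `raw` or `x86` is a family with no explicit dialect is my reading of the idea, not something settled in this thread.

```zig
const std = @import("std");

/// Hypothetical result of parsing the specifier that follows `asm`.
const AsmSpecifier = struct {
    family: []const u8,
    dialect: ?[]const u8,
};

/// Splits `family` or `family.dialect` at the first dot.
/// Returns null for malformed input such as "", ".ptx", or "llvm.".
fn parseSpecifier(text: []const u8) ?AsmSpecifier {
    if (text.len == 0) return null;
    const dot = std.mem.indexOfScalar(u8, text, '.') orelse
        return AsmSpecifier{ .family = text, .dialect = null };
    const family = text[0..dot];
    const dialect = text[dot + 1 ..];
    if (family.len == 0 or dialect.len == 0) return null;
    return AsmSpecifier{ .family = family, .dialect = dialect };
}

test "llvm.ptx splits into family and dialect" {
    const spec = parseSpecifier("llvm.ptx").?;
    try std.testing.expectEqualStrings("llvm", spec.family);
    try std.testing.expectEqualStrings("ptx", spec.dialect.?);
}
```

A linter could then check the family first and report an unknown dialect as a separate, more specific error.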

I had many more ideas on potential Zig ASM conventions outside of this tiny snippet, but that's off topic.

alexrp commented 3 months ago

Some random thoughts on this in no particular order:

- Per zig zen ("avoid local maximums"), I think we should be highly opinionated on syntax, and break with all legacy. Make the syntax as regular as possible across dialects unless there's a very compelling reason not to. If there's one thing I've learned from my recent work on Zig, it's that there are far too damn many assembly syntaxes that are superficially different for no real reason. If you have to work on 5+ different architectures, good luck memorizing the syntax for comments and literals across all of them, for example. Anyone who really wants the legacy syntax can be served by #21169.
  - Some food for thought on this point: Hexagon has an assembly language that is radically different from the ones you're all probably used to; it reads almost like a high-level language. Register assignments use =, some instructions look like function calls, some operations can be expressed with binary operators like +, etc. Here's a sample of how it looks.
- We need to figure out how to deal with labels and branches. Due to the way some architectures work (hardware loops, limited LL/SC sequence length, etc), this has to be part of the inline assembly syntax; we can't just tell people to split their asm expression into multiple asm expressions and rely on Zig-level branching constructs, because the compiler could do whatever it wants in between those asm expressions.
- Should we allow definition of global symbols in inline assembly? I lean towards no because that's a complexity rabbit hole, but the reality is that we do have cases like this one in the standard library. It may just be the case that this will have to be moved to a separate assembly file after #21169.
- Some architectures allow VLIW "instruction packets". For example, in Hexagon assembly language, you can write { insn1[, ..., insn4] } (i.e. 1-4 instructions) and these will be executed in parallel. This must be representable in Zig's inline assembly. I don't think this will be particularly hard; you could imagine something like packet { ...insns... }.
- There needs to be some kind of option manipulation syntax. For example, on RISC-V, we have to disable gp relaxation for a bit to actually initialize gp. I think it would be fine if this looked something like option !relax { ...insns... }.
- There needs to be syntax to embed both raw data and symbol addresses in the instruction stream. We make extensive use of this here and here, for example. Something like a data <type> <expression> pseudo-instruction, perhaps.
- There needs to be an align pseudo-instruction. See here for just one example that requires it, but there are countless other reasons that one might want to align an instruction on some unusual boundary.
- We need to figure out what we want to do about delay slots. Hopefully we can all agree that the way they work in GNU as is terrible. I don't have a concrete suggestion here yet.
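
To make the constructs in the list above a bit more concrete, here is one way a frontend might represent them once parsed. This is purely a sketch under assumptions: the `Item` union, its field names, and `countInstructions` are all hypothetical, instructions and expressions are kept as plain strings, and labels and delay slots are left out because no syntax has been suggested for them yet.

```zig
const std = @import("std");

/// Hypothetical parsed form of the contents of an asm block.
const Item = union(enum) {
    /// A single instruction, kept as raw text in this sketch.
    instruction: []const u8,
    /// VLIW packet: instructions that execute in parallel.
    packet: []const Item,
    /// `option !relax { ... }` would be name = "relax", enabled = false.
    option: struct { name: []const u8, enabled: bool, body: []const Item },
    /// `data <type> <expression>` pseudo-instruction.
    data: struct { type_name: []const u8, expr: []const u8 },
    /// `align` pseudo-instruction, in bytes.
    alignment: u32,
};

/// Counts real instructions, descending into packets and option blocks.
fn countInstructions(items: []const Item) usize {
    var n: usize = 0;
    for (items) |item| {
        switch (item) {
            .instruction => n += 1,
            .packet => |insns| n += countInstructions(insns),
            .option => |opt| n += countInstructions(opt.body),
            .data, .alignment => {},
        }
    }
    return n;
}

test "option block wrapping the gp initialization" {
    const gp_init = [_]Item{.{ .instruction = "la gp, __global_pointer$" }};
    const items = [_]Item{
        .{ .option = .{ .name = "relax", .enabled = false, .body = &gp_init } },
        .{ .alignment = 4 },
    };
    try std.testing.expectEqual(@as(usize, 1), countInstructions(&items));
}
```

Modeling packets and option blocks as nested item lists keeps them regular across dialects, which fits the goal of one opinionated syntax.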

I'll probably think of other stuff to add here later...

ziglang / zig

parse inline assembly syntax according to a set of dialects; integrate inline assembly more closely with the zig language #10761