Open andrewrk opened 2 years ago
While I do like the idea of getting all the syntax checks in the frontend, I can’t help but be nervous at the vast range of assembly syntaxes out there. I’m thinking of bash using ) for switch cases, or vimscript using “ for comments; there will be some assembly out there which doesn’t play nice with the parser, no matter what we do. This of course gets exponentially hairier when we want some custom Zig syntax in there as well. If these issues can be solved universally, I’m all on board; though I would like to request that we have a clean solution for multiple return values, and we don’t need IN, OUT, INOUT, LATEOUT, INLATEOUT to appease the optimiser (ideally we don’t declare output registers at all). Prior art is #5241.
Oh, actually, here’s another thing: we run into problems if we use anything other than local registerlikes as operands. We could say global symbols are accessed as labels.
But that’s the limit of Zig integration I’m comfortable with. Anything else collides with the syntax of a not-too-obscure asm. Which does beg the question of how to handle outputs: we can’t break val, since that conflicts with AVR; we can’t add a sigil or something to specify “this is Zig, not asm”, since every sigil is taken; really the only solution I can think of is the post expression from #5241, which may cause problems if there are symbols in scope that clash with register names/condition codes. I really don’t know.
The tokenizer is shared between Zig syntax and all dialects. One tokenizer to rule them all.
This is the one thing about this proposal that I'm iffy on. The best path forward I see while keeping the tokenizer shared is to never allow asm dialects to add new Zig tokens. Dialects would just have to make do with the tokens we use for the actual zig language, which I think should be sufficient for us to come up with decent syntax for any future asm variant. This would force greater divergence between zig asm dialects and the corresponding "standard" asm syntax if the "standard" syntax does not map well to existing Zig tokens.
Alternatively, we could skip over the inline asm specific source during "main" zig tokenization, for example by requiring that all asm dialects disallow unmatched curly braces and skipping over bytes until the closing curly brace of the asm block is found. Then during parsing we could tokenize the contents of the inline asm block depending on the dialect, potentially allowing for syntax much closer to "standard" for the target and therefore more familiar to people new to Zig but not new to assembly. I think there are other benefits of keeping the syntax used for inline asm more "standard" as well.
The only disadvantage I see to that approach is that tokenization would no longer be line independent. However, I don't think there is any strong technical advantage to having line independent tokenization. The strongest benefit I see is that it makes Zig code subjectively easier for me and other humans to mentally parse. Keeping the Zig grammar simple in general helps with efficient/high quality tooling such as syntax highlighting, but I have yet to find any case where this specific property of line independent tokenization makes a difference.
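To make the "skip until the closing curly brace" idea concrete, here is a minimal Python sketch of what the main tokenizer would have to do. It assumes, as suggested above, that every dialect disallows unmatched curly braces; the function name skip_asm_block and the sample asm body are hypothetical and not part of any real Zig implementation. It also shows why tokenization stops being line independent: the scan has to carry a nesting depth across lines.

```python
def skip_asm_block(src: str, open_idx: int) -> int:
    """Return the index just past the '}' that matches the '{' at open_idx.

    Assumption: dialects never contain unmatched braces, so counting
    nesting depth is enough (no awareness of strings or comments).
    """
    if src[open_idx] != "{":
        raise ValueError("expected '{' at open_idx")
    depth = 0
    for i in range(open_idx, len(src)):
        ch = src[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unterminated asm block")


# Hypothetical source: a dialect identifier after `asm`, then a braced body.
source = "asm x86 volatile { packet { nop } ret } + 1"
start = source.index("{")
end = skip_asm_block(source, start)
body = source[start + 1 : end - 1]  # raw bytes handed to the dialect tokenizer
rest = source[end:]                 # normal Zig tokenization resumes here
```

The main tokenizer would emit the body as a single opaque token, and the dialect-specific tokenizer would only run on it later, during parsing.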
Alternatively, we could skip over the inline asm specific source during "main" zig tokenization, for example by requiring that all asm dialects disallow unmatched curly braces and skipping over bytes until the closing curly brace of the asm block is found.
Then this could be accomplished cleanly with the current string literal solution, with the additional advantage of maintaining line independence. I think this is strictly dominated by status quo and therefore not worth considering as an option.
Status quo is asm being parsed in the backends after semantic analysis. What I'm proposing would keep parsing and tokenization of inline asm in the front end, but tokenizing inline asm differently per dialect instead of the same way as normal zig code.
Regardless of how this is implemented, it would introduce pretty tight coupling between the parser and an uncertain number of past and future assembly languages. I understand the benefits of course, but still, is this really in scope for the project? Zig is still intended to be a reasonably simple and reasonably portable language, right?
What if we tried to create our own generic assembly syntax? Something like:
AsmExpr <- KEYWORD_asm KEYWORD_volatile? LBRACE (Data / Directive / Instruction)+ RBRACE
Data <- (IDENTIFIER COLON)? SIGIL IDENTIFIER Expr SEMICOLON
Directive <- DOT IDENTIFIER (DirectiveOperand (COMMA DirectiveOperand)*)?
DirectiveOperand <- IDENTIFIER / STRING_LITERAL / INTEGER
Instruction <- (IDENTIFIER COLON)? (Result =)? OpCode (Operand (COMMA Operand)*)?
Result <- ?
OpCode <- IDENTIFIER (DOT IDENTIFIER)*
Operand <- ?
It likely wouldn't be a perfect match for every target assembly and might require a bit of tweaking, but it would keep the tokenizer and parser simple and target agnostic.
I feel like a generic assembly syntax would either make a lot of people unhappy because of differences from "standard" assembly syntaxes, or require the grammar to get quite complex and end up including a lot of cases that would just be unused by many backends.
I think that a solution could be to just generally treat inline assembly code as an arbitrary set of tokens between curly braces, to be properly parsed by each backend individually, like was suggested above. Sure, at this point we're sort of close to just using string literals for code, but those would have to be parsed anyway, so why not just use a shared tokenizer across the compiler?
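As a rough illustration of "one shared tokenizer, parsing left to each backend", here is a small Python sketch. Everything in it is an assumption made for the example: the token set, the two toy dialect rules (a comma-and-semicolon style labelled x86 and a stack style labelled wasm), and the names tokenize, parse_x86, parse_wasm and parse_asm. It is not how the Zig compiler works today.

```python
import re

WS_RE = re.compile(r"\s*")
# One token set for every dialect: identifiers, integers, punctuation.
TOKEN_RE = re.compile(
    r"(?P<ident>[A-Za-z_]\w*)|(?P<int>\d+)|(?P<punct>[{}(),;:=+\-%\[\].])"
)


def tokenize(src):
    """Shared tokenizer: returns a list of (kind, text) pairs."""
    tokens, pos = [], 0
    while True:
        pos = WS_RE.match(src, pos).end()
        if pos == len(src):
            return tokens
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()


def parse_x86(tokens):
    """Toy rule: `mnemonic operand, operand;` per instruction."""
    insns, current = [], []
    for kind, text in tokens:
        if text == ";":
            if current:
                insns.append((current[0], current[1:]))
            current = []
        elif text == ",":
            continue
        elif not current and kind != "ident":
            raise SyntaxError(f"expected mnemonic, got {text!r}")
        else:
            current.append(text)
    if current:
        raise SyntaxError("missing ';' after last instruction")
    return insns


def parse_wasm(tokens):
    """Toy rule: each identifier starts an instruction; integers are immediates."""
    insns = []
    for kind, text in tokens:
        if kind == "ident":
            insns.append((text, []))
        elif kind == "int" and insns:
            insns[-1][1].append(text)
        else:
            raise SyntaxError(f"unexpected token {text!r}")
    return insns


# Each backend registers its own parser; the tokenizer is never duplicated.
DIALECT_PARSERS = {"x86": parse_x86, "wasm": parse_wasm}


def parse_asm(dialect, body):
    return DIALECT_PARSERS[dialect](tokenize(body))
```

For example, parse_asm("x86", "mov rax, 60; syscall;") and parse_asm("wasm", "i32_const 1 i32_add") go through the same tokenize call and only diverge in how the token list is grouped into instructions.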
One of the advantages of raw strings is that they really expand what you can do in userland.
I was able to implement a PTX backend (Nvidia GPU "assembly") by using inline assembly to emit PTX snippets directly. The assembly is first translated to LLVM inline assembly, then directly to PTX. This allowed me to expose PTX intrinsics in a library: https://github.com/gwenzek/cudaz/blob/68638782c52035c572af61db346082732ffb7014/CS344/src/kernel_utils.zig#L103
I'm pretty sure PTX wouldn't have been on the radar of a built-in assembly parser, and I would have needed much more complex modifications to Zig.
So string-based assembly should stay an option.
The main thing I would ask for is to change the wording of dialect into "family", and make each subset the actual dialect. This way linters can more easily catch syntax errors up front, and adding new dialects would become less of a hassle.
Additionally I believe a special "raw" family should exist to signify we still want to use string literals, for the cases where a specific ISA or dialect doesn't have support yet.
I believe the PTX use case is already partially addressed in #9514, where PTX would belong to the LLVM IR family, making the creation of its dialect much easier.
Whatever the case, taking your linked code as an example:
var ctaid = asm volatile ("mov.u32 \t%[r], %ctaid.z;"
    : [r] "=r" (-> utid),
);
return ctaid;
Would initially only slightly change to something like:
// not sure what a more reasonable syntax for this would be
var ctaid = asm raw volatile {
    "mov.u32 \t%[r], %ctaid.z;"
    : [r] "=r" (-> utid),
};
return ctaid;
And if the PTX dialect is ever added to the LLVM family, it might look like:
// reuse label syntax for generic register mangling?
var ctaid: utid = asm llvm.ptx volatile {
    mov.u32 :r, ctaid.z
    break :r
};
return ctaid;
I had many more ideas on potential Zig ASM conventions outside of this tiny snippet, but that's off topic.
Some random thoughts on this in no particular order:
[...] zig zen ("avoid local maximums"), I think we should be highly opinionated on syntax, and break with all legacy. Make the syntax as regular as possible across dialects unless there's a very compelling reason not to. If there's one thing I've learned from my recent work on Zig, it's that there are far too damn many assembly syntaxes that are superficially different for no real reason. If you have to work on 5+ different architectures, good luck memorizing the syntax for comments and literals across all of them, for example. Anyone who really wants the legacy syntax can be served by #21169.
[...] =, some instructions look like function calls, some operations can be expressed with binary operators like +, etc. Here's a sample of how it looks.
[...] asm expression into multiple asm expressions and rely on Zig-level branching constructs, because the compiler could do whatever it wants in between those asm expressions.
[...] { insn1[, ..., insn4] } (i.e. 1-4 instructions) and these will be executed in parallel. This must be representable in Zig's inline assembly. I don't think this will be particularly hard; you could imagine something like packet { ...insns... }.
[...] gp relaxation for a bit to actually initialize gp. I think it would be fine if this looked something like option !relax { ...insns... }.
[...] data <type> <expression> pseudo-instruction, perhaps.
[...] align pseudo-instruction. See here for just one example that requires it, but there are countless other reasons that one might want to align an instruction on some unusual boundary.
[...] as is terrible. I don't have a concrete suggestion here yet.
I'll probably think of other stuff to add here later...
Currently we have this situation:
[...] ++.
Here's one example of what inline assembly looks like today, for x86_64:
This proposal is to introduce the concept of dialects. As a first pass, the set of dialects would be exactly the std.Target.Cpu.Arch enum. But it's likely that some dialects would be shared by multiple architectures. For example, x86 and x86_64 would probably share the x86 dialect. So we will have a separate enum for dialects.
A dialect is specified as an identifier token (it must be an identifier) directly after the asm keyword, before the volatile keyword if any, and it tells how to parse the assembly syntax:
I made some other changes here for fun but that's outside the scope of this proposal; this proposal is pointing out that we change the ( ) to braces and inside there is not a string literal but syntax that is more closely integrated with the zig language.
The tokenizer is shared between Zig syntax and all dialects. One tokenizer to rule them all.
The dialect tells the parser how to parse what is inside the braces. You can imagine how x86 is parsed in a drastically different manner than WebAssembly or SPIR-V.
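The relationship between architectures and dialects described in this proposal (several architectures sharing one dialect, with dialects kept in their own enum) could be sketched roughly as follows. This is a Python illustration only; the Dialect members, the ARCH_TO_DIALECT table and default_dialect are hypothetical names, and the architecture strings are assumed to mirror a few std.Target.Cpu.Arch tags.

```python
from enum import Enum


class Dialect(Enum):
    """Hypothetical dialect enum, separate from the architecture enum."""
    X86 = "x86"
    WASM = "wasm"
    SPIRV = "spirv"


# Assumed mapping: several architectures can share a single dialect.
ARCH_TO_DIALECT = {
    "x86": Dialect.X86,
    "x86_64": Dialect.X86,
    "wasm32": Dialect.WASM,
    "wasm64": Dialect.WASM,
    "spirv32": Dialect.SPIRV,
    "spirv64": Dialect.SPIRV,
}


def default_dialect(arch: str) -> Dialect:
    """Look up which dialect the parser should use for a target architecture."""
    try:
        return ARCH_TO_DIALECT[arch]
    except KeyError:
        raise ValueError(f"no inline asm dialect known for arch {arch!r}") from None
```

With a table like this, default_dialect("x86") and default_dialect("x86_64") return the same enum member, so one parser serves both targets.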
Rather than the burden of parsing inline assembly falling on the backend, it falls on the frontend, where it is properly cached and it is easier to report errors. This also provides a way to unify inline assembly across multiple backends; for example right now we send inline assembly straight to LLVM with the LLVM backend, but we have our own bespoke parser in the x86_64 backend. This is a design flaw because we need to have consistent inline assembly syntax between the two backends; we need to parse it in a prior phase of the pipeline and then lower it to x86_64 MIR, or LLVM inline assembly.
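A very rough Python sketch of the "parse once in the frontend, then lower per backend" pipeline is below. The Insn type and both lowering functions are hypothetical; in particular, the string produced by lower_to_llvm_asm is not claimed to be valid LLVM inline assembly, and the dictionaries from lower_to_mir are not the real x86_64 MIR layout. The only point is that both backends consume the same parsed form, so they cannot disagree about syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical frontend representation of one parsed instruction."""
    mnemonic: str
    operands: tuple = ()


def lower_to_llvm_asm(insns):
    """Render the parsed instructions back into one text blob for an LLVM-style backend."""
    lines = []
    for insn in insns:
        ops = ", ".join(insn.operands)
        lines.append(f"{insn.mnemonic} {ops}".rstrip())
    return "\n".join(lines)


def lower_to_mir(insns):
    """Turn the same parsed instructions into structured records for a self-hosted backend."""
    return [{"tag": insn.mnemonic, "ops": list(insn.operands)} for insn in insns]


# The frontend parses (and caches) this once; both backends start from it.
parsed = [Insn("mov", ("rax", "60")), Insn("syscall")]
llvm_text = lower_to_llvm_asm(parsed)
mir = lower_to_mir(parsed)
```

Syntax errors would be reported while building the parsed list, before either lowering function runs.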