Open wooster0 opened 7 months ago
as long as there can be a way such that zig build-*
rejects it but zig fmt
can still accept and fix it, then I support this.
+1, we regularly step into this in TigerBeetle, where there's an un-noticed trailing whitespace in \\hello world
(which is annoying in and of itself), which then subsequently leaks leaks into whatever we are using this string literal for.
I wanted to quickly fix this today, but, after about an hour hacking, I think I'll give up for now :-)
I did come up with some tests here:
Note for whoever wants to fix this next (perhaps future me?):
" " | "\t"
and warn. The problem with this approach is that the parser doesn't know where the token ends, only where it starts. So, this'll require scanning through the literal's text the second time around, which feels redundant..multiline_string_literal_invalid
to the set of tokens returned by the lexer, but I don't like it because that means duplciating all the call sites to handle both .multiline_string_literal
and .multiline_string_literal_invalid
.invalid_trailing_whitespace
token from the lexer, so that, onece it sees a multiline literal with trailing whitespace, it'd split it in two tokens. This doesn't work because in the AST multiline string literal holds a contiguous slice of tokens for individual lines, and, if you intersperse them with invalid tokens, everything that handles that AST node also have to changed. Sadly, I was smart enough to come up with all those ideas, but not smart enough to realize that they don't work before implementing them, so I'll leave it at that for now :rofl:
Some further ideas:
Yeah, I'd expect this to be done as an AstGen error.
another benefit of making it an AstGen error is that zig fmt could autofix it
@nektro per the literal first sentence of the proposal, we will not have zig fmt
modify such code (emphasis my own):
I propose
zig fmt
to not change and the compiler to complain, for a start, when there is ASCII whitespace at the end of a multiline string.
Having zig fmt
silently perform a hard-to-spot, potentially functional, change to code is not a good idea. The whole point of this error is that the intent is unclear. Thus, it would be completely irrational for zig fmt
to assume that intent.
if the user wants trailing space they should not be using a multiline string. the whole point of this proposal is that it'd be completely invisible to both the writer and reader of the code. barely any software warns about it and zig fmt asserting/protecting you from it would not be a bad thing
edit: oh i see i already left this opinion when this was first opened so ill stop my comments here
For what it's worth, I don't like it. If I declare a string literal, then most likely that is literally the string that I want.
An example where this would be unfortunate is if you're generating markdown (or similar) files from templates, and you want to include explicit line-breaks (two trailing spaces) in the template.
Couldn't you lint this on your own machines or ci?
bash-5.2$ cat temp.zig
const _ =
\\hello
\\world
\\oops
\\foo
\\bar
;
bash-5.2$ grep -EHn '^\s*\\\\.*\s+$' temp.zig
temp.zig:4: \\oops
An example where this would be unfortunate is if you're generating markdown (or similar) files from templates, and you want to include explicit line-breaks (two trailing spaces) in the template.
This is precisely the case where it is beneficial to see trailing white-space marked explicitly in the source code. It's hard to spot a bug in
writeMarkdown(
\\Roses are red
\\ Violets are blue,
\\Sugar is sweet
\\ And so are you.
);
see trailing white-space marked explicitly
But the proposal is to forbid it, not to mark it.
We don't need a new feature to mark trailing whitespace because there is a number of existing language features you can use for this niche purpose sufficiently effectively:
// Use normal string, the perfect solution for this specific example
writeMarkdown("" ++
"Roses are red \n" ++
" Violets are blue, \n" ++
"Sugar is sweet \n" ++
" And so are you. \n");
// For one off line break, the following works
writeMarkdown(
\\Roses are red
\\ Violets are blue,
++ " \n" ++
\\Sugar is sweet
\\ And so are you.
);
// If you need this a lot for a particular use-case, you can make domain-specific helper:
writeMarkdown(pre(
\\Roses are red
\\ Violets are blue,
\\Sugar is sweet
\\ And so are you.
));
writeMarkdown(
\\Roses are red {br}
\\ Violets are blue, {br}
\\Sugar is sweet {br}
\\ And so are you. {br}
);
// Finally, it might be worth to stop and think whether you actually _must_
// use a trailing whitespace, or whether there's a better solution:
writeMarkdown(
\\Roses are red\
\\ Violets are blue,\
\\Sugar is sweet\
\\ And so are you.\
);
niche purpose
That is probably true, but there aren't very many people arguing either way here.
\\Roses are red\
I suppose could live with this.
Edited: Grepping a large code base such as zig or tigerbeetle for this takes milliseconds on my machine, though it's only a few years old, I understand it can take some amount longer on older hardware or laptops. Probably not that much longer, though.
I wanted to see if looking for this at build time is feasible and came up with this proof of concept. It's rough and not portable but, it seems like it could be done.
diff --git a/build.zig b/build.zig
index 054b67b..b1ea75c 100644
--- a/build.zig
+++ b/build.zig
@@ -46,6 +46,7 @@ pub fn build(b: *std.Build) !void {
// Top-level steps you can invoke on the command line.
const build_steps = .{
+ .trail = b.step("trail", "Look for trailing whitespace"),
.aof = b.step("aof", "Run TigerBeetle AOF Utility"),
.check = b.step("check", "Check if TigerBeetle compiles"),
.clients_c = b.step("clients:c", "Build C client library"),
@@ -257,6 +258,28 @@ pub fn build(b: *std.Build) !void {
break :blk install.getDest();
};
+ { // zig build trail
+ var code: u8 = undefined;
+ const grep = b.runAllowFail(&.{
+ "grep",
+ "-EHnr",
+ \\^\s.*\s+$
+ ,
+ "--include=*.zig",
+ }, &code, .Inherit) catch |err| switch (err) {
+ error.ExitCodeFailure => if (code == 1) "ok" else blk: {
+ std.debug.print("{}\n", .{code});
+ break :blk "what";
+ },
+ else => return err,
+ };
+ std.debug.print("{}\n", .{code});
+ if (code > 2) {
+ std.debug.print("{s}\n", .{grep});
+ return error.LintError;
+ }
+ }
+
{ // zig build aof
const aof = b.addExecutable(.{
.name = "aof",
diff --git a/src/vopr.zig b/src/vopr.zig
index da11e4c..414b56f 100644
--- a/src/vopr.zig
+++ b/src/vopr.zig
@@ -234,7 +234,7 @@ pub fn main() !void {
\\ SEED={}
\\
\\ replicas={}
- \\ standbys={}
+ \\ standbys={}
\\ clients={}
\\ request_probability={}%
\\ idle_on_probability={}%
And the output:
$ zig build trail
170
src/vopr.zig:237: \\ standbys={}
error: LintError
/home/thomas/Tools/tigerbeetle/build.zig:279:13: 0x11212c9 in build (build)
return error.LintError;
^
/home/thomas/zig/0.13.0/files/lib/std/Build.zig:2117:24: 0x1103d37 in runBuild__anon_8835 (build)
.ErrorUnion => try build_zig.build(b),
^
/home/thomas/zig/0.13.0/files/lib/compiler/build_runner.zig:301:9: 0x10ff060 in main (build)
try builder.runBuild(root);
^
error: the following build command failed with exit code 1:
/home/thomas/Tools/tigerbeetle/.zig-cache/o/5ba6ffbcb74aeb9315c6bd2f8977a108/build /home/thomas/zig/0.13.0/files/zig /home/thomas/Tools/tigerbeetle /home/thomas/Tools/tigerbeetle/.zig-cache /home/thomas/.cache/zig --seed 0x9d32edc7 -Zbd117d5a66a8de31 trail
For what it's worth, I know I don't have much of a leg to stand on but I didn't want this to go through without mentioning it.
Status quo
Status quo of trailing whitespace:
There is a normal space followed by a full-width space.
zig fmt
.There are a bunch of trailing full-width spaces.
Technically it'd be useful if
zig fmt
formatted away trailing Unicode whitespace as well but the Zig compiler currently doesn't really have much Unicode awareness. With that in mind, this is what happens for multiline strings:zig fmt
does not format anything away and the compiler doesn't error.Trailing whitespace as it is handled right now is in my opinion perfectly fine except for multiline strings.
Proposal
I propose
zig fmt
to not change and the compiler to complain, for a start, when there is ASCII whitespace at the end of a multiline string.The reason is that trailing whitespace at the end of a multiline string is almost always a mistake and the intent is rather unclear (Communicate intent precisely). Was the intent to have this trailing whitespace or not?
Example: https://github.com/ziglang/zig/blob/778ab767b1e367233a1a3284e2c24d2c27b44602/lib/compiler/aro/aro/Driver.zig#L111-L116
The
-fdollars-in-identifiers
and-fno-dollars-in-identifiers
lines have trailing ASCII spaces. Select the text to see them. Mistakes like this are common in multiline strings and even your editor won't format them away for you if it knows about Zig multiline strings (and so knows that the spaces are part of a string, which are part of the program and have semantic implications).Was this really a mistake or did iddev5 (https://github.com/Vexu/arocc/pull/152/files) mean to put these ASCII spaces there?
The intent is more or less unclear and so I propose the compiler to behave like this:
Meaning, the fix would be:
Now intent is very clear. You specifically concatenated spaces in between (for whatever reason) so clearly you intended to do this. Because
""
strings are singleline, you can inspect whatever is in between those two double quotes and you don't have to worry about trailing whitespace in multiline strings that you didn't see.Here, again, as above it would technically be useful if it errors for Unicode whitespace as well. But because (as above)
zig fmt
doesn't format away trailing Unicode whitespace, I think it's fine for the compiler to only complain for ASCII spaces for now, that is untilzig fmt
starts formatting away trailing Unicode whitespace as well.Reasoning
The rules "Communicate intent precisely" and "Favor reading code over writing code".
When I say non-ASCII Unicode trailing whitespace, I am also talking about Related Unicode characters with property White_Space=no (second table), such as zero-width spaces. They fit this category as well.