ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
35.1k stars 2.56k forks source link

parsing failure on windows #9257

Closed vrischmann closed 3 years ago

vrischmann commented 3 years ago

Hi,

the following code fails to compile only on Windows:

fn createTestTables() !void {
    const bar = &[_][]const u8{
        \\foobar
    };
    _ = bar;
}

I get this error:

>> C:\Users\vincent\dev\zig-windows-x86_64-0.9.0-dev.321+15a030ef3\zig.exe build-obj .\sqlite.zig
.\sqlite.zig:3:17: error: expected ',', found invalid bytes
        \\foobar
                ^

I'm testing with zig-windows-x86_64-0.9.0-dev.321+15a030ef3.zip.

Weird thing is it works fine on Linux and FreeBSD.

jmc-88 commented 3 years ago

I can reproduce the issue on a Linux system, provided that I save the file with DOS newlines (CR+LF). Likely the lexer is choking on the unexpected CR.

Easy steps to reproduce:

jmc-88 commented 3 years ago

A patch as simple as this seems to make the tokenizer happy:

diff --git a/lib/std/zig/tokenizer.zig b/lib/std/zig/tokenizer.zig
index 94a20d958..edf690ebe 100644
--- a/lib/std/zig/tokenizer.zig
+++ b/lib/std/zig/tokenizer.zig
@@ -834,7 +834,7 @@ pub const Tokenizer = struct {
                 },

                 .multiline_string_literal_line => switch (c) {
-                    '\n' => {
+                    '\r', '\n' => {
                         self.index += 1;
                         break;
                     },

I'll see if there's any tests where I can record the regression and send a PR out for review.

Avokadoen commented 3 years ago

I have the same issue on zig-windows-x86_64-0.9.0-dev.397+b7da1b2d4

andrewrk commented 3 years ago

Thanks for the patch @jmc-88, however I reverted it because this was working as designed. Carriage returns are discouraged in general, removed by zig fmt, and in the case of multi-line string literals, not allowed at all. If you need carriage returns in a string literal, you have to use a clunkier syntax.

andrewrk commented 3 years ago

See https://github.com/ziglang/zig-spec/issues/38 for clarification. I opened an issue because the draft spec and the current compiler implementation do not match up exactly; this issue has the canonical specification.

Hadron67 commented 3 years ago

If no CRs are allowed in multi-line string literals, does this mean it's impossible to include multi-line string literals in source files with DOS newlines without an editor plugin?

andrewrk commented 3 years ago

I'm not sure what you mean about an editor plugin, but yes, if you need a string with DOS newlines, it would require a strategy such as:

Hadron67 commented 3 years ago

I mean, if my source file was saved with DOS newlines and it includes multiline strings, the newlines in them would become CRLF and thus rejected by the compiler, as in this issue. But I don't mean to verbatim include those CRLFs in my string, I just want them to be standard '\n's. To solve this I'll need to either save the file with UNIX newlines, or use some editor plugin that converts CRs in multiline strings to NL.

I know language such as Javascript accepts all styles of newlines in multiline string literals, but they will get converted to '\n', making the newlines unambiguous. Maybe this should also be zig's case?

jmc-88 commented 3 years ago

Thanks for the patch @jmc-88, however I reverted it because this was working as designed. Carriage returns are discouraged in general, removed by zig fmt, and in the case of multi-line string literals, not allowed at all. If you need carriage returns in a string literal, you have to use a clunkier syntax.

OK, but then looking at https://github.com/ziglang/zig-spec/issues/38, '\t' characters should be similarly disallowed, while currently the tokenizer ignores them? From the above it sounds this line should also disappear.

In any case, if certain bytes are disallowed it may help to improve the error message to point out which invalid bytes were found and rejected.

I mean, if my source file was saved with DOS newlines and it includes multiline strings, the newlines in them would become CRLF and thus rejected by the compiler, as in this issue. But I don't mean to verbatim include those CRLFs in my string, I just want them to be standard '\n's. To solve this I'll need to either save the file with UNIX newlines, or use some editor plugin that converts CRs in multiline strings to NL.

I think what @andrewrk suggests is that you need to apply zig fmt as part of your development workflow, to automatically convert these bytes sequences for you.

vrischmann commented 3 years ago

I just got back to this. My issue is that git on windows has core.autocrlf=true by default (at least on my vanilla installation) which now will cause code to not compile. Simply cloning like this works: git clone --config core.autocrlf=false but I don't think this is ideal.

vrischmann commented 3 years ago

Also, I can't run zig fmt on my file either:

PS C:\Users\vincent\dev\tmp\invalid-bytes> C:\Users\vincent\dev\zig-windows-x86_64-0.9.0-dev.453+7ef854682\zig.exe fmt .\main.zig
.\main.zig:3:17: error: expected ',', found invalid bytes
        \\foobar
                ^
.\main.zig:3:18: note: invalid byte: ' '
        \\foobar
                 ^
.\main.zig:3:17: error: expected expression, found 'invalid'
        \\foobar
                ^
.\main.zig:3:18: note: invalid byte: ' '
        \\foobar
                 ^
.\main.zig:4:6: error: expected test, comptime, var decl, or container field, found ';'
    };
     ^
.\main.zig:5:12: error: expected ',', found ';'
    _ = bar;
           ^
.\main.zig:6:1: error: expected 'eof', found '}'
}
^
Hadron67 commented 3 years ago

I think what @andrewrk suggests is that you need to apply zig fmt as part of your development workflow, to automatically convert these bytes sequences for you.

zig fmt rejects them too, as ziglang/zig-spec#38 suggests:

  • zig fmt may not mangle multi line string literals, and therefore the control characters TAB and CR are rejected by the grammar inside multi-line string literals.
crystalthoughts commented 3 years ago

This has really sent me down a rabbit hole trying to get certain repositories to work on Windows! I think this is one of those times where striving for simplicity is actually making things harder in certain cases.

It's quite a subtle thing to try and debug for someone not already aware of the issue, especially given the non-visible nature of whitespace. Is there a way for this to fail gracefully? Either by a conversion prompt or a tweak to how things are handled?

tecanec commented 2 years ago

I must point out that this has confused me, as well. CR is mostly invisible to the programmer, so it gets confusing when your code suddenly won't compile on Windows and the compiler is complaining about an invalid byte in your string literal that you can't see.

I agree with throwing compile errors when the programmer themself makes a mistake. However, I do think that this particular issue is less about the programmer and more about compatibility with software made by third parties. And even though most proper text editors have settings for choosing between CRLF or just LF, I still think that it's important to support any could-be outliers. In other words: The language belongs to Zig, but the format for encoding text files doesn't. Zig merely supports it, and holds no more power over the encoding than the average text editor does.

@Hadron67 mentioned the idea of simply converting the line endings of string literals to LF. I agree with that idea. While it technically is a "hidden element", it does simplify things overall and shouldn't be hard to document. It means that you can write the exact same code in any text editor and get the exact same result, regardless of whether the editor uses CRLF or LF. I'm fine with using \r when I really have to, as long as everything else is consistent.

However, if the compiler really must reject CR like this, I'd appreciate it if it were at least being less confusing about it. Rejecting CR in multiline string literals specifically seems arbitrary and confusing, so I honestly think that it'd be better to just outright reject CR everywhere, period. Other than that, simply making the compile error a bit more specific (such as error: expected ',', found invalid byte 'CR') seems like a safe bet.

motiejus commented 1 year ago

I suggest we re-open this issue.

One more data point: https://lists.sr.ht/~motiejus/bazel-zig-cc/%3C167179347159.25324.12223141818448163553-0%40git.sr.ht%3E#%3Cee9821ec-6b7c-4d62-a1f2-0b54cc2ef728@hahn.graphics%3E

Extract from @FabianHahn email:

Zig further fails to compile launcher.zig with the following error:

launcher.zig:70:73: error: expected ';' after declaration

     \\Usage: <...>/tools/<target-triple>/{[zig_tool]s}{[exe]s} <args>...

                                                                         ^~~~

launcher.zig:71:3: note: invalid byte: ' '

     \\

   ^

The reason for this is that Git on Windows inserts carriage return characters before newlines when checking out local files so they show up fine in Windows text editors where \r\n is the default for line breaks. This then breaks the zig parser's support for multiline constants as detailed in https://github.com/ziglang/zig/issues/9257 which is marked as closed even though it doesn't look resolved to me at all. Converting newlines is the default behavior for Git unless you explicitly disable it by setting core.autocrlf in your Git config. While that would be a workaround, I would find this an uncommon thing to do and an odd thing to require of Windows users, which is something also pointed out in the above issue.

Vexu commented 1 year ago

That should be covered by #11414