ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
33.5k stars 2.45k forks source link

Parser crash when invalid bytes are inside a comment (depends on comment placement) #13690

Open squeek502 opened 1 year ago

squeek502 commented 1 year ago

Zig Version

0.11.0-dev.363+870872dd6

Steps to Reproduce and Observed Behavior

Save the following in a Windows-1252 encoded file:

// ¶
test {}

(the test {} is not important, it can be anything as long as it's a non-comment AFAICT)

Run e.g. zig fmt on it:

thread 33228 panic: integer overflow
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:3755:37: 0x7ff7c825983c in tokensOnSameLine (zig.exe.obj)
        return std.mem.indexOfScalar(u8, p.source[p.token_starts[token1]..p.token_starts[token2]], '\n') == null;
                                    ^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:209:58: 0x7ff7c82599c1 in warnMsg (zig.exe.obj)
            => if (msg.token != 0 and !p.tokensOnSameLine(msg.token - 1, msg.token)) {
                                                         ^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:236:22: 0x7ff7c843840a in failMsg (zig.exe.obj)
        try p.warnMsg(msg);
                     ^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:222:25: 0x7ff7c8436d71 in fail (zig.exe.obj)
        return p.failMsg(.{ .tag = tag, .token = p.tok_i });
                        ^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:604:43: 0x7ff7c8436b7d in expectTestDecl (zig.exe.obj)
        if (block_node == 0) return p.fail(.expected_block);
                                          ^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:616:32: 0x7ff7c825ef28 in expectTestDeclRecoverable (zig.exe.obj)
        return p.expectTestDecl() catch |err| switch (err) {
                               ^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:277:75: 0x7ff7c825c25b in parseContainerMembers (zig.exe.obj)
                    const test_decl_node = try p.expectTestDeclRecoverable();
                                                                          ^
C:\Users\Ryan\Programming\Zig\zig\lib\std\zig\parse.zig:60:58: 0x7ff7c811b2c4 in parse (zig.exe.obj)
    const root_members = try parser.parseContainerMembers();
                                                         ^
C:\Users\Ryan\Programming\Zig\zig\src\main.zig:4361:33: 0x7ff7c8272372 in fmtPathFile (zig.exe.obj)
    var tree = try std.zig.parse(fmt.gpa, source_code);
                                ^
C:\Users\Ryan\Programming\Zig\zig\src\main.zig:4286:16: 0x7ff7c8121d9d in fmtPath (zig.exe.obj)
    fmtPathFile(fmt, file_path, check_mode, dir, sub_path) catch |err| switch (err) {
               ^
C:\Users\Ryan\Programming\Zig\zig\src\main.zig:4251:56: 0x7ff7c80fe6c4 in cmdFmt (zig.exe.obj)
        try fmtPath(&fmt, file_path, check_flag, fs.cwd(), file_path);
                                                       ^
C:\Users\Ryan\Programming\Zig\zig\src\main.zig:263:22: 0x7ff7c80d251a in mainArgs (zig.exe.obj)
        return cmdFmt(gpa, arena, cmd_args);
                     ^
C:\Users\Ryan\Programming\Zig\zig\src\stage1.zig:56:24: 0x7ff7c824cbd7 in main (zig.exe.obj)
        stage2.mainArgs(gpa, arena, args) catch unreachable;
                       ^

This is caused by p.token_starts[token1] being larger than p.token_starts[token2]. Specifically, it's attempting:

p.source[5..3]

Note that if the comment is moved after the test block like this:

// either of the below placements avoid the panic
test {} // ¶
// ¶

then we get a nice error message as expected:

invalid.zig:2:12: error: expected type expression, found 'invalid bytes'
test {} // �
           ^
invalid.zig:2:12: note: invalid byte: '\xb6'
test {} // �
           ^

Expected Behavior

invalid.zig:1:4: error: expected type expression, found 'invalid bytes'
// �
   ^
invalid.zig:1:4: note: invalid byte: '\xb6'
// �
   ^
ghost commented 1 year ago

I investigated this a bit. Seems like the problem is that it finds the invalid bytes token after the test token. If the comment contains valid bytes, it doesn't produce a separate token at all, so i assume it's because comments are considered part of the same token as the thing they're on.

ghost commented 1 year ago

I noticed that the tokenizer seems designed intentionally to do this; it has a pending_invalid_token field that it stores the invalid char in, but then goes on until it finds a valid token, returns that first, then next time it returns the pending invalid token.

There must a reason it's designed this way, but I can't imagine what it is. I would like to get whoever wrote this to weigh in. I was able to fix this issue with a patch that makes it so, after finding the valid token, it first returns the invalid one, then stores the valid token in pending_invalid_token (so it'll be returned next time). Obviously not a good solution because it abuses the field name.

ghost commented 1 year ago

My change also causes this test to fail:

test "invalid literal/comment characters" {
    try testTokenize("\"\x00\"", &.{
        .string_literal,
        .invalid,
    });

(But works if I switch the order of expected tokens)

According to git blame, the pending_invalid_token design was implemented in 2017 by Josh Wolfe; at that time it was for catching ASCII control codes. I don't know his github username, though.

rohlem commented 6 months ago

Probably related, zig version 0.12.0-dev.2763+7204eccf5 (for me on Windows 11) seems to have an off-by-one error when outputting the invalid byte for me. On a 4-byte file with only //xL , where x is an unprintable ASCII character (tried with 0x00, 0x01, 0x15, 0x16), zig fmt and zig test both give the following output:

main.zig:1:3: error: expected type expression, found 'invalid bytes'
//L
  ^
main.zig:1:4: note: invalid byte: 'L'
//L
   ^

obviously the invalid byte isn't 'L' but the byte before it - the second output is off-by-one.

Also note that the invalid byte isn't displayed at all - both CMD and Windows Terminal simply skip it (while f.e. git bash does display a character there) - should this maybe be accounted for?

gooncreeper commented 1 month ago

Cannot reproduce on master or 0.13.0 Just to be sure, here is the hexdump of the file I tested

00000000  2f 2f 20 b6 0a 74 65 73  74 20 7b 7d 0a           |// ..test {}.|

Additionally the off by one error was fixed by #20559