ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

`std.json.parseFromTokenSource` incorrectly parses a unicode string under specific circumstances. #20931

Closed. gcoakes closed this issue 3 months ago.

gcoakes commented 3 months ago

Zig Version

0.14.0-dev.839+a931bfada

Steps to Reproduce and Observed Behavior

names.json should be located next to the following test:

const std = @import("std");

test "pass #1" {
    const parsed = try std.json.parseFromSlice(std.json.Value, std.testing.allocator, @embedFile("names.json"), .{});
    defer parsed.deinit();

    try std.testing.expectEqualSlices(
        u8,
        "\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0",
        parsed.value.object.get("commands").?.array.items[51].object.get("action").?.object.get("field").?.string,
    );
}

test "fail #1 -- FixedBufferStream" {
    var stream = std.io.fixedBufferStream(@embedFile("names.json"));

    var reader = std.json.reader(std.testing.allocator, stream.reader());
    defer reader.deinit();

    const parsed = try std.json.parseFromTokenSource(std.json.Value, std.testing.allocator, &reader, .{});
    defer parsed.deinit();

    try std.testing.expectEqualSlices(
        u8,
        "\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0",
        parsed.value.object.get("commands").?.array.items[51].object.get("action").?.object.get("field").?.string,
    );
}

Run the test and observe the failures:

# zig test repro.zig
1/5 repro.test.pass #1...OK
2/5 repro.test.fail #1 -- FixedBufferStream...slices differ. first difference occurs at index 0 (0x0)

============ expected this output: =============  len: 12 (0xC)

E1 85 9F E1 85 A0 E3 85  A4 EF BE A0              ............

============= instead found this: ==============  len: 5 (0x5)

85 A4 EF BE A0                                    .....

================================================

FAIL (TestExpectedEqual)
/home/gcoakes/.local/share/zig/master/lib/std/testing.zig:401:5: 0x1051a14 in expectEqualSlices__anon_2722 (test)
    return error.TestExpectedEqual;
    ^
/home/gcoakes/src/wasm/repro.zig:23:5: 0x1105f97 in test.fail #1 -- FixedBufferStream (test)
    try std.testing.expectEqualSlices(
    ^
...

The following experiments were done:

  1. Attempt with v0.13.0. => Fail
  2. Attempt with v0.12.0. => Fail
  3. Attempt with v0.11.0. => Inconclusive
  4. std.json.parseFromSlice => Pass
  5. FixedBufferStream => Fail
  6. BufferedReader(4096, FixedBufferStream) (buffer boundary before fail point in JSON) => Fail
  7. BufferedReader(8192, FixedBufferStream) (buffer boundary after fail point in JSON) => Fail
  8. Replace unicode bytes with unicode escape sequences. => Pass
  9. Reorder the failing object to be one earlier within the parent array. => Pass
  10. Reorder the failing object to be one later within the parent array. => Pass
  11. Minimize names.json to just the failing string. => Pass
  12. Minimize by removing preceding objects in parent array. => Pass
  13. Minimize by removing following objects in parent array. => Fail

Expected Behavior

The following command illustrates what is expected to be read from that field; you can ignore the trailing newline (0a), which is added by jq:

$ jq '.commands[51].action.field' names.json -r | hexdump -C
00000000  e1 85 9f e1 85 a0 e3 85  a4 ef be a0 0a           |.............|
0000000d

std.json.parseFromTokenSource should correctly handle unicode text within a string without corruption or truncation.

ianprime0509 commented 3 months ago

I'm not able to reproduce this (also on Zig 0.14.0-dev.839+a931bfada); both tests pass for me with the provided input file. Here's my system information (uname -a), in the event this is somehow platform-specific:

Linux toolbox 6.9.10-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 18 21:39:30 UTC 2024 x86_64 GNU/Linux

I noticed that the output provided in the issue suggests there were originally 5 tests in the file, but the code has only 2 tests. While I assume the other 3 tests were just removed to make the example shorter and less cluttered, is it possible that any other parts of the reproduction tests or names.json input file were changed prior to submitting the issue (e.g. as part of the various experiments you were trying)? Just trying to rule out any potential source of variance, since it sounds like this only reproduces under very specific circumstances.

gcoakes commented 3 months ago

is it possible that any other parts of the reproduction tests or names.json input file were changed prior to submitting the issue (e.g. as part of the various experiments you were trying)?

I just double checked the names.json uploaded to this issue, and yes, it is different from the original file. I think it was from experiment 9 or 10. Good catch. I am not able to reproduce with that one either.

I just re-generated the original names.json (via WABT's wast2json utility). Uploaded here: names.json. I am still able to reproduce the issue with that file.

there were originally 5 tests in the file, but the code has only 2 tests.

Correct, those tests were from some of the experiments I was running. Minimizing to just the following still reproduces the issue for me:

const std = @import("std");

test "fail #1 -- FixedBufferStream" {
    var stream = std.io.fixedBufferStream(@embedFile("names.json"));

    var reader = std.json.reader(std.testing.allocator, stream.reader());
    defer reader.deinit();

    const parsed = try std.json.parseFromTokenSource(std.json.Value, std.testing.allocator, &reader, .{});
    defer parsed.deinit();

    try std.testing.expectEqualSlices(
        u8,
        "\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0",
        parsed.value.object.get("commands").?.array.items[51].object.get("action").?.object.get("field").?.string,
    );
}
My platform information:

$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux gcoakes-laptop 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux

ianprime0509 commented 3 months ago

Thanks! That helps; I was able to reproduce the issue with your latest file. Looking at a hex dump of names.json, I happened to notice that the suffix of the string (the part that actually gets parsed in the test) begins exactly at byte position 0x2000, which is a multiple of the default buffer size:

https://github.com/ziglang/zig/blob/a655c15c4004d553ea462652f69acd37e4514f79/lib/std/json/scanner.zig#L71

This seems to be the root of the issue, since the same problem can be reproduced on a much smaller input (just the affected string) by making the JSON reader buffer size small enough that it splits up the string at the same point:

const std = @import("std");

test {
    const s = "\"\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0\"";
    var stream = std.io.fixedBufferStream(s);

    var reader = std.json.Reader(8, @TypeOf(stream.reader())).init(std.testing.allocator, stream.reader());
    defer reader.deinit();

    const parsed = try std.json.parseFromTokenSource(std.json.Value, std.testing.allocator, &reader, .{});
    defer parsed.deinit();

    try std.testing.expectEqualSlices(
        u8,
        "\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0",
        parsed.value.string,
    );
}

This produces the same failure:

slices differ. first difference occurs at index 0 (0x0)

============ expected this output: =============  len: 12 (0xC)

E1 85 9F E1 85 A0 E3 85  A4 EF BE A0              ............

============= instead found this: ==============  len: 5 (0x5)

85 A4 EF BE A0                                    .....

================================================

1/1 json.test_0...FAIL (TestExpectedEqual)

This issue seems to occur specifically when a UTF-8-encoded codepoint is split across a buffer length boundary. The state machine loop in Scanner.next uses expectByte for the "mid-codepoint in string" states; if that function reaches the end of the internal buffer, it returns error.BufferUnderrun, which causes the Reader to refill its buffer. That refill loses anything scanned in the string up to that point (such as the first 7 bytes of the string in this case).

@thejoshwolfe do you consider it to be an API requirement for all partial_string tokens to contain only complete codepoints? Currently that is the case, but I don't see any documentation explicitly guaranteeing this, and I think this issue should be straightforward to solve if it's relaxed (handle expectByte underrun in every mid-codepoint state by potentially returning a partial string token up to that point).
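For readers following along, the byte arithmetic of the truncation can be checked with a small language-agnostic sketch (Python here; the 8-byte buffer and the chunking are assumptions modeling the minimized reproduction above, not the real std.json scanner code):

```python
# Toy model of the truncation (an assumption-laden sketch, not the real
# std.json scanner): with an 8-byte buffer, the first fill holds the
# opening quote plus the first 7 string bytes; on BufferUnderrun the
# refill discards everything scanned so far.
BUFFER_SIZE = 8  # matches the minimized reproduction's std.json.Reader(8, ...)

payload = b"\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0"
doc = b'"' + payload + b'"'

first_fill = doc[:BUFFER_SIZE]   # quote + 7 string bytes, ending mid-codepoint
lost = first_fill[1:]            # string bytes dropped by the buggy refill
resumed = doc[BUFFER_SIZE:-1]    # what the parser keeps after the refill

assert lost == payload[:7]
assert resumed == payload[7:]
assert resumed == b"\x85\xa4\xef\xbe\xa0"  # the 5 bytes in the failure dump
print(resumed.hex(" "))  # 85 a4 ef be a0
```

The surviving bytes are exactly the "instead found this" output from the test failure.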

gcoakes commented 3 months ago

begins was exactly at byte position 0x2000, which is a multiple of the default buffer size

Damn, you caught another of my mistakes. I had a hunch about buffer sizes being related. I checked to see if the split point lay on a page boundary by using VSCode's UI which shows the number of selected characters. I didn't think about character != byte.
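As a footnote on that pitfall, a two-line check (Python; the codepoints are decoded from the expected byte dump earlier in the issue) shows how far apart the character count and the byte count are for this string:

```python
# The affected string is 4 characters but 12 bytes, so a character-based
# selection count in an editor says nothing about byte offsets in the file.
# Codepoints decoded from the expected byte dump above.
s = "\u115f\u1160\u3164\uffa0"

assert len(s) == 4                    # what an editor's character count shows
assert len(s.encode("utf-8")) == 12   # what the scanner's buffer sees
```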

thejoshwolfe commented 3 months ago

You've found a bug! This logic:

https://github.com/ziglang/zig/blob/a655c15c4004d553ea462652f69acd37e4514f79/lib/std/json/scanner.zig#L1029-L1032

is missing from these state handlers:

https://github.com/ziglang/zig/blob/a655c15c4004d553ea462652f69acd37e4514f79/lib/std/json/scanner.zig#L1256-L1325

do you consider it to be an API requirement for all partial_string tokens to contain only complete codepoints?

My original intent was that partial strings might end mid-codepoint, and you'd need to concatenate the bytes to get the complete string. Parsing into std.json.Value would handle that without a problem. But given that this bug exists, I clearly never tested this.

I believe the bugfix will be to let partial strings stop mid-codepoint, but I could also see an argument for emitting one of the .partial_string_escaped_* tokens for the buffer-spanning codepoint. That would mean tokens emitted from the scanner always deal in coherent codepoints. However, I'm not sure why this would be valuable: there are many bad reasons to care about codepoint boundaries when writing string-handling code, so it's not obviously desirable to cater to that requirement. If you're doing UTF-8 validation (which is exactly where this std.json.Scanner code got tripped up) or encoding conversion, you'd definitely want coherent codepoints. I'm open to the idea of guaranteeing codepoint coherence on the Token boundary, but I suspect it will make the code slightly slower.

In the absence of strong arguments one way or the other, I'll probably try to just do whatever is less complex to implement.
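To illustrate why the relaxed option is safe for concatenating consumers, here is a sketch (Python; the split offset is an assumption mirroring where the 8-byte buffer boundary falls in the reproduction above, and this is not the actual fix):

```python
# Sketch: partial string tokens that split a UTF-8 codepoint still
# reassemble into valid text, provided the consumer concatenates the
# raw bytes before decoding. The split offset (7) mirrors where an
# 8-byte buffer boundary falls in the reproduction above.
payload = b"\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0"
partial_tokens = [payload[:7], payload[7:]]

def decodes(chunk: bytes) -> bool:
    """Report whether a chunk is valid UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Neither piece is valid UTF-8 in isolation...
assert not decodes(partial_tokens[0])  # ends mid-codepoint (lone 0xE3)
assert not decodes(partial_tokens[1])  # starts with a continuation byte
# ...but concatenation restores the original, valid string.
assert b"".join(partial_tokens).decode("utf-8") == "\u115f\u1160\u3164\uffa0"
```

This is what parsing into std.json.Value effectively does, which is why mid-codepoint partial tokens would be harmless there.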

thejoshwolfe commented 3 months ago

I clearly never tested this.

I do have this test: https://github.com/ziglang/zig/blob/a655c15c4004d553ea462652f69acd37e4514f79/lib/std/json/scanner_test.zig#L352

but that test always puts the non-ASCII codepoints at the start of the string, which misses the case where we drop partial strings. :facepalm:

thejoshwolfe commented 3 months ago

which misses the part where we drop partial strings

Actually, the test was reproducing the issue but failing to catch it in the test assertions: I just forgot to enumerate the partial string token types in expectEqualTokens(). :facepalm: :facepalm: