Closed: gcoakes closed this issue 3 months ago.
I'm not able to reproduce this (also on Zig 0.14.0-dev.839+a931bfada); both tests pass for me with the provided input file. Here's my system information (`uname -a`), in the event this is somehow platform-specific:

```
Linux toolbox 6.9.10-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 18 21:39:30 UTC 2024 x86_64 GNU/Linux
```
I noticed that the output provided in the issue suggests there were originally 5 tests in the file, but the code has only 2 tests. While I assume the other 3 tests were just removed to make the example shorter and less cluttered, is it possible that any other parts of the reproduction tests or `names.json` input file were changed prior to submitting the issue (e.g. as part of the various experiments you were trying)? Just trying to rule out any potential source of variance, since it sounds like this only reproduces under very specific circumstances.
> is it possible that any other parts of the reproduction tests or `names.json` input file were changed prior to submitting the issue (e.g. as part of the various experiments you were trying)?
I just double-checked the `names.json` uploaded to this issue, and yes, it is different from the original file. I think it was from experiment 9 or 10. Good catch. I am not able to reproduce with that one either.
I just re-generated the original `names.json` (via WABT's `wast2json` utility) and uploaded it here: names.json. I am still able to reproduce the issue with that file.
> there were originally 5 tests in the file, but the code has only 2 tests.
Correct, those tests were from some of the experiments I was running. Minimizing to just the following still reproduces the issue for me:
```zig
const std = @import("std");

test "fail #1 -- FixedBufferStream" {
    var stream = std.io.fixedBufferStream(@embedFile("names.json"));
    var reader = std.json.reader(std.testing.allocator, stream.reader());
    defer reader.deinit();

    const parsed = try std.json.parseFromTokenSource(std.json.Value, std.testing.allocator, &reader, .{});
    defer parsed.deinit();

    try std.testing.expectEqualSlices(
        u8,
        "\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0",
        parsed.value.object.get("commands").?.array.items[51].object.get("action").?.object.get("field").?.string,
    );
}
```
```console
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux gcoakes-laptop 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux
```
Thanks! That helps, I was able to reproduce the issue with your latest file. I happened to notice (looking at a hex dump of `names.json`) that the point where the suffix of the string (what actually gets parsed in the test) begins was exactly at byte position 0x2000, which is a multiple of the default buffer size: https://github.com/ziglang/zig/blob/a655c15c4004d553ea462652f69acd37e4514f79/lib/std/json/scanner.zig#L71

This seems to be the root of the issue, since the same problem can be reproduced on a much smaller input (just the affected string) by making the JSON reader buffer size small enough that it splits up the string at the same point:
```zig
const std = @import("std");

test {
    const s = "\"\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0\"";
    var stream = std.io.fixedBufferStream(s);
    var reader = std.json.Reader(8, @TypeOf(stream.reader())).init(std.testing.allocator, stream.reader());
    defer reader.deinit();

    const parsed = try std.json.parseFromTokenSource(std.json.Value, std.testing.allocator, &reader, .{});
    defer parsed.deinit();

    try std.testing.expectEqualSlices(
        u8,
        "\xe1\x85\x9f\xe1\x85\xa0\xe3\x85\xa4\xef\xbe\xa0",
        parsed.value.string,
    );
}
```
This produces the same failure:
```
slices differ. first difference occurs at index 0 (0x0)
============ expected this output: ============= len: 12 (0xC)
E1 85 9F E1 85 A0 E3 85 A4 EF BE A0  ............
============= instead found this: ============== len: 5 (0x5)
85 A4 EF BE A0  .....
================================================
1/1 json.test_0...FAIL (TestExpectedEqual)
```
This issue seems to occur specifically when a UTF-8-encoded codepoint is split on a buffer length boundary: the state machine loop in `Scanner.next` is using `expectByte` for "mid-codepoint in string" states, and if that function reaches the end of the internal buffer, it will return `error.BufferUnderrun`, which will cause the `Reader` to refill its buffer, but that loses anything scanned in the string up to that point (such as the first 7 bytes of the string in this case).
@thejoshwolfe do you consider it to be an API requirement for all `partial_string` tokens to contain only complete codepoints? Currently that is the case, but I don't see any documentation explicitly guaranteeing this, and I think this issue should be straightforward to solve if it's relaxed (handle `expectByte` underrun in every mid-codepoint state by potentially returning a partial string token up to that point).
> begins was exactly at byte position 0x2000, which is a multiple of the default buffer size
Damn, you caught another of my mistakes. I had a hunch about buffer sizes being related. I checked to see if the split point lay on a page boundary by using VSCode's UI which shows the number of selected characters. I didn't think about character != byte.
You've found a bug! This logic: https://github.com/ziglang/zig/blob/a655c15c4004d553ea462652f69acd37e4514f79/lib/std/json/scanner.zig#L1029-L1032 is missing from these state handlers: https://github.com/ziglang/zig/blob/a655c15c4004d553ea462652f69acd37e4514f79/lib/std/json/scanner.zig#L1256-L1325
> do you consider it to be an API requirement for all `partial_string` tokens to contain only complete codepoints?
My original intent was that partial strings might end mid-codepoint, and you'd need to concatenate the bytes to get a complete understanding. Parsing into `std.json.Value` would handle that no problem. But given that this bug exists, I clearly never tested this.
I believe that the bugfix will be to stop partial strings mid-codepoint, but I could also see an argument for emitting one of the `.partial_string_escaped_*` tokens for the buffer-spanning codepoint. That would mean that tokens emitted from the scanner are always dealing with coherent codepoints. However, I'm not sure why this would be valuable. There are a lot of bad reasons to care about codepoint boundaries when writing string handling code, so it's not obviously desirable to cater to that requirement. If you're doing UTF-8 validation (which is exactly where this `std.json.Scanner` code got tripped up) or encoding conversion, you'd definitely want coherent codepoints. I'm open to the idea of guaranteeing codepoint coherence on the Token boundary, but I suspect it will make the code slightly slower.
In the absence of strong arguments one way or the other, I'll probably try to just do whatever is less complex to implement.
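The concatenation argument can be illustrated outside of Zig. This Python sketch (with hypothetical fragment boundaries, not the scanner's actual tokens) shows that partial tokens split mid-codepoint are harmless to a consumer that joins raw bytes before decoding, while decoding each fragment on its own fails:

```python
# The string from the repro, split mid-codepoint at the same place an
# 8-byte buffer would split it (after 1 of the 3 bytes of U+3164).
fragments = [b"\xe1\x85\x9f\xe1\x85\xa0\xe3", b"\x85\xa4\xef\xbe\xa0"]

# A consumer that concatenates partial-string payloads byte-wise and
# decodes once sees a valid string regardless of where the splits fall.
joined = b"".join(fragments)
text = joined.decode("utf-8")  # decodes cleanly
assert len(text) == 4  # four codepoints

# Decoding fragments independently is what breaks on mid-codepoint splits.
try:
    fragments[0].decode("utf-8")
except UnicodeDecodeError:
    print("fragment alone is not valid UTF-8")
```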
> I clearly never tested this.
I do have this test: https://github.com/ziglang/zig/blob/a655c15c4004d553ea462652f69acd37e4514f79/lib/std/json/scanner_test.zig#L352 but it always puts the non-ASCII codepoints at the start of the string, which misses the part where we drop partial strings. :facepalm:
> which misses the part where we drop partial strings
Actually, the test was reproducing the issue, but failing to catch it in the test assertions. I just forgot to enumerate the partial string token types in `expectEqualTokens()`. :facepalm: :facepalm:
Zig Version
0.14.0-dev.839+a931bfada
Steps to Reproduce and Observed Behavior
`names.json` should be located next to the following test:
Run the test and observe the failures:
The following experiments were done:

1. `std.json.parseFromSlice` => Pass
2. `FixedBufferStream` => Fail
3. `BufferedReader(4096, FixedBufferStream)` (buffer boundary before fail point in JSON) => Fail
4. `BufferedReader(8192, FixedBufferStream)` (buffer boundary after fail point in JSON) => Fail
5. Minimizing `names.json` to just the failing string. => Pass

Some general observations:

- … `names.json` at which this unicode string occurs. This implies to me that it is something related to either the parser's state machine or buffering. The latter is less likely considering experiments 6 and 7.

Expected Behavior
The following command should illustrate what is expected to be read from that field. You can ignore the trailing newline (0a) that is caused by `jq`:

`std.json.parseFromTokenSource` should correctly handle unicode text within a string without corruption or truncation.