ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
34.09k stars 2.49k forks source link

Add reader.readLine family of functions to handle CR/LF and line ends on all platforms. #6754

Open IridescentRose opened 3 years ago

IridescentRose commented 3 years ago

I found this out while trying to use ZPM to add a package. This is not a ZPM bug as I found, but rather a bug in Zig's std.mem.concat.

Below is a simplified version of the error

const std = @import("std");

pub fn main() !void {
    const name = try std.io.getStdIn().reader().readUntilDelimiterAlloc(std.heap.page_allocator, '\n', 512);
    defer std.heap.page_allocator.free(name);
    std.debug.warn("\n", .{});

    var path = try std.mem.concat(std.heap.page_allocator, u8, &[_][]const u8{
        "packages/",
        name,
        ".json",
    });

    std.debug.warn("{}\n", .{path});
}

Input: myPackage Output on Ubuntu 20.04: packages/myPackage.zig Output on Windows: .jsonges/myPackage.zig

This means that the ".json" is overwriting the beginning of the buffer. What's weirder is that when iterating through the entire buffer when copying, every character is printed as the correct string packages/myPackage.zig but the returned result is the other bugged version.

IridescentRose commented 3 years ago
fn myConcat(allocator: *std.mem.Allocator, comptime T: type, slices: []const []const T) ![]T {
    if (slices.len == 0) return &[0]T{};

    var total_len: usize = 0;
    for (slices) |slice| {
        total_len += slice.len;
    }

    const buf = try allocator.alloc(T, total_len);
    errdefer allocator.free(buf);

    var buf_index: usize = 0;

    var sliceIdx: usize = 0;
    while(sliceIdx < slices.len): (sliceIdx += 1) {
        var i: usize = 0;
        while(i < slices[sliceIdx].len): (i += 1){
            buf[buf_index] = slices[sliceIdx][i];
            buf_index += 1;
            std.debug.print("{c}\n",.{buf[buf_index - 1]});
        }
    }

    return buf[0..];
}

This was my attempt to solve it. The debug print writes everything correctly and obviously buf_index is never reset to the beginning for this bug to occur.

LemonBoy commented 3 years ago

Odd bug

It's not a bug, it's a Windows feature :) readUntilDelimiterAlloc is reading until a \n is found... but Windows uses \r\n for its newlines! When you read in myPackage.zig what you really get is myPackage.zig\r and when the full path is built (ps. use fs.path.joinSep) you get something like packages/myPackage.zig\r.json.

Whenever you print the string to a terminal the \r shifts the cursor back to the beginning of the line, that's why the .json part is misplaced.

IridescentRose commented 3 years ago

That both explains it and also makes it more confusing. Perhaps we could create a new method like readUntilNewLineAlloc or even more simply readLineAlloc that handled it on a per-platform basis for simplicity?

Tetralux commented 3 years ago

This is why Zig needs a {q} format specifier. 😁

Tetralux commented 3 years ago

Cross-reference to https://github.com/ziglang/zig/issues/6756.

LemonBoy commented 3 years ago

Perhaps we could create a new method like readUntilNewLineAlloc or even more simply readLineAlloc that handled it on a per-platform basis for simplicity?

I like the idea of a readLineAlloc (what about readLineArrayList? and readLineOrEof?) but it should either recognize and chop away both CRLF and LF sequences or take an explicit parameter to control this behaviour, you can have files with LF endings on Windows and files with CRLF endings on Linux.

IridescentRose commented 3 years ago

I think detection should be the main feature since you could make it explicit with the existing untilDelimiter call

LemonBoy commented 3 years ago

I think detection should be the main feature since you could make it explicit with the existing untilDelimiter call

That's why I suggested it should remove both CRLF and LF endings, this way you get a simple method that reads a platform-independent line.

And readUntilDelimiterAlloc accepts a single byte as delimiter, reading CRLF requires you to read the line and cut off the trailing CR.

fivemoreminix commented 3 years ago

There's already a comptime delimiter constant in the standard library that could be used to drive a readLineAlloc or whatever

drfuchs commented 3 years ago

40+ years of experience with handling text files cross-platform speaking here: "On all platforms, the standard read-a-line functions should always consider CR, LF, and CRLF to be line terminators (and yes, that means that if you see a CR, also eat a single LF if one appears immediately after it). Also, keep in mind that there might not be any of these characters at the end of the last line in a text file." Doing it this way means that all programs will automatically handle text files that have been transferred from disparate OSes by disparate means.

Now, get off my lawn.

data-man commented 3 years ago

should always consider CR, LF, and CRLF to be line terminators

It was before Unicode era. :) From https://en.wikipedia.org/wiki/Newline#Unicode:

The Unicode standard defines a number of characters that conforming applications should recognize as line terminators: LF: Line Feed, U+000A VT: Vertical Tab, U+000B FF: Form Feed, U+000C CR: Carriage Return, U+000D CR+LF: CR (U+000D) followed by LF (U+000A) NEL: Next Line, U+0085 LS: Line Separator, U+2028 PS: Paragraph Separator, U+2029

drfuchs commented 3 years ago

Indeed! Note that the Unicode list of line terminators is a strict superset of mine, including the CR/LF/CRLF complication. And, hey, even back in the day we had FF (which was seldom seen, and typically caused a "clear screen" operation when displayed, and an "eject page" when printed), and even VT (even more infrequent in the wild, and nobody knew what it was really supposed to do, particularly as differentiated from LF). But NEL, LS, PS? You kids these days, with your fancy emo-gees.

data-man commented 3 years ago

But NEL, LS, PS? You kids these days, with your fancy emo-gees.

E.g., D supports them.