ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
35.16k stars 2.56k forks source link

`std.io.AnyWriter` has very high overhead #21566

Open silversquirl opened 1 month ago

silversquirl commented 1 month ago

std.io.AnyWriter (and hence std.io.GenericWriter, since that uses AnyWriter internally) has a large amount of overhead in tight loops, as it hampers inlining.

The following code snippet demonstrates an extreme case of the issue, but it comes up a lot in practice too:

pub fn main() !void {
    // Benchmark: write 1 MiB one byte at a time to /dev/null, first through
    // std.io.BufferedWriter, then through a hand-rolled local buffer.
    const devnull = try std.fs.openFileAbsolute("/dev/null", .{ .mode = .write_only });
    var timer = try std.time.Timer.start();
    {
        // Write 1MiB to /dev/null, one byte at a time, using std.io.BufferedWriter
        var bufw = std.io.bufferedWriter(devnull.writer());
        const w = bufw.writer();
        for (0..1 << 20) |i| {
            try w.writeByte(@truncate(i));
        }
        // Drain whatever is still sitting in the BufferedWriter's buffer.
        try bufw.flush();

        std.debug.print("BufferedWriter: {}\n", .{std.fmt.fmtDuration(timer.read())});
    }

    {
        timer.reset();
        // Write 1MiB to /dev/null, one byte at a time, using a custom buffering solution
        var buf: [4096]u8 = undefined;
        var idx: usize = 0;
        const w = devnull.writer();
        for (0..1 << 20) |i| {
            buf[idx] = @truncate(i);
            idx += 1;

            // Flush the local buffer whenever it fills.
            if (idx >= buf.len) {
                try w.writeAll(&buf);
                idx = 0;
            }
        }
        // Write out the final partial buffer.
        try w.writeAll(buf[0..idx]);

        std.debug.print("Custom buffering: {}\n", .{std.fmt.fmtDuration(timer.read())});
    }
}

const std = @import("std");

EDIT (by @mlugg): the original numbers here were incorrect, since OP accidentally used -Doptimize with zig run. Accurate numbers from running this on my system are as follows:

[mlugg@nebula test]$ zig run repro.zig -OReleaseFast
BufferedWriter: 488.15us
Custom buffering: 272.103us

Using a custom buffering solution rather than std.io.BufferedWriter results in an approximately 2x speedup, simply by avoiding the overhead of std.io.AnyWriter.

This pretty much defeats the whole point of having BufferedWriter in the first place, since AnyWriter destroys the good write performance it would otherwise provide. (EDIT @mlugg: this sentence was written based on incorrect performance numbers, which showed a wrongly exaggerated result.)

marler8997 commented 1 month ago

Interesting stuff. I'm getting the same numbers on my macbook air and Windows desktop. I tested the performance on some older releases of zig down to 0.10.0 and found the fastest to be 0.11.0. I copied the implementation from 0.11.0 into the example and it takes about 60% of the time compared to the new implementation that goes through AnyWriter, still far off from the "no writer" benchmark which is around 8% of the time. In addition I wrote a theoretical MetaWriter which gets rid of writeFn altogether and instead queries the underlying Context type to call an appropriate function and I got it to about 15% of the runtime (about 2x slower than having no writer). Here's that code:

const std = @import("std");

pub fn main() !void {
    // Open a write-only sink so the write syscall itself is as cheap as
    // possible: NUL on Windows, /dev/null elsewhere.
    const devnull = try std.fs.openFileAbsolute(
        if (@import("builtin").os.tag == .windows) "\\\\.\\NUL" else "/dev/null",
        .{ .mode = .write_only },
    );

    var timer = try std.time.Timer.start();

    // Run the whole suite 10 times so warm-up and variance are visible.
    for (0 .. 10) |_| {
        timer.reset();
        try testNoWriter(devnull);
        std.debug.print("No Writer: {}\n", .{std.fmt.fmtDuration(timer.read())});
        timer.reset();
        try testBufferedWriter(devnull);
        std.debug.print("BufferedWriter: {}\n", .{std.fmt.fmtDuration(timer.read())});
        timer.reset();
        try testBufferedWriterZig_0_11_0(devnull);
        std.debug.print("BufferedWriter(zig 0.11.0): {}\n", .{std.fmt.fmtDuration(timer.read())});
        timer.reset();
        try testBufferedWriterMeta(devnull);
        std.debug.print("BufferedWriterMeta: {}\n", .{std.fmt.fmtDuration(timer.read())});
    }
}

// Each example writes 1MiB to /dev/null, one byte at a time

// Benchmark today's std.io.BufferedWriter, whose writer interface routes
// every call through AnyWriter.
fn testBufferedWriter(devnull: std.fs.File) !void {
    var buffered = std.io.bufferedWriter(devnull.writer());
    const writer = buffered.writer();
    var n: usize = 0;
    while (n < 1 << 20) : (n += 1) {
        try writer.writeByte(@truncate(n));
    }
    try buffered.flush();
}
// Benchmark the 0.11.0-era BufferedWriter, whose writer dispatches through a
// comptime-known writeFn rather than AnyWriter.
fn testBufferedWriterZig_0_11_0(devnull: std.fs.File) !void {
    var buffered = zig_0_11_0.BufferedWriter(4096, std.fs.File.Writer){ .unbuffered_writer = devnull.writer() };
    const writer = buffered.writer();
    var n: usize = 0;
    while (n < 1 << 20) : (n += 1) {
        try writer.writeByte(@truncate(n));
    }
    try buffered.flush();
}
// Benchmark the experimental MetaWriter-based buffered writer, which forwards
// writeByte directly to the context instead of going through a write slice.
fn testBufferedWriterMeta(devnull: std.fs.File) !void {
    var buffered = BufferedWriterMeta(4096, std.fs.File.Writer){ .unbuffered_writer = devnull.writer() };
    const writer = buffered.writer();
    var n: usize = 0;
    while (n < 1 << 20) : (n += 1) {
        try writer.writeByte(@truncate(n));
    }
    try buffered.flush();
}
// Baseline: manual buffering with no writer abstraction in the hot loop at all.
fn testNoWriter(devnull: std.fs.File) !void {
    const out = devnull.writer();
    var scratch: [4096]u8 = undefined;
    var used: usize = 0;
    var n: usize = 0;
    while (n < 1 << 20) : (n += 1) {
        scratch[used] = @truncate(n);
        used += 1;
        // Flush the scratch buffer each time it fills.
        if (used == scratch.len) {
            try out.writeAll(&scratch);
            used = 0;
        }
    }
    // Drain the final partial buffer.
    try out.writeAll(scratch[0..used]);
}

const mem = std.mem;
const assert = std.debug.assert;

// The old BufferedWriter from 0.11.0, reproduced here so its performance can
// be compared against the current std implementation. Kept verbatim on
// purpose: fidelity to the historical code is the point of the benchmark.
const zig_0_11_0 = struct {
    /// The 0.11.0 writer interface: dispatch goes through the comptime-known
    /// `writeFn`, so there is no AnyWriter-style indirection at runtime.
    pub fn GenericWriter(
        comptime Context: type,
        comptime WriteError: type,
        comptime writeFn: fn (context: Context, bytes: []const u8) WriteError!usize,
    ) type {
        return struct {
            context: Context,

            const Self = @This();
            pub const Error = WriteError;

            /// Single write; may consume fewer than `bytes.len` bytes.
            pub fn write(self: Self, bytes: []const u8) Error!usize {
                return writeFn(self.context, bytes);
            }

            /// Keep calling `write` until every byte has been accepted.
            pub fn writeAll(self: Self, bytes: []const u8) Error!void {
                var index: usize = 0;
                while (index != bytes.len) {
                    index += try self.write(bytes[index..]);
                }
            }

            /// Write one byte by wrapping it in a one-element array.
            pub fn writeByte(self: Self, byte: u8) Error!void {
                const array = [1]u8{byte};
                return self.writeAll(&array);
            }
        };
    }

    /// Fixed-size buffering wrapper around any writer type.
    pub fn BufferedWriter(comptime buffer_size: usize, comptime WriterType: type) type {
        return struct {
            unbuffered_writer: WriterType,
            buf: [buffer_size]u8 = undefined,
            end: usize = 0, // number of valid bytes currently held in `buf`

            pub const Error = WriterType.Error;
            pub const Writer = GenericWriter(*Self, Error, write);

            const Self = @This();

            /// Push all buffered bytes to the underlying writer and reset.
            pub fn flush(self: *Self) !void {
                try self.unbuffered_writer.writeAll(self.buf[0..self.end]);
                self.end = 0;
            }

            pub fn writer(self: *Self) Writer {
                return .{ .context = self };
            }

            pub fn write(self: *Self, bytes: []const u8) Error!usize {
                if (self.end + bytes.len > self.buf.len) {
                    try self.flush();
                    // Writes larger than the whole buffer bypass it entirely.
                    if (bytes.len > self.buf.len)
                        return self.unbuffered_writer.write(bytes);
                }

                const new_end = self.end + bytes.len;
                @memcpy(self.buf[self.end..new_end], bytes);
                self.end = new_end;
                return bytes.len;
            }
        };
    }
};

// An even more "generic" version of the old GenericWriter that lets the
// implementation define more of the interface than just a writeFn: the
// wrapper inspects the Context type at comptime and forwards each operation
// to the most specific method the context provides.
pub fn MetaWriter(
    comptime Context: type,
    comptime WriteError: type,
) type {
    return struct {
        context: Context,

        const Self = @This();
        pub const Error = WriteError;

        /// Write a single byte, preferring the context's own `writeByte`
        /// (no temporary buffer) over the generic one-byte `write` fallback.
        pub fn writeByte(self: Self, byte: u8) Error!void {
            // An example of how MetaWriter could forward functions to different implementations
            if (comptime @hasDecl(@typeInfo(Context).Pointer.child, "writeByte")) {
                try self.context.writeByte(byte);
            } else {
                const buf = [_]u8{ byte };
                const len = try self.context.write(&buf);
                // A one-byte write must either fail or accept the byte.
                std.debug.assert(len == 1);
            }
        }

    };
}
/// Buffered writer whose `Writer` is a MetaWriter: byte writes dispatch
/// straight to the `writeByte` method below instead of round-tripping through
/// a one-byte slice and an indirect write call.
pub fn BufferedWriterMeta(comptime buffer_size: usize, comptime WriterType: type) type {
    return struct {
        unbuffered_writer: WriterType,
        buf: [buffer_size]u8 = undefined,
        end: usize = 0, // number of valid bytes currently held in `buf`

        pub const Error = WriterType.Error;
        pub const Writer = MetaWriter(*Self, Error);

        const Self = @This();

        /// Push all buffered bytes to the underlying writer and reset.
        pub fn flush(self: *Self) !void {
            try self.unbuffered_writer.writeAll(self.buf[0..self.end]);
            self.end = 0;
        }

        pub fn writer(self: *Self) Writer {
            return .{ .context = self };
        }

        /// Fast path picked up by MetaWriter via @hasDecl: no slice needed.
        pub fn writeByte(self: *Self, b: u8) Error!void {
            if (self.end == self.buf.len) {
                try self.flush();
            }
            self.buf[self.end] = b;
            self.end += 1;
        }

        pub fn write(self: *Self, bytes: []const u8) Error!usize {
            if (self.end + bytes.len > self.buf.len) {
                try self.flush();
                // Writes larger than the whole buffer bypass it entirely.
                if (bytes.len > self.buf.len)
                    return self.unbuffered_writer.write(bytes);
            }

            const new_end = self.end + bytes.len;
            @memcpy(self.buf[self.end..new_end], bytes);
            self.end = new_end;
            return bytes.len;
        }
    };
}

P.S. I'm fairly surprised that the LLVM optimizer isn't able to optimize AnyWriter into what the no writer example is doing. Seems like there's lots of room to improve optimization in Zig.

AndrewKraevskii commented 1 month ago

-Doptimize=ReleaseFast doesn't optimize the code when using the run command. The correct flag is -O ReleaseFast.

image Props to @amp-59 for noticing it

marler8997 commented 1 month ago

Thanks @AndrewKraevskii...major facepalm. Here's the numbers on my windows machine with -O ReleaseFast instead of -Doptimize=ReleaseFast.

No Writer: 342.3us
BufferedWriter: 471.8us
BufferedWriter(zig 0.11.0): 475.2us
BufferedWriterMeta: 1.102ms

Even more interesting...BufferedWriterMeta is now the slowest version! Computers are weird.

mlugg commented 1 month ago

OP also accidentally used -Doptimize rather than -O. I'm editing her post to give the numbers I get for her repro, with -OReleaseFast.

mlugg commented 1 month ago

Godbolt indicates that the example in the original issue is actually being correctly devirtualized and inlined by LLVM. This raises two obvious questions:

AndrewKraevskii commented 1 month ago

I tried manually inlining the call to write so that no AnyWriter is present, and performance didn't change, so devirtualization is indeed not the problem. I also noticed that the "Custom buffering" version uses AnyWriter in its w.writeAll() call anyway.

pub fn main() !void {
    // Same benchmark as the original report, but with the writeByte call
    // manually replaced by a direct write() of a one-byte slice, to rule out
    // AnyWriter devirtualization as the cause of the slowdown.
    const devnull = try std.fs.openFileAbsolute("/dev/null", .{ .mode = .write_only });
    var timer = try std.time.Timer.start();
    {
        timer.reset();
        // Write 1MiB to /dev/null, one byte at a time, using std.io.BufferedWriter
        var bufw = std.io.bufferedWriter(devnull.writer());
        const w = bufw.writer();
        for (0..1 << 20) |i| {
            // Change is here vvvvv
            // - try w.writeByte(@truncate(i));
            _ = try w.write(&.{@truncate(i)});
        }
        try bufw.flush();

        std.debug.print("BufferedWriter: {}\n", .{std.fmt.fmtDuration(timer.read())});
    }
    {
        timer.reset();
        // Write 1MiB to /dev/null, one byte at a time, using a custom buffering solution
        var buf: [4096]u8 = undefined;
        var idx: usize = 0;
        const w = devnull.writer();
        for (0..1 << 20) |i| {
            buf[idx] = @truncate(i);
            idx += 1;

            // Flush the local buffer whenever it fills.
            if (idx >= buf.len) {
                try w.writeAll(&buf);
                idx = 0;
            }
        }
        // Write out the final partial buffer.
        try w.writeAll(buf[0..idx]);

        std.debug.print("Custom buffering: {}\n", .{std.fmt.fmtDuration(timer.read())});
    }
}

const std = @import("std");
ethernetsellout commented 1 month ago

I went further and manually removed the .any() calls from GenericWriter, replacing them with the code from AnyWriter and replacing indirect calls with direct calls to writeFn. Again, the performance is unaffected. So this issue is totally unrelated to the performance of using AnyWriter, making this all the more peculiar. I think we need better benchmarks to judge this properly, but if this is a genuine performance issue it appears to be a more fundamental issue with Writer and not with dynamic dispatch in the standard library.

What is the performance impact if devirtualization fails, and is this likely to happen?

There is a severe problem with devirtualization in the case of standard library interfaces that are nested, such as ArenaAllocator or GeneralPurposeAllocator. However, GenericWriter doesn't actually store a child interface like those do, so I doubt devirtualization is ever an issue here.

castholm commented 1 month ago

To throw another oddity into the mix, on Windows, if I change "/dev/null" to "\\\\.\\NUL" and run with zig run repro.zig -OReleaseFast, I get the following measurements:

BufferedWriter: 4.835ms  
Custom buffering: 412.5us

But if I switch out writeByte for writeAll (still writing a single byte at a time)

         var bufw = std.io.bufferedWriter(devnull.writer());
         const w = bufw.writer();
         for (0..1 << 20) |i| {
-            try w.writeByte(@truncate(i));
+            try w.writeAll((&@as(u8, @truncate(i)))[0..1]);
         }
         try bufw.flush();

and run again, I get a 4x speed increase:

BufferedWriter: 1.109ms  
Custom buffering: 411.8us

When compiling for x86_64-linux-gnu and running with WSL, both the writeByte and writeAll versions take virtually the same amount of time (1.1ms), so whatever is causing this discrepancy appears to be Windows-specific.

nektro commented 1 month ago

related/duplicate https://github.com/ziglang/zig/issues/17985

gooncreeper commented 1 month ago

This is somewhat a two way issue, bad optimization and a lack of optimization opportunity. I have created a smaller reproduction of the problem,

// Stand-in for an opaque, non-inlinable write call (plays the role of
// File.write in the original issue).
extern fn file_write(u64, *const u64) void;

pub fn main() void {
    {
        // Counter lives in a struct whose address escapes via &s.y, so the
        // compiler keeps s.x in memory across the opaque calls (see the
        // generated loads/stores in the listing below).
        var s: struct {
            x: u64 = 0,
            y: u64 = 0,
        } = .{};

        for (0..8192) |_| {
            s.x += 1;
            if (s.x % 16 == 0)
                file_write(s.x, &s.y);
        }
    } {
        // Same loop with free locals: only &y escapes, so x can live in a
        // register for the whole loop.
        var x: u64 = 0;
        var y: u64 = 0;

        for (0..8192) |_| {
            x += 1;
            if (x % 16 == 0)
                file_write(x, &y);
        }
    }
}

Looking at the generated code for the hot loop in the first block

.LBB3_2:
        dec     r14
        je      .LBB3_3
.LBB3_1:
**      mov     rdi, qword ptr [rsp]
        inc     rdi
**      mov     qword ptr [rsp], rdi
        test    dil, 15
        jne     .LBB3_2
        mov     rsi, rbx
        call    file_write@PLT
        jmp     .LBB3_2

The main culprit of the performance loss here is the loads and stores of the counter on every iteration. The reason why it has to be stored to the struct is because file_write could use @fieldParentPtr to access it. This is the performance impact of not having File.write inlined. On the other hand, the optimizer is also a culprit since it could move the loads and stores to only before and after the file_write function calls.

The hot loop of the second block is fine since @fieldParentPtr can't be used to access x

.LBB3_5:
        inc     rbx
        cmp     rbx, 8193
        je      .LBB3_6
.LBB3_4:
        test    bl, 15
        jne     .LBB3_5
        mov     rdi, rbx
        mov     rsi, r14
        call    file_write@PLT
        jmp     .LBB3_5

Edit

I just realized @fieldParentPtr keeps the const qualifier for the parent struct, so this is actually all bad optimization, though @fieldParentPtr is still likely the root of the problem.