ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
34.67k stars 2.53k forks source link

Blake3 implementation in std is slow #15375

Open matklad opened 1 year ago

matklad commented 1 year ago

Zig Version

0.10.1

Steps to Reproduce and Observed Behavior

See https://github.com/matklad/benchmarks/tree/1caed4cfdd2285f2f1946f592fc1d492fd9ed836/blake3 for a reproducible Rust vs Zig benchmark. The representative Result is

Rust
time  = 1.62s
MiB/s = 3859

Zig + ReleaseFast
time  = 6.074s
MiB/S = 1028

(this is pure Rust impl, assembly/C impls is a little bit further faster than that)

Zig results are approximately the same between 0.10.1 and 0.11.0-dev.2680+a1aa55ebe

This is relatively important for us at TigerBeetle.

Expected Behavior

Be as fast as Rust. Ideally, be as fast as asm.

jedisct1 commented 1 year ago

The current BLAKE3 implementation is just a direct port of the reference implementation. It's definitely slow.

As an alternative, Aegis MAC that we already have in stdlib is super fast on any platform with AES acceleration.

Also, KangarooTwelve is being standardized, is as fast as BLAKE3 or even faster on some platforms, and is very likely to be chosen over BLAKE3 in modern protocols. I expect it to be in the stdlib soon, with great performance.

marler8997 commented 1 year ago

I had compared Blake3 to Blake2b512 and Sha256a a while back with the following results:

// Blake3      : 1
// Blake2b512  : 1.3 x slower than Blake3
// Sha256      : 4 x slower than Blake3

Good to know we can improve to go about 4x faster than that even!

jedisct1 commented 1 year ago

Currently:

Apple M1:

BLAKE3 AEGIS-128L MAC AEGIS-128X MAC
1.56 GB/s 15.6 GB/s 14.5 GB/s

AMD EPYC (Zen 2)

BLAKE3 AEGIS-128L MAC AEGIS-128X MAC
4.90 GB/s 20.7 GB/s 31.9 GB/s

More recent benchmarks:

image

recursivetree commented 1 year ago

The rust implementation seems to more or less just import the c implementation, you could do that too in zig.

rpkak commented 4 months ago

I wrote an Blake3 implementation is Zig. Here are the benchmark results (hyperfine):

Start file:

```zig const std = @import("std"); const config = @import("config"); const blake3 = @import("./root.zig"); const c = @cImport({ @cInclude("blake3.h"); }); const stderr = std.io.getStdErr(); const stdout = std.io.getStdOut(); fn run() !void { if (std.os.argv.len != 2) { return error.OneArgRequired; } const fd = try std.posix.openZ(std.os.argv[1], .{}, undefined); defer std.posix.close(fd); const stat = try std.posix.fstat(fd); const area = try std.posix.mmap(null, @intCast(stat.size), 1, .{ .TYPE = .PRIVATE }, fd, 0); defer std.posix.munmap(area); var out: [32]u8 = undefined; if (config.c) { var hasher: c.blake3_hasher = undefined; c.blake3_hasher_init(&hasher); c.blake3_hasher_update(&hasher, area.ptr, area.len); c.blake3_hasher_finalize(&hasher, &out, out.len); } else if (config.std) { std.crypto.hash.Blake3.hash(area, &out, .{}); } else { blake3.Blake3(.{}).hash(area, &out); } try stdout.writer().print("{s}\n", .{std.fmt.bytesToHex(out, .lower)}); } pub fn main() u8 { run() catch |err| { stderr.writer().print("Error: {}\n", .{err}) catch unreachable; return 1; }; return 0; } ```

c-blake3 uses config.c = true and config.std = false and compiles blake3.c, blake3_dispatch.c, blake3_portable.c, blake3_sse2.c, blake3_sse41.c and blake3_avx2.c.

c-asm-blake3 uses config.c = true and config.std = false and compiles blake3.c, blake3_dispatch.c, blake3_portable.c, blake3_sse2_x86-64_unix.S, blake3_sse41_x86-64_unix.S and blake3_avx2_x86-64_unix.S.

zig-std-blake3 uses config.c = false and config.std = true.

zig-blake3 uses config.c = false and config.std = false (my impl).

Benchmark 1: b3sum /home/user/Downloads/archlinux-2024.04.01-x86_64.iso --num-threads 1
  Time (mean ± σ):     238.3 ms ±   7.2 ms    [User: 213.4 ms, System: 22.6 ms]
  Range (min … max):   230.7 ms … 265.2 ms    500 runs

Benchmark 2: ./zig-out/bin/c-asm-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
  Time (mean ± σ):     229.7 ms ±   6.5 ms    [User: 207.3 ms, System: 22.1 ms]
  Range (min … max):   223.9 ms … 268.0 ms    500 runs

Benchmark 3: ./zig-out/bin/c-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
  Time (mean ± σ):     268.3 ms ±  27.2 ms    [User: 245.8 ms, System: 22.2 ms]
  Range (min … max):   246.2 ms … 362.6 ms    500 runs

Benchmark 4: ./zig-out/bin/zig-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
  Time (mean ± σ):     240.3 ms ±   2.6 ms    [User: 217.4 ms, System: 22.6 ms]
  Range (min … max):   238.7 ms … 287.0 ms    500 runs

Benchmark 5: ./zig-out/bin/zig-std-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
  Time (mean ± σ):     925.5 ms ±  13.7 ms    [User: 903.9 ms, System: 21.1 ms]
  Range (min … max):   918.1 ms … 1064.6 ms    500 runs

Summary
  ./zig-out/bin/c-asm-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso ran
    1.04 ± 0.04 times faster than b3sum /home/user/Downloads/archlinux-2024.04.01-x86_64.iso --num-threads 1
    1.05 ± 0.03 times faster than ./zig-out/bin/zig-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
    1.17 ± 0.12 times faster than ./zig-out/bin/c-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
    4.03 ± 0.13 times faster than ./zig-out/bin/zig-std-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso

CPU: Intel Core 12700K

hyperfine.json

I will open a PR soon.

JonathanHallstrom commented 2 weeks ago

@rpkak How did it go? I can't find the code or a PR