Open matklad opened 1 year ago
The current BLAKE3 implementation is just a direct port of the reference implementation. It's definitely slow.
As an alternative, the AEGIS MAC that we already have in the stdlib is super fast on any platform with AES acceleration.
Also, KangarooTwelve is being standardized, is as fast as BLAKE3 or even faster on some platforms, and is very likely to be chosen over BLAKE3 in modern protocols. I expect it to be in the stdlib soon, with great performance.
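For readers who have not used the AEGIS MAC mentioned above, here is a minimal sketch of one-shot and streaming use. It assumes `std.crypto.auth.aegis.Aegis128LMac` exposes `create`, `init`, `update`, `final`, `key_length` and `mac_length`, as in recent Zig releases. The key and the helper names `oneShot` and `streamed` are made up for illustration. Note that this is a keyed MAC, so it only replaces BLAKE3 where a secret key is available.

```zig
const std = @import("std");
const Aegis128LMac = std.crypto.auth.aegis.Aegis128LMac;

const Tag = [Aegis128LMac.mac_length]u8;
const Key = [Aegis128LMac.key_length]u8;

/// Hypothetical helper: authenticate `msg` in a single call.
fn oneShot(key: *const Key, msg: []const u8) Tag {
    var tag: Tag = undefined;
    Aegis128LMac.create(&tag, msg, key);
    return tag;
}

/// Hypothetical helper: authenticate `msg` fed in two chunks.
fn streamed(key: *const Key, msg: []const u8) Tag {
    var tag: Tag = undefined;
    var state = Aegis128LMac.init(key);
    const split = msg.len / 2;
    state.update(msg[0..split]);
    state.update(msg[split..]);
    state.final(&tag);
    return tag;
}

pub fn main() void {
    // Illustrative fixed key only; a real key must come from a CSPRNG or a KDF.
    const key = [_]u8{0x42} ** Aegis128LMac.key_length;
    const msg = "example message";

    const a = oneShot(&key, msg);
    const b = streamed(&key, msg);
    std.debug.print("tags match: {}\n", .{std.mem.eql(u8, &a, &b)});
}
```

The streaming form is the one that matters for large inputs, since it avoids holding the whole message in memory.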
I compared Blake3 to Blake2b512 and Sha256 a while back, with the following results:
// Blake3 : 1
// Blake2b512 : 1.3 x slower than Blake3
// Sha256 : 4 x slower than Blake3
Good to know we can still get about 4x faster than that!
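A comparison like the one above can be reproduced with a small timing loop. This is only a sketch, not the benchmark that produced those ratios: the 1 MiB buffer, the round count and the `throughput` helper are assumptions, and it targets Zig 0.11 or later (it uses `@floatFromInt`). Build with `-OReleaseFast`, otherwise the numbers mostly measure debug-mode overhead.

```zig
const std = @import("std");
const hash = std.crypto.hash;

/// Hashes `data` `rounds` times with `Hasher` and returns throughput in MiB/s.
fn throughput(comptime Hasher: type, data: []const u8, rounds: usize) !f64 {
    var out: [Hasher.digest_length]u8 = undefined;
    var timer = try std.time.Timer.start();
    var i: usize = 0;
    while (i < rounds) : (i += 1) {
        Hasher.hash(data, &out, .{});
        // Keep the optimizer from discarding the otherwise unused digest.
        std.mem.doNotOptimizeAway(&out);
    }
    const elapsed_ns: f64 = @floatFromInt(timer.read());
    const total_bytes: f64 = @floatFromInt(data.len * rounds);
    return (total_bytes / (1024.0 * 1024.0)) / (elapsed_ns / 1e9);
}

pub fn main() !void {
    // Arbitrary input: 1 MiB of a constant byte, hashed 64 times per algorithm.
    var data: [1 << 20]u8 = undefined;
    @memset(&data, 0xAB);
    const rounds = 64;

    const b3 = try throughput(hash.Blake3, &data, rounds);
    const b2 = try throughput(hash.blake2.Blake2b512, &data, rounds);
    const sha = try throughput(hash.sha2.Sha256, &data, rounds);

    std.debug.print("Blake3     : {d:.0} MiB/s\n", .{b3});
    std.debug.print("Blake2b512 : {d:.0} MiB/s ({d:.2}x slower than Blake3)\n", .{ b2, b3 / b2 });
    std.debug.print("Sha256     : {d:.0} MiB/s ({d:.2}x slower than Blake3)\n", .{ sha, b3 / sha });
}
```

The ratios depend heavily on the CPU: SHA-256 in particular may come out much closer to BLAKE3 on machines with SHA extensions.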
Currently:
| BLAKE3 | AEGIS-128L MAC | AEGIS-128X MAC |
|---|---|---|
| 1.56 GB/s | 15.6 GB/s | 14.5 GB/s |

| BLAKE3 | AEGIS-128L MAC | AEGIS-128X MAC |
|---|---|---|
| 4.90 GB/s | 20.7 GB/s | 31.9 GB/s |

(`std.crypto.auth.aegis.Aegis128LMac`, `Aegis128XMac`)

More recent benchmarks:
The Rust implementation seems to more or less just import the C implementation. You could do that in Zig too.
I wrote a BLAKE3 implementation in Zig. Here are the benchmark results (hyperfine):
```zig
const std = @import("std");
const config = @import("config");
const blake3 = @import("./root.zig");
const c = @cImport({
    @cInclude("blake3.h");
});

const stderr = std.io.getStdErr();
const stdout = std.io.getStdOut();

fn run() !void {
    if (std.os.argv.len != 2) {
        return error.OneArgRequired;
    }

    // Map the whole input file into memory, read-only (prot 1 is PROT_READ).
    const fd = try std.posix.openZ(std.os.argv[1], .{}, undefined);
    defer std.posix.close(fd);
    const stat = try std.posix.fstat(fd);
    const area = try std.posix.mmap(null, @intCast(stat.size), 1, .{ .TYPE = .PRIVATE }, fd, 0);
    defer std.posix.munmap(area);

    var out: [32]u8 = undefined;

    // The implementation under test is selected at build time via `config`.
    if (config.c) {
        var hasher: c.blake3_hasher = undefined;
        c.blake3_hasher_init(&hasher);
        c.blake3_hasher_update(&hasher, area.ptr, area.len);
        c.blake3_hasher_finalize(&hasher, &out, out.len);
    } else if (config.std) {
        std.crypto.hash.Blake3.hash(area, &out, .{});
    } else {
        blake3.Blake3(.{}).hash(area, &out);
    }

    try stdout.writer().print("{s}\n", .{std.fmt.bytesToHex(out, .lower)});
}

pub fn main() u8 {
    run() catch |err| {
        stderr.writer().print("Error: {}\n", .{err}) catch unreachable;
        return 1;
    };
    return 0;
}
```
`c-blake3` uses `config.c = true` and `config.std = false` and compiles `blake3.c`, `blake3_dispatch.c`, `blake3_portable.c`, `blake3_sse2.c`, `blake3_sse41.c` and `blake3_avx2.c`.

`c-asm-blake3` uses `config.c = true` and `config.std = false` and compiles `blake3.c`, `blake3_dispatch.c`, `blake3_portable.c`, `blake3_sse2_x86-64_unix.S`, `blake3_sse41_x86-64_unix.S` and `blake3_avx2_x86-64_unix.S`.

`zig-std-blake3` uses `config.c = false` and `config.std = true`.

`zig-blake3` uses `config.c = false` and `config.std = false` (my impl).
Benchmark 1: b3sum /home/user/Downloads/archlinux-2024.04.01-x86_64.iso --num-threads 1
Time (mean ± σ): 238.3 ms ± 7.2 ms [User: 213.4 ms, System: 22.6 ms]
Range (min … max): 230.7 ms … 265.2 ms 500 runs
Benchmark 2: ./zig-out/bin/c-asm-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
Time (mean ± σ): 229.7 ms ± 6.5 ms [User: 207.3 ms, System: 22.1 ms]
Range (min … max): 223.9 ms … 268.0 ms 500 runs
Benchmark 3: ./zig-out/bin/c-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
Time (mean ± σ): 268.3 ms ± 27.2 ms [User: 245.8 ms, System: 22.2 ms]
Range (min … max): 246.2 ms … 362.6 ms 500 runs
Benchmark 4: ./zig-out/bin/zig-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
Time (mean ± σ): 240.3 ms ± 2.6 ms [User: 217.4 ms, System: 22.6 ms]
Range (min … max): 238.7 ms … 287.0 ms 500 runs
Benchmark 5: ./zig-out/bin/zig-std-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
Time (mean ± σ): 925.5 ms ± 13.7 ms [User: 903.9 ms, System: 21.1 ms]
Range (min … max): 918.1 ms … 1064.6 ms 500 runs
Summary
./zig-out/bin/c-asm-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso ran
1.04 ± 0.04 times faster than b3sum /home/user/Downloads/archlinux-2024.04.01-x86_64.iso --num-threads 1
1.05 ± 0.03 times faster than ./zig-out/bin/zig-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
1.17 ± 0.12 times faster than ./zig-out/bin/c-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
4.03 ± 0.13 times faster than ./zig-out/bin/zig-std-blake3 /home/user/Downloads/archlinux-2024.04.01-x86_64.iso
CPU: Intel Core 12700K
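The "times faster" figures in the summary can be checked by hand from the means and standard deviations above. The sketch below assumes the ± values come from first-order error propagation on the ratio of two means, which does reproduce the printed numbers. The `Run`, `Ratio` and `relative` names are made up for this illustration.

```zig
const std = @import("std");

const Run = struct { name: []const u8, mean_ms: f64, sigma_ms: f64 };
const Ratio = struct { value: f64, err: f64 };

/// Ratio other/base with uncertainty propagated from both standard deviations:
/// err = ratio * sqrt((sigma_base/mean_base)^2 + (sigma_other/mean_other)^2).
fn relative(base: Run, other: Run) Ratio {
    const value = other.mean_ms / base.mean_ms;
    const rel_base = base.sigma_ms / base.mean_ms;
    const rel_other = other.sigma_ms / other.mean_ms;
    return .{
        .value = value,
        .err = value * @sqrt(rel_base * rel_base + rel_other * rel_other),
    };
}

pub fn main() void {
    // Means and standard deviations copied from the hyperfine output above.
    const base = Run{ .name = "c-asm-blake3", .mean_ms = 229.7, .sigma_ms = 6.5 };
    const others = [_]Run{
        .{ .name = "b3sum", .mean_ms = 238.3, .sigma_ms = 7.2 },
        .{ .name = "zig-blake3", .mean_ms = 240.3, .sigma_ms = 2.6 },
        .{ .name = "c-blake3", .mean_ms = 268.3, .sigma_ms = 27.2 },
        .{ .name = "zig-std-blake3", .mean_ms = 925.5, .sigma_ms = 13.7 },
    };
    for (others) |run| {
        const r = relative(base, run);
        std.debug.print("{s}: {d:.2} ± {d:.2} times slower than {s}\n", .{ run.name, r.value, r.err, base.name });
    }
}
```

For `zig-std-blake3` this gives 925.5 / 229.7 ≈ 4.03 with an uncertainty of about 0.13, matching the last line of the summary.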
I will open a PR soon.
@rpkak How did it go? I can't find the code or a PR.
Zig Version
0.10.1
Steps to Reproduce and Observed Behavior
See https://github.com/matklad/benchmarks/tree/1caed4cfdd2285f2f1946f592fc1d492fd9ed836/blake3 for a reproducible Rust vs Zig benchmark. The representative result is
(this is the pure Rust implementation; the assembly/C implementations are a little faster still)
Zig results are approximately the same between 0.10.1 and 0.11.0-dev.2680+a1aa55ebe
This is relatively important for us at TigerBeetle.
Expected Behavior
Be as fast as Rust. Ideally, be as fast as asm.