tiehuis opened 1 year ago
To clarify as well, this would involve changing the `init` function to take the same `Options` parameter.
For anyone wondering why xxhash uses `anytype` currently: the idea is to allow passing both arrays and slices (you can't use arbitrary types). Passing an array should allow the compiler to optimize better, especially for small input lengths. However, currently (or at least when the xxhash perf PR was merged; see there for more details) the compiler is inconsistent on whether it optimizes the array implementation better than slices (for speed, at least).
Overall I'm not 100% convinced the `anytype` API is worth the API complexity (i.e. `anytype` doesn't make clear what you can pass). Perhaps when someone really cares about performance and they know they are hashing `[N]u8`, they should just make a customized implementation.
I agree that we should have more consistency, but I don't think I like having an options struct that pretty much always has 0 or 1 fields. I don't think there is much chance that a given hash will have more options added over time, which removes the utility of structs to allow non-breaking API extension via fields with default arguments.
I'd prefer to have some standard functions covering the common options, probably just

pub fn hash(input: []const u8) T1
pub fn hashSeeded(input: []const u8, seed: T0) T1

and have hashes that use other exotic options just implement other `hash*` variants; currently the only one would be `CityHash64` for the two-seed version, and a future XXH3 implementation could have `hashWithSecret()`. I think this achieves consistency between hash functions (assuming there is a sensible default seed for wyhash; xxhash can just use 0) and is (to me) more ergonomic at the call-site.
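To make the shape concrete, here is a sketch of those two standard functions wrapped around the current `std.hash.Wyhash`. This is my own illustration, not actual std code, and `MyWyhash` is a hypothetical name; a default seed of 0 is assumed.

```zig
const std = @import("std");

// Hypothetical wrapper illustrating the proposed pair of standard
// functions; std.hash.Wyhash.hash(seed, input) is the real std API.
pub const MyWyhash = struct {
    // Default-seed variant: 0 is assumed as the sensible default.
    pub fn hash(input: []const u8) u64 {
        return std.hash.Wyhash.hash(0, input);
    }

    // Explicit-seed variant.
    pub fn hashSeeded(input: []const u8, seed: u64) u64 {
        return std.hash.Wyhash.hash(seed, input);
    }
};
```

With this shape every hash exposes the same `hash()` signature, and seeded hashes add `hashSeeded()` rather than changing `hash()` itself.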
The main reason I would think to use an options struct is if it makes generic/comptime use easier, but for that you either need to use a lowest-common-denominator set of fields (which will most likely just be the seed) or do hash-specific logic, which could then call a different function anyway. If there really is a generic pattern that would benefit from an options struct, I'd just throw in a standard `hashOptions([]const u8, options: Options) T`, but I can't think of a generic pattern that really benefits from this without actually trying to implement something.
Before we solidify this, I'd like to offer my two cents as someone with two years of hashing algorithm implementation and design experience in Go.
My biggest issue with Go's standard hash function implementation was its inflexibility; hash functions of the future will work differently, and many of them will be highly parallelized and/or support arbitrary amounts of output (in addition to the standard requirement that it must support an arbitrary number of input bytes/bits). Some hash function designs support changing internal state sizes as well. For these reasons, the standard interface that Zig ends up using should avoid requiring defined constants for output hash sizes and block sizes.
It's been a hot minute since I worked with Go as I am now studying Zig, but if I remember correctly those were the main issues I encountered conforming my code to Go's rather fixed standard hash interface.
More generally, I just think all the standard interfaces for hashers, encoders, encryptors, PRNGs, etc should be extremely flexible and still work when we need to reinvent how these things work. Flexibility is very core to this language and we shouldn't codify any standards that limit that. TY, this has been my slightly inebriated Ted Talk while watching One Piece in the background.
I don't like the "options" thing.
This makes for one more thing I need to look up when I'm writing a function call (as opposed to just relying on autocomplete).
@renxida This is an autocomplete issue. Unless the options type is `anytype`, available options can be shown.
It's a decent way to avoid breaking changes. Even the existing functions can evolve over time. Having options allows for incremental additions without breaking the API.
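As a sketch of that argument (hypothetical names, not the std API): because every field of an options struct can carry a default, a new field can be appended later without breaking existing call sites.

```zig
const std = @import("std");

// Hypothetical options struct: because every field has a default,
// call sites written as .{} or .{ .seed = 1 } keep compiling if a
// new defaulted field (e.g. a secret) is added later.
const HashOptions = struct {
    seed: u64 = 0,
};

fn hashWithOptions(input: []const u8, options: HashOptions) u64 {
    return std.hash.Wyhash.hash(options.seed, input);
}
```

For example, `hashWithOptions("abc", .{})` keeps working even after the struct grows new fields.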
Status quo: I would expect we will still reorder some of the arguments but retain the implementation-specific details in the main `hash` function as required, without bothering to make a standard hash signature across all types.
Adler32.hash(input: []const u8) u32
CityHash64.hash(str: []const u8) u64
CityHash64.hashWithSeed(seed: u64, str: []const u8) u64
CityHash64.hashWithSeeds(seed0: u64, seed1: u64, str: []const u8) u64
Wyhash.hash(seed: u64, input: []const u8) u64
XxHash64.hash(seed: u64, input: anytype) u64
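For illustration (my own sketch, not std code): this signature spread means any generic helper has to special-case each family before it can call `hash` at all.

```zig
const std = @import("std");

// Sketch of the pain caused by the differing signatures above: a
// generic helper cannot call every hasher uniformly and must
// special-case the seeded ones (switching on types works at comptime).
fn hashAny(comptime H: type, input: []const u8) u64 {
    return switch (H) {
        // seed-first signatures
        std.hash.Wyhash, std.hash.XxHash64 => H.hash(0, input),
        // seedless signatures (Adler32's u32 result coerces to u64)
        else => H.hash(input),
    };
}
```

Every new hash with yet another signature would need yet another arm here, which is the consistency problem in miniature.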
I'm also struggling here with the variety of signatures for hash functions and think the current crop of hash functions could be simplified and have expanded compile-time/runtime usefulness. Besides the variety of signatures, the `hash()` functions usually work in both the compile time and runtime context, while `update()` and `final()` only work in the runtime context. Being able to construct a hash from multiple pieces in the compile time context is important. Let's look at the `Fnv1a` hash function, as specified.
pub const Fnv1a_32 = Fnv1a(u32, 0x01000193, 0x811c9dc5);
pub const Fnv1a_64 = Fnv1a(u64, 0x100000001b3, 0xcbf29ce484222325);
pub const Fnv1a_128 = Fnv1a(u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d);
fn Fnv1a(comptime T: type, comptime prime: T, comptime offset: T) type {
    return struct {
        const Self = @This();

        value: T,

        pub fn init() Self {
            return Self{ .value = offset };
        }

        pub fn update(self: *Self, input: []const u8) void {
            for (input) |b| {
                self.value ^= b;
                self.value *%= prime;
            }
        }

        pub fn final(self: *Self) T {
            return self.value;
        }

        pub fn hash(input: []const u8) T {
            var c = Self.init();
            c.update(input);
            return c.final();
        }
    };
}
This `Fnv1a` standard library code could be reduced to two functions, `hash()` and `done()`, allowing the hash to be constructed at compile time or runtime in a variety of scenarios. Let's look at the reduced code, then at the different usage scenarios it makes possible for constructing the hash.
pub const Fnv1a_32 = Fnv1a(u32, 0x01000193, 0x811c9dc5);
pub const Fnv1a_64 = Fnv1a(u64, 0x100000001b3, 0xcbf29ce484222325);
pub const Fnv1a_128 = Fnv1a(u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d);
fn Fnv1a(comptime T: type, comptime prime: T, comptime offset: T) type {
    return struct {
        const Self = @This();

        value: T,

        pub fn done(self: Self) T {
            return self.value;
        }

        pub fn hash(self: ?Self, input: []const u8) Self {
            const s = self orelse Self{ .value = offset };
            var v = s.value;
            for (input) |b| {
                v ^= b;
                v *%= prime;
            }
            return Self{ .value = v };
        }
    };
}
These are the usage scenarios for constructing single-valued and multi-valued hashes.
Single-valued hash:

const fnv1a = Fnv1a_32;
const fb1 = fnv1a.hash(null, "foobarbaz").done();

Multi-valued hash, separate invocations:

const fnv1a = Fnv1a_32;
const foo = fnv1a.hash(null, "foo");
const bar = fnv1a.hash(foo, "bar");
const baz = fnv1a.hash(bar, "baz");
const fb2 = baz.done();

Multi-valued, nested invocation:

const fnv1a = Fnv1a_32;
const fb3 = fnv1a.hash(fnv1a.hash(fnv1a.hash(null, "foo"), "bar"), "baz").done();

Multi-valued, chained invocation:

const fnv1a = Fnv1a_32;
const fb4 = fnv1a.hash(null, "foo").hash("bar").hash("baz").done();
All these usage scenarios produce the same resulting hash at compile time or runtime without having separate functions that only deal with compile time or runtime. Here is the complete example:
pub const Fnv1a_32 = Fnv1a(u32, 0x01000193, 0x811c9dc5);
pub const Fnv1a_64 = Fnv1a(u64, 0x100000001b3, 0xcbf29ce484222325);
pub const Fnv1a_128 = Fnv1a(u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d);
fn Fnv1a(comptime T: type, comptime prime: T, comptime offset: T) type {
    return struct {
        const Self = @This();

        value: T,

        pub fn done(self: Self) T {
            return self.value;
        }

        pub fn hash(self: ?Self, input: []const u8) Self {
            const s = self orelse Self{ .value = offset };
            var v = s.value;
            for (input) |b| {
                v ^= b;
                v *%= prime;
            }
            return Self{ .value = v };
        }
    };
}

const fnv1a = Fnv1a_32;
const fb1 = fnv1a.hash(null, "foobarbaz").done();
const foo = fnv1a.hash(null, "foo");
const bar = fnv1a.hash(foo, "bar");
const baz = fnv1a.hash(bar, "baz");
const fb2 = baz.done();
const fb3 = fnv1a.hash(fnv1a.hash(fnv1a.hash(null, "foo"), "bar"), "baz").done();
const fb4 = fnv1a.hash(null, "foo").hash("bar").hash("baz").done();
const debug = @import("std").debug;

pub fn main() !void {
    const fo6 = fnv1a.hash(null, "foo");
    const br6 = fnv1a.hash(fo6, "bar");
    var fb5: u32 = undefined;
    var fb6: u32 = undefined;
    var fb7: u32 = undefined;
    var fb8: u32 = undefined;
    fb5 = fnv1a.hash(null, "foobarbaz").done();
    fb6 = fnv1a.hash(br6, "baz").done();
    fb7 = fnv1a.hash(fnv1a.hash(fnv1a.hash(null, "foo"), "bar"), "baz").done();
    fb8 = fnv1a.hash(null, "foo").hash("bar").hash("baz").done();
    debug.print("fb1 = 0x{x}\nfb2 = 0x{x}\nfb3 = 0x{x}\nfb4 = 0x{x}\n", .{ fb1, fb2, fb3, fb4 });
    debug.print("fb5 = 0x{x}\nfb6 = 0x{x}\nfb7 = 0x{x}\nfb8 = 0x{x}\n", .{ fb5, fb6, fb7, fb8 });
    return;
}
That `done()` function wouldn't be very convenient, and would be very unusual compared to hash function interfaces found everywhere else.
The incremental interfaces work fine in a comptime context, no need to invent a new interface:
const hash = comptime h: {
    var h = Fnv1a_32.init();
    h.update("foo");
    h.update("bar");
    break :h h.final();
};
True, but with the redesigned code I proposed, you don't need to do anything different for a multi-valued hash in the compile time context. What you pointed out above essentially amounts to a different interface for compile time vs. runtime for a multi-valued hash. It produces the same result, but why should the programmer have a different mental model for compile time vs. runtime in this situation? What I proposed is simply `Fnv1a_32.init().hash("foo").hash("bar").done();` in either context. You could change `done()` to `final()` and add an `init()` as I just did. However, the original proposal requested that all hash functions be changed to have a more consistent interface. As I indicated, I believe they could be, with two or three functions: `init()`, `hash()` and either `done()` or `final()`, with better usage scenarios. For reference, I made a small change to include an `init()` for hashes that are a little bulky to set up.
fn Fnv1a(comptime T: type, comptime prime: T, comptime offset: T) type {
    return struct {
        const Self = @This();

        value: T,

        pub fn init() Self {
            return Self{ .value = offset };
        }

        pub fn done(self: Self) T {
            return self.value;
        }

        pub fn hash(self: ?Self, input: []const u8) Self {
            const s = self orelse Self.init();
            var v = s.value;
            for (input) |b| {
                v ^= b;
                v *%= prime;
            }
            return Self{ .value = v };
        }
    };
}
The interface is already exactly the same, just with a `comptime` keyword added if we want to evaluate the expression at comptime. `init`/`update`/`final` is a well-known pattern implemented in all programming languages, so being conservative here is nice to avoid confusing users.
A function with an `strtok()`-like interface that must be called the first time with `null` is a footgun.
Barring compiler optimizations, this also means that the state would be copied every time, which can have a performance impact at runtime.
This also prevents any further API simplification, such as adding options.
But there's something else. With the proposed API, on a call to `hash()`, it's impossible to know if the input is complete or partial. With the existing API, and unlike the streaming API (`init`/`update`/`final`), `hash()` assumes that the input is complete, so there's no need to duplicate it into a buffer, which enables optimizations. I think `xxhash` already takes advantage of this.
And if you really want to use a single function at comptime instead of a comptime block, concatenating slices works (`a ++ b ++ c`).
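For example, the multi-part case can be handled by the one-shot function at comptime with `++` on the (comptime-known) parts, using the real `std.hash.Fnv1a_32`:

```zig
const std = @import("std");

// At comptime the parts can be concatenated with ++, so the one-shot
// hash() covers the multi-part case without any streaming calls.
const parts = std.hash.Fnv1a_32.hash("foo" ++ "bar" ++ "baz");
const whole = std.hash.Fnv1a_32.hash("foobarbaz");

comptime {
    // Concatenated parts hash identically to the whole string.
    std.debug.assert(parts == whole);
}
```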
The current interfaces are indeed inconsistent, because they require different parameters at initialization time. But that can be solved by adding an `Options` parameter to `init()`.
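A rough sketch of that direction (a hypothetical wrapper with invented names, not the actual std API), layering an `Options` struct over `std.hash.Wyhash.init`:

```zig
const std = @import("std");

// Hypothetical uniform shape: init() always takes an Options struct,
// so per-hash differences (here, only the seed) live in the struct's
// defaults instead of in the function signature.
pub const Options = struct { seed: u64 = 0 };

pub const WyhashConsistent = struct {
    inner: std.hash.Wyhash,

    pub fn init(options: Options) WyhashConsistent {
        return .{ .inner = std.hash.Wyhash.init(options.seed) };
    }

    pub fn update(self: *WyhashConsistent, input: []const u8) void {
        self.inner.update(input);
    }

    pub fn final(self: *WyhashConsistent) u64 {
        return self.inner.final();
    }
};
```

A hash with no parameters would simply declare an empty `Options`, keeping `init(options: Options)` uniform across all hashers.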
There are other inconsistencies, such as how functions and parameters are named, that can be easily fixed.
> The interface is already exactly the same, just with a `comptime` keyword added if we want to evaluate the expression at comptime. `init`/`update`/`final` is a well known pattern implemented in all programming languages, so being conservative here is nice to avoid confusing users.
Uhh... no, the interfaces are not exactly the same. They are not the same for compile time invocation, nor are they the same for runtime invocation.
The existing standard library interfaces for hash functions have 3 different interfaces:

1. A `hash()` function for either compile time or runtime.

const hash = @import("std").hash;
const c = hash.Fnv1a_32.hash("foobarbaz");
var v = hash.Fnv1a_32.hash("foobarbaz");

2. `init()`, `update()` and `final()` functions for runtime.

const hash = @import("std").hash;
var h = hash.Fnv1a_32.init();
h.update("foo");
h.update("bar");
h.update("baz");
var v = h.final();

3. The `comptime` keyword plus `init()`, `update()` and `final()` functions for compile time.

const hash = @import("std").hash;
const v = comptime h: {
    var h = Fnv1a_32.init();
    h.update("foo");
    h.update("bar");
    h.update("baz");
    break :h h.final();
};
In order to explain the existing standard library interface you need to describe these three usage scenarios to the programmer. Compare that to the proposed interface: single or multi-valued hash, compile time or runtime, using `init()`, `hash()` and `final()` (aka `done()`) functions.
const cs = Fnv1a_32.init().hash("foobarbaz").final();
const cm = Fnv1a_32.init().hash("foo").hash("bar").hash("baz").final();
var vs = Fnv1a_32.init().hash("foobarbaz").final();
var vm = Fnv1a_32.init().hash("foo").hash("bar").hash("baz").final();
The proposed interface can be explained to the programmer as: invoke `init()` to set up the hash algorithm, chain and invoke `hash()` zero or more times to hash a byte array, chain and invoke `final()` to retrieve the final value.
> A function with an `strtok()`-like interface that must be called the first time with `null` is a footgun.
In the proposed interface, with chaining, you don't need to invoke with `null`. In a prior example I short-circuited using the `init()` function, which wasn't correct, because it is required to be called to ensure a hash algorithm is properly set up. The existing and proposed functions for multi-valued hashes require calling `init()` first to properly set up the hash algorithm and `final()` last to retrieve the final value.

Fnv1a_32.init().hash("foobarbaz").final();
> Barring compiler optimizations, this also means that the state would be copied every time, which can have a performance impact at runtime.
Performance is always a concern. My initial goal, based on the original proposal, was to have all hash functions share the same signatures and a consistent interface so you could pass them into other functions just like allocators. I didn't look at performance, but because we could endlessly discuss hypotheticals, the only realistic way to settle a performance question is to benchmark. I created a simple benchmark for the existing and proposed interfaces using single and multi-valued hashes in the runtime context. The difference between them is insignificant.
Iteration | Proposed Single-Valued (ms) | Proposed Multi-Valued (ms) | Existing Single-Valued (ms) | Existing Multi-Valued (ms) |
---|---|---|---|---|
01 | 28614 | 37626 | 31508 | 33637 |
02 | 28662 | 37670 | 31501 | 33412 |
03 | 28835 | 37623 | 31567 | 33332 |
04 | 28699 | 37658 | 31494 | 33468 |
05 | 28773 | 37648 | 31488 | 33423 |
06 | 28639 | 37647 | 31601 | 33345 |
07 | 28660 | 37636 | 31497 | 33389 |
08 | 28656 | 37642 | 31503 | 33468 |
09 | 28671 | 37653 | 31539 | 33441 |
10 | 28699 | 37659 | 31485 | 33585 |
Interestingly, the existing standard library `Fnv1a` hash interface is slower than the proposed interface for the single-valued case, which is probably the most widely used scenario. I ran the benchmark multiple times and the result seems to hold; however, as I said, the difference between the existing and proposed interfaces is insignificant for both single and multi-valued contexts.
Benchmark code:
const std = @import("std");
const debug = std.debug;
const hash = std.hash;
const time = std.time;

pub const Fnv1a_32 = Fnv1a(u32, 0x01000193, 0x811c9dc5);
pub const Fnv1a_64 = Fnv1a(u64, 0x100000001b3, 0xcbf29ce484222325);
pub const Fnv1a_128 = Fnv1a(u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d);

fn Fnv1a(comptime T: type, comptime prime: T, comptime offset: T) type {
    return struct {
        value: T,

        const Self = @This();

        pub fn init() Self {
            return Self{ .value = offset };
        }

        pub fn final(self: Self) T {
            return self.value;
        }

        pub fn hash(self: ?Self, input: []const u8) Self {
            const s = self orelse Self.init();
            var v = s.value;
            for (input) |b| {
                v ^= b;
                v *%= prime;
            }
            return Self{ .value = v };
        }
    };
}
pub fn pr_single_value(fnv1a: anytype) !void {
    for (0..10) |_| {
        var tsBegin: i64 = undefined;
        var tsEnd: i64 = undefined;
        tsBegin = time.milliTimestamp();
        for (0..1_000_000_000) |_| {
            var v: u32 = undefined;
            v = fnv1a.init().hash("foobarbaz").final();
        }
        tsEnd = time.milliTimestamp();
        debug.print("proposed runtime single-valued time diff = {d}\n", .{tsEnd - tsBegin});
    }
    return;
}
pub fn pr_multi_value(fnv1a: anytype) !void {
    for (0..10) |_| {
        var tsBegin: i64 = undefined;
        var tsEnd: i64 = undefined;
        tsBegin = time.milliTimestamp();
        for (0..1_000_000_000) |_| {
            var v: u32 = undefined;
            v = fnv1a.init().hash("foo").hash("bar").hash("baz").final();
        }
        tsEnd = time.milliTimestamp();
        debug.print("proposed runtime multi-valued time diff = {d}\n", .{tsEnd - tsBegin});
    }
    return;
}
pub fn er_single_value(fnv1a: anytype) !void {
    for (0..10) |_| {
        var tsBegin: i64 = undefined;
        var tsEnd: i64 = undefined;
        tsBegin = time.milliTimestamp();
        for (0..1_000_000_000) |_| {
            var v: u32 = undefined;
            v = fnv1a.hash("foobarbaz");
        }
        tsEnd = time.milliTimestamp();
        debug.print("existing runtime single-valued time diff = {d}\n", .{tsEnd - tsBegin});
    }
    return;
}
pub fn er_multi_value(fnv1a: anytype) !void {
    for (0..10) |_| {
        var tsBegin: i64 = undefined;
        var tsEnd: i64 = undefined;
        tsBegin = time.milliTimestamp();
        for (0..1_000_000_000) |_| {
            var v: u32 = undefined;
            var h = fnv1a.init();
            h.update("foo");
            h.update("bar");
            h.update("baz");
            v = h.final();
        }
        tsEnd = time.milliTimestamp();
        debug.print("existing runtime multi-valued time diff = {d}\n", .{tsEnd - tsBegin});
    }
    return;
}
pub fn main() !void {
    // benchmark proposed interface
    try pr_single_value(Fnv1a_32);
    try pr_multi_value(Fnv1a_32);
    // benchmark existing interface
    try er_single_value(hash.Fnv1a_32);
    try er_multi_value(hash.Fnv1a_32);
    return;
}
> This also prevents any further API simplification, such as adding options.
This argument is called future-proofing an API. The crux of the argument is that you cannot do something now because it is perceived to prevent some possible future outcome that may or may not occur. It's an indication of being resistant to change. Let's put that argument aside and look at adding options. Some hashes allow their seed, primes and other values to be changed for specific use cases. So how would options be added? The existing standard library `Fnv1a` hash function provides some insight into handling this:
pub const Fnv1a_32 = Fnv1a(u32, 0x01000193, 0x811c9dc5);
pub const Fnv1a_64 = Fnv1a(u64, 0x100000001b3, 0xcbf29ce484222325);
pub const Fnv1a_128 = Fnv1a(u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d);
Basically, options are passed into the function that returns the generic type. In this specific case it was done because the seed and offset values differ for different output widths, but other hash functions may have only one output width while allowing the seed and other values to change for unique use cases (CRC32 comes to mind). Those changes can be handled in exactly the same fashion as the existing standard library `Fnv1a` hash function. Need to pass a lot of options into the function? Then create an options struct for them. These types of options occur at the algorithm level.

It is possible that some options may occur at the invocation level. How would these be handled? Neither the standard library hash functions, I believe, nor the proposed interface have invocation-level options. However, these could simply be handled by passing the options into the required `init()` function, if such options are someday really needed.
This is basically a non-issue with both the existing or proposed interfaces.
> But there's something else. With the proposed API, on a call to `hash()`, it's impossible to know if the input is complete or partial. With the existing API, and unlike the streaming API (`init`/`update`/`final`), `hash()` assumes that the input is complete, so there's no need to duplicate it into a buffer, which enables optimizations.
You are not comparing apples to apples. The existing `hash()` function is one-and-done. There is no equivalent in the proposed interface, because there is no need. The existing hash interfaces have this specialized one-and-done `hash()` function, which caused the existing standard library to have two separate interfaces that need to be explained to the programmer. However, if you compare the multi-valued context for the existing and proposed interfaces, they both must follow the same `init()`, `update()` or `hash()`, `final()` pattern.
Existing interface for multi-valued hashes:
const hash = @import("std").hash;
var h = hash.Fnv1a_32.init();
h.update("foo");
h.update("bar");
h.update("baz");
var v = h.final();
Proposed interface for multi-valued hashes:
var vm = Fnv1a_32.init().hash("foo").hash("bar").hash("baz").final();
Both multi-valued interfaces require you to call `init()` to ensure internal state for the hash algorithm is set up properly, and both require you to call `final()` to retrieve the final value. With each interface you don't know the intermediate value between calls to `update()` (existing) or `hash()` (proposed); this is why you are required to call `final()`, since the hash algorithm may need to adjust the calculated value before it becomes final. The existing standard library's `hash()` function just invokes the `init()`, `update()`, `final()` pattern. With the proposed interface you don't need to explain another separate interface to the programmer, just: invoke `init()` to set up the hash algorithm, chain and invoke `hash()` zero or more times to hash a byte array, chain and invoke `final()` to retrieve the final value.
If I understand correctly, all of that can be summarized as "`update()` could return `Self`" (`update()` shouldn't be renamed to `hash()`; it would be misleading, as H(H(A) ‖ B) ≠ H(A ‖ B)). And for benchmarks, don't forget `doNotOptimizeAway()` and functions with large states.
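For instance, a release-mode loop can be kept honest with `std.mem.doNotOptimizeAway` (available in recent Zig versions) and a nanosecond clock. A minimal sketch, not the thread's actual benchmark:

```zig
const std = @import("std");

// Minimal benchmark skeleton: doNotOptimizeAway keeps the optimizer
// from deleting the loop body in ReleaseFast, and nanoTimestamp gives
// finer granularity than milliseconds.
pub fn main() void {
    const start = std.time.nanoTimestamp();
    for (0..1_000_000) |_| {
        const v = std.hash.Fnv1a_32.hash("foobarbaz");
        std.mem.doNotOptimizeAway(v);
    }
    const elapsed = std.time.nanoTimestamp() - start;
    std.debug.print("elapsed = {d} ns\n", .{elapsed});
}
```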
Barring performance concerns, I'm not convinced that it would be an improvement, or that it addresses an actual issue. Writing the `comptime` keyword in a comptime context is not a different interface.
FNV is probably not the most representative hash to stress this API. It's a very simple accumulator that operates per-byte, so internally it is not that complex; there are no small-key optimizations. I would be reasonably confident that a block-based hashing function (e.g. Wyhash), which requires keeping internal buffers for the `init`, `update`, `final` flow, will be noticeably worse perf-wise for small keys due to unnecessary internal data copying. A benchmark proving (or disproving) this would be welcome.
I assumed as well, when I tried the benchmarking code, that you used debug mode and not release when reproducing. It would be valuable to get a data point in release mode, although you will need to change the clock setup, as milliseconds are not sufficient granularity.
See also #15916 for some high-level goals for hashing.
> FNV is probably not the most representative hash to stress this API. It's a very simple accumulator that operates per-byte, so internally it is not that complex; there are no small-key optimizations. I would be reasonably confident that a block-based hashing function (e.g. Wyhash), which requires keeping internal buffers for the `init`, `update`, `final` flow, will be noticeably worse perf-wise for small keys due to unnecessary internal data copying. A benchmark proving (or disproving) this would be welcome.
The short answer is that the proposed interface, in this regard, is not much different than the implementation in the existing standard library. So if there is a performance issue with block hash functions in the standard library, the proposed interface will not change that. Let's survey the differences between the standard library implementation interface and the proposed interface.
The standard library interface (existing):

- `hash()` - a specialized one-and-done interface that hashes a byte array to produce a hash value. Has the same code usage scenario for compile time vs. runtime. The implementation of this function is a wrapper for the second, multi-value interface.
- `init()`, `update()`, `final()` - an interface that hashes any number of byte arrays (multi-value) to produce a hash value. Has different code usage scenarios for compile time (requires a comptime block) vs. runtime.

Because the `hash()` interface is a wrapper for the multi-value interface, it requires changes whenever parameters are passed to the `init()` function. An example would be a hash algorithm that requires an allocator. Zig conventions dictate that the allocator should be passed in. The likely place that would be done is the `init()` function, which would require that the `hash()` function also be changed to accept the allocator so it can pass it to `init()`. There is a dependency between these 2 supported interfaces, and it is demonstrated in `Wyhash` with the `seed` parameter. This dependency causes differences in the hash interface signatures, which is why you started this thread.

The proposed interface:

- `init()`, `hash()`, `final()` - an interface that hashes any number of byte arrays (multi-value) to produce a hash value. Has the same code usage scenario for compile time vs. runtime.
- Removes the specialized one-and-done `hash()` function, which simplifies the programming model, and renames the existing `update()` function to `hash()`.

> I assumed as well when I tried the benchmarking code that you used debug mode and not release when reproducing. Would be valuable to get a data-point in release mode, although will need to change the clock setup as milliseconds are not sufficient granularity.
Yes, I used debug mode: Windows 10 64-bit with the latest SP, i7-10700 @ 2.9 GHz, 32 GB. I used milliseconds instead of nanoseconds to reduce any interference from background processes on the computer, which is running MySQL and some other heavyweight processes. I could change the benchmark to use nanoseconds and see whether that matters, as well as provide release timings.
> See also #15916 for some high-level goals for hashing.
Thanks for the pointer!!
> because the `hash()` interface is a wrapper for the multi-value interface it requires changes whenever parameters are passed to the `init()` function.
This assumption is false. It is not always a wrapper.
You can see this for example here: https://github.com/ziglang/zig/blob/20b9b54e6b276c993f723f5aa3fafea72409a4fe/lib/std/hash/wyhash.zig#L178-L197
The key benefit of `hash` is that the full hashing length is provided up-front; we can get around redundant memcpys and perform short-key optimizations.
Assembly comparing the two for reference: https://godbolt.org/z/En9KnGjsf
Just to note that the performance impact on short keys is significant, using the in-tree `benchmark.zig`.
(master =) $ ./benchmark --filter wyhash
wyhash
small keys: 32B 16499 MiB/s 540664525 Hashes/s [f3ac0179e7c891a1]
With `hash` replaced by an iterative wrapper:

--- a/lib/std/hash/wyhash.zig
+++ b/lib/std/hash/wyhash.zig
@@ -177,22 +177,8 @@ pub const Wyhash = struct {
pub fn hash(seed: u64, input: []const u8) u64 {
var self = Wyhash.init(seed);
-
- if (input.len <= 16) {
- self.smallKey(input);
- } else {
- var i: usize = 0;
- if (input.len >= 48) {
- while (i + 48 < input.len) : (i += 48) {
- self.round(input[i..][0..48]);
- }
- self.final0();
- }
- self.final1(input, i);
- }
-
- self.total_len = input.len;
- return self.final2();
+ self.update(input);
+ return self.final();
}
};
(master =) $ ./benchmark --filter wyhash
wyhash
small keys: 32B 2678 MiB/s 87777023 Hashes/s [f3ac0179e7c891a1]
> - uses a consistent programming model for both single and multi-value hashes, which does not require multiple code usage scenarios for compile time vs. runtime contexts.
While I disagree that using the `comptime` keyword when wanting to hash at comptime constitutes a different interface, note that your proposed interface does not remove this requirement in comparison to the current one. For top-level declarations the `comptime` keyword is not needed for either interface; the difference with your proposed interface is that chaining calls in a single statement removes the need for a block in a top-level declaration. The `comptime` keyword is required in the body of a function to perform hashing at comptime for both.
Being able to chain calls requires all the inputs to be already known when evaluating the expression. Otherwise, an intermediary variable is still required anyways.
Problem
If we inspect std.hash and look at their main hashing routine, we currently see a lot of different signatures.
Proposal
Consolidate these APIs into the following form so we have consistency between all hashing functions:
Example usage for some of the above then may look something like:
I would suggest the options struct vs. a bare int (type T) for future compatibility with any other parameters (even though most currently may only have custom seed options). An example where we may have a non-seed parameter is XXH3, which allows a custom secret: https://github.com/tiehuis/zig-xxh3/blob/c58f2c7f8731dd79559d988706bfae8a56c2a872/xxh3.zig#L150. A similar model is already used in the `crypto` namespace: https://github.com/ziglang/zig/blob/bb2eb4443034d0a8c051f5eef184062078502ae8/lib/std/crypto/sha3.zig#L128
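For reference, that `crypto` pattern looks like this at the call site, using the real `std.crypto.hash.sha2.Sha256` (the empty `.{}` is its `Options` value):

```zig
const std = @import("std");

// The crypto namespace already passes an Options struct (empty here)
// to hash(), keeping one signature even for algorithms that later
// grow parameters.
pub fn main() void {
    const Sha256 = std.crypto.hash.sha2.Sha256;
    var out: [Sha256.digest_length]u8 = undefined;
    Sha256.hash("foobarbaz", &out, .{});
    std.debug.print("first digest byte: {d}\n", .{out[0]});
}
```

An algorithm that gains a parameter later only grows a defaulted field in its `Options` struct; `.{}` call sites are unaffected.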