ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

proposal: a more consistent std.hash api #17048

Open tiehuis opened 1 year ago

tiehuis commented 1 year ago

Problem

If we inspect std.hash and look at each hasher's main hashing routine, we currently see a variety of different signatures.

Adler32.hash(input: []const u8) u32
CityHash64.hash(str: []const u8) u64
CityHash64.hashWithSeed(str: []const u8, seed: u64) u64
CityHash64.hashWithSeeds(str: []const u8, seed0: u64, seed1: u64) u64
Wyhash.hash(seed: u64, input: []const u8) u64
XxHash64.hash(seed: u64, input: anytype) u64

Proposal

Consolidate these APIs into the following form so we have consistency across all hashing functions:

// Options is a specific Options struct associated with that Hash function. T should be any unsigned integer.
pub fn hash(input: []const u8, options: Options) T

Example usage for some of the above then may look something like:

const buf = "test";

Adler32.hash(buf, .{});
CityHash64.hash(buf, .{});
CityHash64.hash(buf, .{ .seed = 0 });
CityHash64.hash(buf, .{ .seed_double = .{ 0, 1 } });
Wyhash.hash(buf, .{});
XxHash64.hash(buf, .{ .seed = 0 });

I would suggest an options struct rather than a bare integer (of type T) for future compatibility with any other parameters (even though most hashes currently offer only a custom seed). An example of a non-seed parameter is XXH3, which allows a custom secret: https://github.com/tiehuis/zig-xxh3/blob/c58f2c7f8731dd79559d988706bfae8a56c2a872/xxh3.zig#L150.
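As a rough sketch of the proposed shape (the `Hasher` wrapper, its field names, and the choice of Wyhash as a backing implementation are illustrative only, not part of the proposal):

```zig
const std = @import("std");

// Hypothetical sketch: a wrapper giving Wyhash the proposed
// `hash(input, options)` shape with a defaulted Options struct.
pub const Hasher = struct {
    pub const Options = struct {
        seed: u64 = 0,
    };

    pub fn hash(input: []const u8, options: Options) u64 {
        return std.hash.Wyhash.hash(options.seed, input);
    }
};

pub fn main() void {
    // `.{}` takes the default seed; a seed can be set explicitly.
    const a = Hasher.hash("test", .{});
    const b = Hasher.hash("test", .{ .seed = 42 });
    std.debug.print("0x{x} 0x{x}\n", .{ a, b });
}
```

Call sites that want defaults pass `.{}`, which is what makes later option additions non-breaking.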

A similar model is already used in the crypto namespace: https://github.com/ziglang/zig/blob/bb2eb4443034d0a8c051f5eef184062078502ae8/lib/std/crypto/sha3.zig#L128

tiehuis commented 1 year ago

To clarify as well, this would involve changing the init function to take the same Options parameter.

dweiller commented 1 year ago

For anyone wondering why xxhash uses anytype currently, the idea is to allow passing both arrays and slices (you can't use arbitrary types). Passing an array should allow the compiler to optimize better, especially for small input lengths. However currently (or at least when the xxhash perf PR was merged, see there for more details) the compiler is inconsistent on whether it optimizes the array implementation better than slices (for speed at least).
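To illustrate the current anytype parameter (assuming, as described above, that both fixed-size arrays and slices are accepted):

```zig
const std = @import("std");

// Both an array (comptime-known length) and a slice (runtime
// length) can be passed to the same anytype parameter; the hash
// result is the same, but the array form gives the compiler more
// information to optimize with.
pub fn main() void {
    const arr: [4]u8 = .{ 't', 'e', 's', 't' };
    const slice: []const u8 = &arr;

    const h1 = std.hash.XxHash64.hash(0, arr);
    const h2 = std.hash.XxHash64.hash(0, slice);
    std.debug.print("0x{x} 0x{x}\n", .{ h1, h2 });
}
```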

Overall I'm not 100% convinced the anytype API is worth the API complexity (i.e. anytype doesn't make clear what you can pass). Perhaps when someone really cares about performance and they know they are hashing [N]u8 they should just make a customized implementation.

dweiller commented 1 year ago

I agree that we should have more consistency, but I don't think I like having an options struct that pretty much always has 0 or 1 fields. I don't think there is much chance that a given hash will have more options added over time, which removes the utility of structs to allow non-breaking API extension via fields with default arguments.

I'd prefer to have some standard functions covering the common options, probably just

pub fn hash(input: []const u8) T1
pub fn hashSeeded(input: []const u8, seed: T0) T1

and have hashes that use other exotic options just implement other hash* variants; currently the only one would be CityHash64 for the two-seed version and a future XXH3 implementation could have hashWithSecret(). I think this achieves consistency between hash functions (assuming there is a sensible default seed for wyhash; xxhash can just use 0) and is (to me) more ergonomic at the call-site.

The main reason I would think to use an options struct is if it makes generic/comptime use easier, but for that you either need to use a lowest-common-denominator set of fields (which will most likely just be the seed) or do hash-specific logic, which could then call a different function anyway. If there really is a generic pattern that would benefit from an options struct I'd just throw in a standard hashOptions(input: []const u8, options: Options) T, but I can't think of a generic pattern that really benefits from this without actually trying to implement something.
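A sketch of this function-per-variant shape (wrapping std.hash.Wyhash purely for illustration; `MyHash` and its default seed are hypothetical):

```zig
const std = @import("std");

// Hypothetical hash type exposing the suggested pair of standard
// functions; exotic variants would add further hash* functions.
pub const MyHash = struct {
    const default_seed: u64 = 0;

    pub fn hash(input: []const u8) u64 {
        return std.hash.Wyhash.hash(default_seed, input);
    }

    pub fn hashSeeded(input: []const u8, seed: u64) u64 {
        return std.hash.Wyhash.hash(seed, input);
    }
};

pub fn main() void {
    std.debug.print("0x{x}\n", .{MyHash.hash("test")});
    std.debug.print("0x{x}\n", .{MyHash.hashSeeded("test", 1)});
}
```

At the call site there is no struct literal to fill in; the seed, when needed, is a plain argument.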

p7r0x7 commented 1 year ago

Before we solidify this, I'd like to offer my two cents as someone with two years of hashing algorithm implementation and design experience in Go.

My biggest issue with Go's standard hash function implementation was its inflexibility; hash functions of the future will work differently, and many of them will be highly parallelized and/or support arbitrary amounts of output (in addition to the standard requirement that it must support an arbitrary number of input bytes/bits). Some hash function designs support changing internal state sizes as well. For these reasons, the standard interface that Zig ends up using should avoid requiring defined constants for output hash sizes and block sizes.

It's been a hot minute since I worked with Go as I am now studying Zig, but if I remember correctly those were the main issues I encountered conforming my code to Go's rather fixed standard hash interface.

More generally, I just think all the standard interfaces for hashers, encoders, encryptors, PRNGs, etc should be extremely flexible and still work when we need to reinvent how these things work. Flexibility is very core to this language and we shouldn't codify any standards that limit that. TY, this has been my slightly inebriated Ted Talk while watching One Piece in the background.

renxida commented 11 months ago

I don't like the "options" thing.

This makes for one more thing I need to look up when I'm writing a function call (as opposed to just relying on autocomplete).

jedisct1 commented 11 months ago

@renxida This is an autocomplete issue. Unless the options type is anytype, available options can be shown.

It's a decent way to avoid breaking changes. Even the existing functions can evolve over time. Having options allows for incremental additions without breaking the API.
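A minimal illustration of that point (the `Options` struct, `use` function, and `secret` field are hypothetical names):

```zig
const std = @import("std");

pub const Options = struct {
    seed: u64 = 0,
    // Hypothetical later addition: because it has a default value,
    // existing `.{}` and `.{ .seed = n }` call sites keep compiling
    // unchanged after this field is introduced.
    secret: ?[]const u8 = null,
};

fn use(options: Options) u64 {
    _ = options.secret;
    return options.seed;
}

pub fn main() void {
    std.debug.print("{d} {d}\n", .{ use(.{}), use(.{ .seed = 7 }) });
}
```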

tiehuis commented 10 months ago

Status quo: I would expect we will still reorder some of the arguments, but retain the implementation-specific details in the main hash function as required, and not bother making a standard hash signature across all types.

Adler32.hash(input: []const u8) u32
CityHash64.hash(str: []const u8) u64
CityHash64.hashWithSeed(seed: u64, str: []const u8) u64
CityHash64.hashWithSeeds(seed0: u64, seed1: u64, str: []const u8) u64
Wyhash.hash(seed: u64, input: []const u8) u64
XxHash64.hash(seed: u64, input: anytype) u64

houghtonap commented 4 months ago

I'm also struggling here with the variety of signatures for hash functions and think the current crop of hash functions could be simplified while gaining compile-time/runtime usefulness. Besides the variety of signatures, the hash() functions usually work in both the compile-time and runtime contexts, while update() and final() only work in the runtime context. Being able to construct a hash from multiple pieces at compile time is important. Let's look at the Fnv1a hash function as currently specified.

pub const Fnv1a_32 = Fnv1a(u32, 0x01000193, 0x811c9dc5);
pub const Fnv1a_64 = Fnv1a(u64, 0x100000001b3, 0xcbf29ce484222325);
pub const Fnv1a_128 = Fnv1a(u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d);

fn Fnv1a(comptime T: type, comptime prime: T, comptime offset: T) type {
    return struct {
        const Self = @This();

        value: T,

        pub fn init() Self {
            return Self{ .value = offset };
        }

        pub fn update(self: *Self, input: []const u8) void {
            for (input) |b| {
                self.value ^= b;
                self.value *%= prime;
            }
        }

        pub fn final(self: *Self) T {
            return self.value;
        }

        pub fn hash(input: []const u8) T {
            var c = Self.init();
            c.update(input);
            return c.final();
        }
    };
}

This Fnv1a standard library code could be reduced to two functions: hash() and done() and allow the hash to be constructed at compile time or runtime using a variety of scenarios. Let's look at the reduced code, then at the different usage scenarios that are possible with the reduced code for constructing the hash.

pub const Fnv1a_32 = Fnv1a(u32, 0x01000193, 0x811c9dc5);
pub const Fnv1a_64 = Fnv1a(u64, 0x100000001b3, 0xcbf29ce484222325);
pub const Fnv1a_128 = Fnv1a(u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d);

fn Fnv1a( comptime T: type, comptime prime: T, comptime offset: T ) type {
    return struct {
        const Self = @This();

        value: T,

        pub fn done(self: Self) T {
            return self.value;
        }

        pub fn hash(self: ?Self, input: []const u8) Self {
            const s = self orelse Self{ .value = offset };
            var v = s.value;

            for (input) |b| {
                v ^= b;
                v *%= prime;
            }

            return Self{ .value = v };
        }
    };
}

These are the usage scenarios for constructing single-valued and multi-valued hashes.

  1. Single-valued hash

    const fnv1a = Fnv1a_32;
    
    const fb1 = fnv1a.hash( null, "foobarbaz" ).done() ;
  2. Multi-valued hash, separate invocations.

    const fnv1a = Fnv1a_32;
    
    const foo = fnv1a.hash( null, "foo" );
    const bar = fnv1a.hash( foo, "bar" );
    const baz = fnv1a.hash( bar, "baz" );
    const fb2 = baz.done();
  3. Multi-valued, nested invocation.

    const fnv1a = Fnv1a_32;
    
    const fb3 = fnv1a.hash( fnv1a.hash( fnv1a.hash( null, "foo" ), "bar" ), "baz" ).done();
  4. Multi-valued, chained invocation.

    const fnv1a = Fnv1a_32;
    
    const fb4 = fnv1a.hash( null, "foo" ).hash( "bar" ).hash( "baz" ).done();

All these usage scenarios produce the same resulting hash at compile time or runtime without having separate functions that only deal with compile time or runtime. Here is the complete example:

pub const Fnv1a_32 = Fnv1a(u32, 0x01000193, 0x811c9dc5);
pub const Fnv1a_64 = Fnv1a(u64, 0x100000001b3, 0xcbf29ce484222325);
pub const Fnv1a_128 = Fnv1a(u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d);

fn Fnv1a( comptime T: type, comptime prime: T, comptime offset: T ) type {
    return struct {
        const Self = @This();

        value: T,

        pub fn done(self: Self) T {
            return self.value;
        }

        pub fn hash(self: ?Self, input: []const u8) Self {
            const s = self orelse Self{ .value = offset };
            var v = s.value;
            for (input) |b| {
                v ^= b;
                v *%= prime;
            }
            return Self{ .value = v };
        }
    };
}

const fnv1a = Fnv1a_32 ;

const fb1 = fnv1a.hash(null, "foobarbaz").done();
const foo = fnv1a.hash(null, "foo");
const bar = fnv1a.hash(foo, "bar");
const baz = fnv1a.hash(bar, "baz");
const fb2 = baz.done();
const fb3 = fnv1a.hash(fnv1a.hash(fnv1a.hash(null, "foo"), "bar"), "baz").done();
const fb4 = fnv1a.hash(null, "foo").hash("bar").hash("baz").done();

const debug = @import("std").debug;

pub fn main() !void
{
    const fo6 = fnv1a.hash(null, "foo");
    const br6 = fnv1a.hash(fo6, "bar");

    var fb5 : u32 = undefined;
    var fb6 : u32 = undefined;
    var fb7 : u32 = undefined;
    var fb8 : u32 = undefined;

    fb5 = fnv1a.hash(null, "foobarbaz").done();
    fb6 = fnv1a.hash(br6, "baz").done();
    fb7 = fnv1a.hash(fnv1a.hash(fnv1a.hash(null, "foo"), "bar"), "baz").done();
    fb8 = fnv1a.hash(null, "foo").hash("bar").hash("baz").done();

    debug.print("fb1 = 0x{x}\nfb2 = 0x{x}\nfb3 = 0x{x}\nfb4 = 0x{x}\n", .{fb1, fb2, fb3, fb4});
    debug.print("fb5 = 0x{x}\nfb6 = 0x{x}\nfb7 = 0x{x}\nfb8 = 0x{x}\n", .{fb5, fb6, fb7, fb8});
    return;
}

jedisct1 commented 4 months ago

That done() function wouldn't be very convenient, and would be very unusual compared to hash function interfaces found everywhere else.

The incremental interfaces work fine in a comptime context, no need to invent a new interface:

    const hash = comptime h: {
        var h = Fnv1a_32.init();
        h.update("foo");
        h.update("bar");
        break :h h.final();
    };
houghtonap commented 4 months ago

That done() function wouldn't be very convenient, and would be very unusual compared to hash function interfaces found everywhere else.

The incremental interfaces work fine in a comptime context, no need to invent a new interface:

    const hash = comptime h: {
        var h = Fnv1a_32.init();
        h.update("foo");
        h.update("bar");
        break :h h.final();
    };

True, but with the redesigned code I proposed, you don't need to do anything different for a multi-valued hash in the compile-time context. What you pointed out above essentially amounts to a different interface for compile time vs. runtime for a multi-valued hash. It produces the same result, but why should the programmer have a different mental model for compile time vs. runtime in this situation? What I proposed is simply Fnv1a_32.init().hash("foo").hash("bar").done(); in either context. You could change done() to final() and add an init(), as I just did. The original proposal requested that all hash functions be changed to have a more consistent interface, and as I indicated, I believe they could be, with two or three functions: init(), hash(), and either done() or final(), with better usage scenarios. For reference, I made a small change to include an init() for hashes whose setup is a little bulky.

fn Fnv1a( comptime T: type, comptime prime: T, comptime offset: T ) type {
    return struct {
        const Self = @This();

        value: T,

        pub fn init() Self {
            return Self{ .value = offset };
        }

        pub fn done(self: Self) T {
            return self.value;
        }

        pub fn hash(self: ?Self, input: []const u8) Self {
            const s = self orelse Self.init();
            var v = s.value;

            for (input) |b| {
                v ^= b;
                v *%= prime;
            }

            return Self{ .value = v };
        }
    };
}

jedisct1 commented 4 months ago

The interface is already exactly the same, just with a comptime keyword added if we want to evaluate the expression at comptime. init/update/final() is a well known pattern implemented in all programming languages, so being conservative here is nice to avoid confusing users.

A function with a strtok()-like interface that must be called the first time with null is a footgun.

Barring compiler optimizations, this also means that the state would be copied every time, which can have a performance impact at runtime.

This also prevents any further API simplification, such as adding options.

But there's something else. With the proposed API, on a call to hash(), it's impossible to know whether the input is complete or partial. With the existing API, and unlike the streaming API (init/update/final), hash() assumes that the input is complete, so there's no need to duplicate it into a buffer, which enables optimizations. I think xxhash already takes advantage of this.

And if you really want to use a single function call at comptime instead of a comptime block, concatenating slices works (a ++ b ++ c).
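For example (Fnv1a_32 used here for illustration):

```zig
const std = @import("std");

// "foo" ++ "bar" concatenates at comptime, so the one-shot hash()
// sees the complete input in a single call.
const combined = std.hash.Fnv1a_32.hash("foo" ++ "bar");

pub fn main() void {
    std.debug.print("0x{x}\n", .{combined});
}
```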

The current interfaces are indeed inconsistent, because the types require different parameters at initialization time. But that can be solved by adding an Options parameter to init(). There are other inconsistencies, such as how functions and parameters are named, that can be easily fixed.

houghtonap commented 4 months ago

The interface is already exactly the same, just with a comptime keyword added if we want to evaluate the expression at comptime. init/update/final() is a well known pattern implemented in all programming languages, so being conservative here is nice to avoid confusing users.

Uhh... no, the interfaces are not exactly the same. They are not the same for compile-time invocation, nor are they the same for runtime invocation.

The existing standard library interfaces for hash functions have 3 different interfaces:

  1. single-valued hash using hash( ) function for either compile time or runtime.
    const hash = @import( "std" ).hash ;
    const c = hash.Fnv1a_32.hash( "foobarbaz" ) ;
    var v = hash.Fnv1a_32.hash( "foobarbaz" ) ;
  2. multi-valued hash using init( ), update( ) and final( ) functions for runtime.
    const hash = @import( "std" ).hash ;
    var h = hash.Fnv1a_32.init( );
    h.update( "foo" ) ;
    h.update( "bar" ) ;
    h.update( "baz" );
    var v = h.final( ) ;
  3. multi-valued hash using comptime keyword, init( ), update( ) and final( ) functions for compile time.
    const hash = @import( "std" ).hash ;
    const v = comptime h:
    {
        var h = Fnv1a_32.init( ) ;
        h.update( "foo" ) ;
        h.update( "bar" ) ;
        h.update( "baz" ) ;
        break :h h.final( ) ;
    };

In order to explain the existing standard library interface you need to describe these three usage scenarios to the programmer.

Compare that to the proposed interface:

  1. single or multi-valued hash, compile or runtime, using init( ), hash( ) and final( ) (aka done( )) functions.

    const cs = Fnv1a_32.init( ).hash( "foobarbaz" ).final( ) ;
    const cm = Fnv1a_32.init( ).hash( "foo" ).hash( "bar" ).hash( "baz" ).final( ) ;
    
    var vs = Fnv1a_32.init( ).hash( "foobarbaz" ).final( ) ;
    var vm = Fnv1a_32.init( ).hash( "foo" ).hash( "bar" ).hash( "baz" ).final( ) ;

The proposed interface can be explained to the programmer as: invoke .init( ) to setup the hash algorithm, chain and invoke 0 or more times hash( ) to hash a byte array, chain and invoke final( ) to retrieve the final value.

A function with a strtok()-like interface that must be called the first time with null is a footgun.

In the proposed interface, with chaining, you don't need to invoke with null. In a prior example I short-circuited the init( ) function, which wasn't correct, because init( ) is required to ensure the hash algorithm is properly set up. Both the existing and proposed functions for multi-valued hashes require calling init( ) first to properly set up the hash algorithm and final( ) last to retrieve the final value.

Fnv1a_32.init( ).hash( "foobarbaz" ).final( ) ;

Barring compiler optimizations, this also means that the state would be copied every time, which can have a performance impact at runtime.

Performance is always a concern. My initial goal, based on the original proposal, was for all hash functions to have the same signatures and a consistent interface, so you could pass them into other functions just like allocators. I didn't look at performance, but since we could wave hands and discuss hypotheticals endlessly, the only realistic way to settle a performance question is to benchmark. I created a simple benchmark of the existing and proposed interfaces using single and multi-valued hashes in the runtime context. The difference between them is insignificant.

| Iteration | Proposed Single-Valued (ms) | Proposed Multi-Valued (ms) | Existing Single-Valued (ms) | Existing Multi-Valued (ms) |
| --- | --- | --- | --- | --- |
| 01 | 28614 | 37626 | 31508 | 33637 |
| 02 | 28662 | 37670 | 31501 | 33412 |
| 03 | 28835 | 37623 | 31567 | 33332 |
| 04 | 28699 | 37658 | 31494 | 33468 |
| 05 | 28773 | 37648 | 31488 | 33423 |
| 06 | 28639 | 37647 | 31601 | 33345 |
| 07 | 28660 | 37636 | 31497 | 33389 |
| 08 | 28656 | 37642 | 31503 | 33468 |
| 09 | 28671 | 37653 | 31539 | 33441 |
| 10 | 28699 | 37659 | 31485 | 33585 |

Interestingly, the existing standard library Fnv1a hash interface is slower than the proposed interface for what is probably the most common use case. I ran the benchmark multiple times and the result seems to hold; however, as I said, the difference between the existing standard library and the proposed interfaces is insignificant for both single and multi-valued contexts.

Benchmark code:

const std = @import( "std" ) ;

const debug = std.debug ;
const hash  = std.hash ;
const time  = std.time ;

pub const Fnv1a_32  = Fnv1a( u32, 0x01000193, 0x811c9dc5 ) ;
pub const Fnv1a_64  = Fnv1a( u64, 0x100000001b3, 0xcbf29ce484222325 ) ;
pub const Fnv1a_128 = Fnv1a( u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d ) ;

fn Fnv1a( comptime T: type, comptime prime: T, comptime offset: T ) type
{
    return struct
    {

        value : T,

        const Self = @This();

        pub fn init( ) Self
        {
            return Self { .value = offset };
        }

        pub fn final(self: Self) T
        {
            return self.value;
        }

        pub fn hash( self: ?Self, input: []const u8 ) Self
        {
            const s = self orelse Self.init( );
            var v = s.value;

            for ( input ) | b |
            {
                v ^= b;
                v *%= prime;
            }

            return Self { .value = v };
        }
    };
}

pub fn pr_single_value( fnv1a : anytype ) ! void 
{
    for ( 0 .. 10 ) | _ |
    {
        var tsBegin : i64 = undefined ;
        var tsEnd : i64 = undefined ; 

        tsBegin = time.milliTimestamp( ) ;

        for ( 0 .. 1_000_000_000 ) | _ |
        {
            var v: u32 = undefined ;

            v = fnv1a.init( ).hash( "foobarbaz" ).final( ) ;
        }

        tsEnd = time.milliTimestamp( );

        debug.print( "proposed runtime single-valued time diff = {d}\n", .{ tsEnd - tsBegin } ) ;
    }
    return ;
}

pub fn pr_multi_value( fnv1a : anytype ) ! void
{
    for ( 0 .. 10 ) | _ |
    {
        var tsBegin : i64 = undefined ;
        var tsEnd : i64 = undefined ; 

        tsBegin = time.milliTimestamp( ) ;

        for ( 0 .. 1_000_000_000 ) | _ |
        {
            var v: u32 = undefined ;

            v = fnv1a.init( ).hash( "foo" ).hash( "bar" ).hash( "baz" ).final( ) ;
        }

        tsEnd = time.milliTimestamp( ) ;

        debug.print( "proposed runtime multi-valued time diff = {d}\n", .{ tsEnd - tsBegin } ) ;
    }
    return ;
}

pub fn er_single_value( fnv1a : anytype ) ! void
{
    for ( 0 .. 10 ) | _ |
    {
        var tsBegin: i64 = undefined ;
        var tsEnd: i64 = undefined ; 

        tsBegin = time.milliTimestamp( ) ;

        for ( 0 .. 1_000_000_000 ) | _ |
        {
            var v : u32 = undefined ;

            v = fnv1a.hash( "foobarbaz" ) ;
        }

        tsEnd = time.milliTimestamp( ) ;

        debug.print( "existing runtime single-valued time diff = {d}\n", .{ tsEnd - tsBegin } ) ;
    }
    return ;
}

pub fn er_multi_value( fnv1a : anytype ) ! void
{
    for ( 0 .. 10 ) | _ |
    {
        var tsBegin : i64 = undefined ;
        var tsEnd : i64 = undefined ; 

        tsBegin = time.milliTimestamp( ) ;

        for ( 0 .. 1_000_000_000 ) | _ |
        {
            var h = fnv1a.init( ) ;
            var v : u32 = undefined ;

            h.update( "foo" ) ;
            h.update( "bar" ) ;
            h.update( "baz" ) ;

            v = h.final( ) ;
        }

        tsEnd = time.milliTimestamp( ) ;

        debug.print( "existing runtime multi-valued time diff = {d}\n", .{ tsEnd - tsBegin } ) ;
    }
    return ;
}

pub fn main( ) ! void
{
    // benchmark proposed interface

    try pr_single_value( Fnv1a_32 ) ;
    try pr_multi_value( Fnv1a_32 ) ;

    // benchmark existing interface

    try er_single_value( hash.Fnv1a_32 ) ;
    try er_multi_value( hash.Fnv1a_32 ) ;

    return ;
}

This also prevents any further API simplification, such as adding options.

This argument is called future-proofing an API. The crux of it is that you cannot do something now because it is perceived to foreclose some possible future outcome that may or may not occur. It's an indication of resistance to change. Let's put that argument aside and look at adding options. Some hashes allow their seed, primes, and other values to be changed for specific use cases. So how would options be added? The existing standard library Fnv1a hash function provides some insight into handling this:

pub const Fnv1a_32  = Fnv1a( u32, 0x01000193, 0x811c9dc5 ) ;
pub const Fnv1a_64  = Fnv1a( u64, 0x100000001b3, 0xcbf29ce484222325 ) ;
pub const Fnv1a_128 = Fnv1a( u128, 0x1000000000000000000013b, 0x6c62272e07bb014262b821756295c58d ) ;

Basically, options are passed into the function that returns the generic type. In this specific case it was done because the prime and offset values differ for different bit widths, but other hash functions may have only one bit width while still allowing the seed and other values to change for specific use cases; CRC32 comes to mind. Those changes can be handled in exactly the same fashion as the existing standard library Fnv1a hash function. Need to pass a lot of options into the function? Then create a structure interface for them. These types of options occur at the algorithm level.

It is possible that some options may occur at the invocation level. How would these be handled? Neither the standard library hash functions, I believe, nor the proposed interface have invocation level options. However, these could simply be handled by passing in the options to the required init( ) function, if someday these types of options are really needed.

This is basically a non-issue with both the existing or proposed interfaces.

But there's something else. With the proposed API, on a call to hash(), it's impossible to know whether the input is complete or partial. With the existing API, and unlike the streaming API (init/update/final), hash() assumes that the input is complete, so there's no need to duplicate it into a buffer, which enables optimizations.

You are not comparing apples to apples. The existing hash( ) function is one-and-done; there is no equivalent in the proposed interface because there is no need for one. The existing hash interfaces have this specialized one-and-done hash( ) function, which caused the existing standard library to have two separate interfaces that must be explained to the programmer. However, if you compare the multi-valued context for the existing and proposed interfaces, both must follow the same init( ), update( )/hash( ), final( ) pattern.

Existing interface for multi-valued hashes:

    const hash = @import( "std" ).hash ;
    var h = hash.Fnv1a_32.init( );
    h.update( "foo" ) ;
    h.update( "bar" ) ;
    h.update( "baz" );
    var v = h.final( ) ;

Proposed interface for multi-valued hashes:

    var vm = Fnv1a_32.init( ).hash( "foo" ).hash( "bar" ).hash( "baz" ).final( ) ;

Both multi-valued interfaces require you to call init( ) to ensure the hash algorithm's internal state is set up properly, and both require you to call final( ) to retrieve the final value. With each interface you don't know the ending value between calls to update( ) [existing] or hash( ) [proposed]; this is why you are required to call final( ), since the hash algorithm may need to adjust the calculated value before it becomes final. The existing standard library's hash( ) function just invokes the init( ), update( ), final( ) pattern. With the proposed interface you don't need to explain a second, separate interface to the programmer, just: invoke init( ) to set up the hash algorithm, chain and invoke hash( ) zero or more times to hash a byte array, then chain and invoke final( ) to retrieve the final value.

jedisct1 commented 4 months ago

If I understand correctly, all of that can be summarized as "update() could return Self" (update() shouldn't be renamed to hash(); that would be misleading, as H(H(A) ‖ B) ≠ H(A ‖ B)). And for benchmarks, don't forget doNotOptimizeAway() and functions with large states.
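A minimal sketch of that summary: update() keeps its standard name and semantics but returns the new state so calls can chain (the `Chained` generic is a hypothetical name; an FNV-style accumulator is used for brevity):

```zig
const std = @import("std");

fn Chained(comptime T: type, comptime prime: T, comptime offset: T) type {
    return struct {
        const Self = @This();

        value: T = offset,

        // update() keeps its usual meaning (fold more input into the
        // state) but returns the new state to allow chaining.
        pub fn update(self: Self, input: []const u8) Self {
            var v = self.value;
            for (input) |b| {
                v ^= b;
                v *%= prime;
            }
            return .{ .value = v };
        }

        pub fn final(self: Self) T {
            return self.value;
        }
    };
}

const Fnv1a_32 = Chained(u32, 0x01000193, 0x811c9dc5);

pub fn main() void {
    const h = (Fnv1a_32{}).update("foo").update("bar").final();
    std.debug.print("0x{x}\n", .{h});
}
```

Note the state is returned by value on every call, which is the copying cost mentioned above.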

Barring performance concerns, I'm not convinced that it would be an improvement, or that it addresses an actual issue. Writing the comptime keyword in a comptime context is not a different interface.

tiehuis commented 4 months ago

FNV is probably not the most representative hash to stress this API. It's a very simple accumulator that operates per byte, so internally it is not that complex; there are no small-key optimizations. I would be reasonably confident that a block-based hashing function (e.g. Wyhash), which requires keeping internal buffers for the init, update, final flow, will be noticeably worse perf-wise for small keys due to unnecessary internal data copying. A benchmark proving (or disproving) this would be welcome.

I also assume, from trying the benchmarking code, that you used debug mode rather than release. It would be valuable to get a data point in release mode, although the clock setup will need to change, as milliseconds are not sufficient granularity.

See also #15916 for some high-level goals for hashing.

houghtonap commented 4 months ago

FNV is probably not the most representative hash to stress this API. It's a very simple accumulator that operates per byte, so internally it is not that complex; there are no small-key optimizations. I would be reasonably confident that a block-based hashing function (e.g. Wyhash), which requires keeping internal buffers for the init, update, final flow, will be noticeably worse perf-wise for small keys due to unnecessary internal data copying. A benchmark proving (or disproving) this would be welcome.

The short answer is that, in this regard, the proposed interface is not much different from the implementation in the existing standard library. So if there is a performance issue with block hash functions in the standard library, the proposed interface will not change that. Let's survey the differences between the standard library interface and the proposed interface.

The standard library interface (existing):

The proposed interface:

I assumed as well when I tried the benchmarking code that you used debug mode and not release when reproducing. Would be valuable to get a data-point in release mode, although will need to change the clock setup as milliseconds are not sufficient granularity.

Yes, I used debug mode on Windows 10 64-bit with the latest SP, i7-10700 @ 2.9GHz, 32GB RAM. I used milliseconds instead of nanoseconds to reduce interference from background processes on the machine, which is running MySQL and some other heavyweight processes. I could change the benchmark to use nanoseconds and see whether that matters, as well as provide release timings.

See also #15916 for some high-level goals for hashing.

Thanks for the pointer!!

tiehuis commented 4 months ago

because the hash( ) interface is a wrapper for the multi-value interface it requires changes whenever parameters are passed to the init( ) function.

This assumption is false. It is not always a wrapper.

You can see this for example here: https://github.com/ziglang/zig/blob/20b9b54e6b276c993f723f5aa3fafea72409a4fe/lib/std/hash/wyhash.zig#L178-L197

The key benefit of hash() is that the full input length is provided up front, so we can avoid redundant memcpys and perform short-key optimizations.

Assembly comparing the two for reference: https://godbolt.org/z/En9KnGjsf


Just to note that the performance impact on short keys is significant, using the in-tree benchmark.zig.

std

 (master =) $ ./benchmark --filter wyhash
wyhash
  small keys:  32B 16499 MiB/s 540664525 Hashes/s [f3ac0179e7c891a1]

hash replaced by iterative wrapper

--- a/lib/std/hash/wyhash.zig
+++ b/lib/std/hash/wyhash.zig
@@ -177,22 +177,8 @@ pub const Wyhash = struct {

     pub fn hash(seed: u64, input: []const u8) u64 {
         var self = Wyhash.init(seed);
-
-        if (input.len <= 16) {
-            self.smallKey(input);
-        } else {
-            var i: usize = 0;
-            if (input.len >= 48) {
-                while (i + 48 < input.len) : (i += 48) {
-                    self.round(input[i..][0..48]);
-                }
-                self.final0();
-            }
-            self.final1(input, i);
-        }
-
-        self.total_len = input.len;
-        return self.final2();
+        self.update(input);
+        return self.final();
     }
 };
 (master =) $ ./benchmark --filter wyhash
wyhash
  small keys:  32B  2678 MiB/s 87777023 Hashes/s [f3ac0179e7c891a1]

dweiller commented 4 months ago

  • uses a consistent programming model for both single and multi-value hashes which does not require multiple code usage scenarios for compile time vs. runtime contexts for a simplified programming model.

While I disagree that using the comptime keyword when wanting to hash at comptime constitutes a different interface, note that your proposed interface does not remove this requirement compared to the current one. For top-level declarations the comptime keyword is not needed by either interface; the difference is that your proposed interface allows chaining calls in a single statement, which removes the need for a block in a top-level declaration. Inside a function body, the comptime keyword is required to hash at comptime with both.
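For instance, inside a function body both interfaces take the same comptime keyword (Fnv1a_32 used for illustration; since its one-shot hash() just runs the init/update/final pattern, both expressions yield the same value):

```zig
const std = @import("std");

pub fn main() void {
    // Streaming interface, forced to compile time with a comptime block.
    const streamed = comptime blk: {
        var h = std.hash.Fnv1a_32.init();
        h.update("foo");
        h.update("bar");
        break :blk h.final();
    };
    // One-shot interface, forced to compile time with the same keyword.
    const oneshot = comptime std.hash.Fnv1a_32.hash("foobar");
    std.debug.print("0x{x} 0x{x}\n", .{ streamed, oneshot });
}
```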

jedisct1 commented 4 months ago

Being able to chain calls requires all the inputs to be already known when evaluating the expression. Otherwise, an intermediary variable is still required anyways.