ydb-platform / ydb

YDB is an open source Distributed SQL Database that combines high availability and scalability with strong consistency and ACID transactions
https://ydb.tech
Apache License 2.0
4k stars 567 forks source link

get rid of std::hash #11591

Open lll-phill-lll opened 11 hours ago

lll-phill-lll commented 11 hours ago

As was noted by @vladl2802 in #11416 we have non even distribution of values between buckets while spilling.

The root couse of it is that we rely on a hash function here: https://github.com/ydb-platform/ydb/blob/6dac5e0e841e4c2ec2d57f437433eb14f716781c/yql/essentials/minikql/comp_nodes/mkql_wide_combine.cpp#L489

which appears to be std::hash which just returns the value itself: https://godbolt.org/z/es8dxMGeY

Hash function is set here: https://github.com/ydb-platform/ydb/blob/c8e6180cc6fc2d5276882e6a40cb4f7e189db42c/yql/essentials/public/udf/udf_type_ops.h#L44

As a temp measure we change the algorithm of bucket selection from hash%128 to XXHASH(hash)%128. pr: #11471

Also, with std::hash we can face compatibility issues while changing MKQL_RUNTIME version.

So, the proposal of this task is to change std::hash to some other hash function. Hash functions to consider: rh hash: https://github.com/ydb-platform/ydb/blob/c8e6180cc6fc2d5276882e6a40cb4f7e189db42c/yql/essentials/minikql/comp_nodes/mkql_rh_hash.h#L219 xxhash: https://github.com/Cyan4973/xxHash. We already use xxhash in GraceJoin: https://github.com/ydb-platform/ydb/blob/c8e6180cc6fc2d5276882e6a40cb4f7e189db42c/yql/essentials/minikql/comp_nodes/mkql_grace_join_imp.cpp#L78

lll-phill-lll commented 10 hours ago

Same problem we had in the hash shuffle algorithm for channels. We were trying to fix it here: #4364. But had to revert it because of the compatibility issues

yumkam commented 9 hours ago

XXH3_64Bits (from same xxh library) claimed to provide better performance for small-inputs (I have not benchmarked it)