Open lll-phill-lll opened 11 hours ago
We had the same problem in the hash shuffle algorithm for channels. We tried to fix it in #4364, but had to revert the change because of compatibility issues.
XXH3_64bits (from the same xxh library) is claimed to provide better performance for small inputs (I have not benchmarked it).
As @vladl2802 noted in #11416, values are distributed unevenly between buckets while spilling.
The root cause is that we rely on a hash function here: https://github.com/ydb-platform/ydb/blob/6dac5e0e841e4c2ec2d57f437433eb14f716781c/yql/essentials/minikql/comp_nodes/mkql_wide_combine.cpp#L489
which appears to be std::hash which just returns the value itself: https://godbolt.org/z/es8dxMGeY
Hash function is set here: https://github.com/ydb-platform/ydb/blob/c8e6180cc6fc2d5276882e6a40cb4f7e189db42c/yql/essentials/public/udf/udf_type_ops.h#L44
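To make the identity behaviour easy to check locally, here is a minimal sketch. The helper names (LooksLikeIdentity, StdHashLooksLikeIdentity) are hypothetical and not part of YDB. Note that the C++ standard does not require std::hash to be the identity for integers, so the result depends on the standard library in use.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: returns true if `hasher` maps every sample key to itself.
template <typename THasher>
bool LooksLikeIdentity(THasher hasher, const std::vector<std::uint64_t>& samples) {
    for (std::uint64_t key : samples) {
        if (static_cast<std::uint64_t>(hasher(key)) != key) {
            return false;
        }
    }
    return true;
}

// Checks the standard library's integer hasher on a few sample keys.
// The outcome is implementation-defined, so treat it as a probe, not a guarantee.
inline bool StdHashLooksLikeIdentity() {
    const std::vector<std::uint64_t> samples = {0, 1, 42, 128, 1000003, 1ULL << 40};
    return LooksLikeIdentity(std::hash<std::uint64_t>{}, samples);
}
```

If StdHashLooksLikeIdentity() returns true on the target toolchain, any structure in the keys (for example, keys that are all multiples of the bucket count) passes straight through to bucket selection.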
As a temporary measure, we change the bucket selection algorithm from hash % 128 to XXHASH(hash) % 128. PR: #11471.
Also, with std::hash we can face compatibility issues when changing the MKQL_RUNTIME version.
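A rough sketch of why the extra hashing step helps, under some assumptions: Mix64 below is the SplitMix64 finalizer used as a stand-in so the example has no external dependencies; it is NOT xxHash and not what #11471 uses. All function names here are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t NumBuckets = 128;

// Stand-in 64-bit mixer (SplitMix64 finalizer). Assumption: used only to
// illustrate the effect of re-hashing; the real code would call xxHash.
inline std::uint64_t Mix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Old scheme: the bucket is taken directly from the (possibly identity) hash.
inline std::size_t BucketDirect(std::uint64_t hash) { return hash % NumBuckets; }

// Temporary scheme: mix the hash first, then take the bucket.
inline std::size_t BucketMixed(std::uint64_t hash) { return Mix64(hash) % NumBuckets; }

// Counts how many distinct buckets receive at least one key.
template <typename TSelect>
std::size_t CountUsedBuckets(const std::vector<std::uint64_t>& hashes, TSelect select) {
    std::array<bool, NumBuckets> used{};
    std::size_t count = 0;
    for (std::uint64_t h : hashes) {
        const std::size_t bucket = select(h);
        if (!used[bucket]) {
            used[bucket] = true;
            ++count;
        }
    }
    return count;
}

// Worst case for an identity hash: keys that are all multiples of NumBuckets.
inline std::vector<std::uint64_t> MakeStridedKeys(std::size_t n) {
    std::vector<std::uint64_t> keys;
    keys.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        keys.push_back(static_cast<std::uint64_t>(i) * NumBuckets);
    }
    return keys;
}
```

With an identity hash, CountUsedBuckets(MakeStridedKeys(10000), BucketDirect) uses a single bucket, because every key is a multiple of 128. With BucketMixed the same keys should spread over most of the 128 buckets.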
So, the proposal of this task is to change std::hash to some other hash function. Hash functions to consider:
- rh hash: https://github.com/ydb-platform/ydb/blob/c8e6180cc6fc2d5276882e6a40cb4f7e189db42c/yql/essentials/minikql/comp_nodes/mkql_rh_hash.h#L219
- xxhash: https://github.com/Cyan4973/xxHash. We already use xxhash in GraceJoin: https://github.com/ydb-platform/ydb/blob/c8e6180cc6fc2d5276882e6a40cb4f7e189db42c/yql/essentials/minikql/comp_nodes/mkql_grace_join_imp.cpp#L78