comparison with BinDash or GSearch?

jianshu93 commented 8 months ago

Hello Hyper-Gen team,

A newer MinHash-like algorithm, one permutation MinHash with optimal/faster densification, is by far the fastest of such algorithms and reaches their theoretical limit. It is implemented in BinDash (https://github.com/zhaoxiaofei/bindash). Comparing only with fastANI, Mash, and Dashing 2 is not enough. (By the way, the Dashing comparison with BinDash is essentially wrong: in theory HyperLogLog cannot be more accurate than MinHash at the same sketch size, and in practice HyperLogLog is 100 times less accurate than Mash/BinDash for small Jaccard values such as 0.01; see the theoretical analysis in the book Probabilistic Data Structures and Algorithms, Andrii Gakhov, 2022.) For search, have you compared with skani (https://github.com/bluenote-1577/skani) or GSearch (https://www.biorxiv.org/content/10.1101/2022.10.21.513218v3)? That said, I believe skani's speed is largely an artifact of quick filtering, which skips many comparisons, so it is not actually an all-versus-all comparison.

Thanks,

Jianshu
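
For readers unfamiliar with the technique named in the comment above, here is a minimal Python sketch of one permutation hashing with a simplified densification step. It is an illustration under stated assumptions, not BinDash's implementation: the names `h64`, `oph_sketch`, and `oph_jaccard` are hypothetical, BLAKE2b stands in for the faster hash functions real tools use, and the densification rule (each empty bin borrows from a non-empty bin chosen by hashing the bin index and an attempt counter) is a simplified reading of the "optimal densification" idea.

```python
import hashlib


def h64(data: str) -> int:
    """Deterministic 64-bit hash (BLAKE2b is a stand-in chosen for this sketch)."""
    return int.from_bytes(hashlib.blake2b(data.encode(), digest_size=8).digest(), "big")


def oph_sketch(items, k=1024):
    """One permutation hashing: hash each item once and keep the minimum per bin."""
    bins = [None] * k
    for item in items:
        v = h64(item)
        b, r = v % k, v // k  # bin index, value within the bin
        if bins[b] is None or r < bins[b]:
            bins[b] = r
    if all(x is None for x in bins):
        raise ValueError("cannot sketch an empty set")

    # Densification (simplified): every empty bin borrows the value of a
    # non-empty bin. The probe sequence depends only on (bin index, attempt),
    # so two sketches probe the same bins in the same order.
    dense = list(bins)
    for i in range(k):
        attempt = 0
        while dense[i] is None:
            j = h64(f"{i}:{attempt}") % k
            if bins[j] is not None:
                dense[i] = bins[j]
            attempt += 1
    return dense


def oph_jaccard(sa, sb):
    """Estimate Jaccard similarity as the fraction of bins that agree."""
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)


# Usage: two sets of hypothetical k-mer labels with true Jaccard 0.5.
a = {f"kmer{i}" for i in range(3000)}
b = {f"kmer{i}" for i in range(1000, 4000)}
estimate = oph_jaccard(oph_sketch(a), oph_sketch(b))
```

Each item is hashed once regardless of the sketch size, which is where the speed advantage over k independent hash functions comes from in this formulation.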

wh-xu commented 5 months ago

Hi Jianshu,

Thanks for these comments. We added the following new benchmarking results:

BinDash
skani
Dashing 2 with weighted mode

Please check our updated paper on bioRxiv.

Best, Weihong

jianshu93 commented 5 months ago

Hi @wh-xu, I think it is also worth mentioning that FracMinHash (or universal minimizer) has no theoretical guarantee when the scale factor is less than 1: its RMSE does not converge to 0, whereas for MinHash (bottom-k in Mash, or one permutation hashing with optimal/faster densification in BinDash 2) the RMSE converges to 0 as sketch size increases. This is discussed in many places, including several recent papers (Sahlin et al., 2023). The gold standard for ANI is blastn/usearch ANI, as discussed in the OrthoANI paper (https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.000760). ANIm is inflated below 90% ANI; see Figure 1 of this benchmark paper: https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/ijsem.0.004124. I do not have a big problem with the paper covering only ANI > 85%, but I think this is worth mentioning in the discussion.

Thanks,

Jianshu
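
To make the two sketch types contrasted in the comment above concrete, here is a minimal Python sketch of bottom-k MinHash next to FracMinHash. It only shows how each sketch is built and queried; it does not reproduce the RMSE analysis mentioned in the comment. All function names (`h64`, `bottom_k`, `frac_minhash`, `jaccard_bottom_k`, `jaccard_frac`) are hypothetical, BLAKE2b is a stand-in hash, and the `scale` parameter is assumed to mean the fraction of the hash space that is retained.

```python
import hashlib

MAX_HASH = 2**64


def h64(s: str) -> int:
    """Deterministic 64-bit hash (BLAKE2b is a stand-in chosen for this sketch)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def bottom_k(items, k):
    """Bottom-k sketch: the k smallest distinct hash values (fixed size)."""
    return sorted({h64(x) for x in items})[:k]


def frac_minhash(items, scale):
    """FracMinHash sketch: every hash below scale * MAX_HASH (size grows with the set)."""
    threshold = int(scale * MAX_HASH)
    return {h for h in (h64(x) for x in items) if h < threshold}


def jaccard_bottom_k(sa, sb, k):
    """Among the k smallest hashes of the merged sketches, the fraction present in both."""
    set_a, set_b = set(sa), set(sb)
    merged = sorted(set_a | set_b)[:k]
    return sum(1 for h in merged if h in set_a and h in set_b) / len(merged)


def jaccard_frac(fa, fb):
    """Plain Jaccard similarity computed on the retained hashes."""
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


# Usage: two sets of hypothetical k-mer labels with true Jaccard 0.5.
a = {f"kmer{i}" for i in range(3000)}
b = {f"kmer{i}" for i in range(1000, 4000)}
est_bottom_k = jaccard_bottom_k(bottom_k(a, 1000), bottom_k(b, 1000), 1000)
est_frac = jaccard_frac(frac_minhash(a, 0.25), frac_minhash(b, 0.25))
```

The structural difference is that the bottom-k sketch always holds k values, while the FracMinHash sketch holds roughly a `scale` fraction of the distinct items.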

jianshu93 commented 5 months ago

Hi @wh-xu, are you still at UCSD? I am starting a postdoc at UCSD in September, but I do not know anybody there. It would be nice if you could add me to some Chinese WeChat groups (ID: Jianshu_Zhao) for renting/housing, etc. (Forget about it if you are not part of the Chinese community.)

Thank you,

Jianshu

wh-xu commented 5 months ago

Thanks for the suggestions! I will take a look and add them to the next version of the paper.

Our use of FracMinHash is based on the analysis in: https://www.biorxiv.org/content/10.1101/2022.01.11.475870v4.full

It would be great if you could give some comments on, or a comparison with, this analysis. Thanks

wh-xu / Hyper-Gen

comparison with BinDash or GSearch? #3