Open jianshu93 opened 1 year ago
Hello Ben,
It has been a long time since my last message. I just want to check whether my earlier suggestions make sense to you. In my experience, bindash is by far the fastest and also the most accurate MinHash-like algorithm, and it is better than HyperLogLog because of its smaller variance. Calling bindash instead of dashing should be very easy: the output is the Mash distance, so 1 - distance gives the ANI used for clustering.
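To make the conversion concrete, here is a rough sketch of how a Jaccard estimate becomes a Mash distance and then an approximate ANI. It uses the formula from the Mash paper, D = -ln(2j / (1 + j)) / k. The function names are made up for illustration and this is not bindash's own code.

```python
import math

def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance from a Jaccard estimate and k-mer size (hypothetical helper)."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(1.0, max(0.0, d))  # clamp to [0, 1]

def approx_ani(jaccard: float, k: int) -> float:
    """Approximate ANI as 1 - Mash distance."""
    return 1.0 - mash_distance(jaccard, k)
```

A clustering step could then threshold on `approx_ani(j, 16)`, for example at 0.95.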
Thanks,
Jianshu
Hello Ben,
I have been investigating MinHash algorithms heavily over the past several months. For simple MinHash, that is, estimating Jaccard similarity in the traditional manner, b-bit One Permutation MinHash with optimal densification (https://dl.acm.org/doi/abs/10.1145/1772690.1772759, https://proceedings.neurips.cc/paper/2012/file/eaa32c96f620053cf442ad32258076b9-Paper.pdf, http://proceedings.mlr.press/v70/shrivastava17a.html) is the most space- and time-efficient algorithm of all, including HyperLogLog. It is implemented in the bindash software (https://academic.oup.com/bioinformatics/article/35/4/671/5058094). Since Xiaofei left academia, it has not been developed further the way dashing has (dashing 2, for example). However, in several experiments, e.g. all-versus-all distance computation for all NCBI genomes, bindash was the fastest tool I have ever seen, about 2 times faster than dashing (I use k-mer size 16 and sketch size 12000 to be accurate at the 95% ANI level). It supports only nucleotide sequences, not amino acid sequences as dashing and Mash do. I would suggest not using finch because it is memory inefficient for large numbers of genomes. What do you think?
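For reference, here is a minimal sketch of the idea in Python. It is a simplified illustration, not bindash's implementation: the function names, the BLAKE2b hash, the modulo-based binning, and the simple 2^-b collision correction are all my own assumptions. Each k-mer is hashed once, the hash selects a bin and a value, each bin keeps its minimum, empty bins borrow from a pseudo-randomly chosen non-empty bin (densification), and only the lowest b bits of each value are stored.

```python
import hashlib

def _hash64(data: bytes) -> int:
    # 64-bit hash; BLAKE2b is an arbitrary choice for this illustration.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")

def kmers(seq: str, k: int) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def oph_sketch(items, num_bins: int, b_bits: int) -> list:
    """b-bit one-permutation sketch with a simplified densification step."""
    bins = [None] * num_bins
    for item in items:
        h = _hash64(item.encode())
        idx, val = h % num_bins, h // num_bins  # one hash gives bin and value
        if bins[idx] is None or val < bins[idx]:
            bins[idx] = val
    if all(v is None for v in bins):
        raise ValueError("cannot sketch an empty set")
    dense = list(bins)
    for i, v in enumerate(bins):
        attempt = 0
        while dense[i] is None:
            # Probe bins in a hash-determined order until a non-empty one is hit.
            attempt += 1
            j = _hash64(f"{i}:{attempt}".encode()) % num_bins
            dense[i] = bins[j]
    mask = (1 << b_bits) - 1
    return [v & mask for v in dense]  # keep only the lowest b bits

def jaccard_estimate(s1: list, s2: list, b_bits: int) -> float:
    """Matching fraction, corrected for chance collisions of b-bit values."""
    m = sum(a == b for a, b in zip(s1, s2)) / len(s1)
    c = 2.0 ** -b_bits  # approximate chance-collision probability
    return max(0.0, (m - c) / (1.0 - c))
```

With the settings mentioned above, this would be called as `oph_sketch(kmers(genome, 16), 12000, b_bits)` for each genome, with `b_bits` chosen by the user, followed by `jaccard_estimate` on each pair.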
Thanks,
Jianshu