pay attention to skani accuracy below 85% ANI

jianshu93 commented 5 months ago

Dear CoverM team,

I remember suggesting skani some time ago, and I am glad to see it included in the newest version. However, you may notice that skani's estimates are inaccurate below 85% ANI, whereas the newest version of fastANI is almost as accurate as blastn-based ANI. See my blog post here: https://jianshu93.github.io/blog/ANI-calculator/

I would suggest that, by default, CoverM switch to skani only for clustering thresholds above 90% ANI and use fastANI for thresholds below 90%, rather than leaving the choice between fastANI and skani to users. For example, if I want to dereplicate at 85% ANI to choose genus-level representatives, fastANI is clearly a better option than skani. Does that make sense? And if there is a paper on CoverM in preparation, I would be happy to contribute and provide the benchmark results that I have been working on.
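The default proposed above can be sketched as a small selection rule. This is only an illustration: `choose_ani_tool` and its `cutoff` parameter are hypothetical names, not part of CoverM, and treating a threshold of exactly 90% as a skani case is an assumption, since the suggestion only covers "above" and "below" 90%.

```python
def choose_ani_tool(ani_threshold: float, cutoff: float = 90.0) -> str:
    """Pick an ANI calculator from the clustering threshold (in percent).

    Hypothetical helper, not a CoverM function. The 90% cutoff comes from
    the suggestion in this thread; sending exactly 90% to skani is an
    assumption made for this sketch.
    """
    if not 0.0 < ani_threshold <= 100.0:
        raise ValueError("ANI threshold must be a percentage in (0, 100]")
    return "skani" if ani_threshold >= cutoff else "fastANI"


# Dereplicating at 85% ANI (genus-level representatives) selects fastANI,
# while a 95% threshold selects skani.
print(choose_ani_tool(85.0))
print(choose_ani_tool(95.0))
```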

Jianshu

wwood commented 5 months ago

Thanks @jianshu93. I suppose you would also suggest requiring FastANI 1.34 and using the -correct flag?

@AroneyS it seems the docs are wrong - https://wwood.github.io/CoverM/coverm-genome.html#dereplication-genome-clustering still talks about FastANI - can you make sure skani is being used by default and the docs reflect this please?

jianshu93 commented 5 months ago

Hello @wwood,

Yes, I would suggest so, since the corrected output is closer to the actual alignment-based ANI. We will update the bioconda channel for the newest --correct option soon.

Jianshu

jianshu93 commented 4 months ago

Hello @wwood, the pre-clustering default is now skani, which is very dangerous because skani outputs 0 below 82% ANI. I would suggest using the finch implementation of MinHash instead (essentially the same as Mash, without over-sketching). Above 90% ANI, skani is as good as fastANI, so I think it is fine to use there.

For pre-clustering, I would suggest BinDash (https://github.com/zhaoxiaofei/bindash); I am developing BinDash version 2 with Xiaofei. BinDash is as accurate as Mash but 100 to 1000 times faster than Mash, Dashing, and finch, thanks to a theoretical breakthrough called b-bit one-permutation MinHash with optimal/faster densification. BinDash can also be easily installed via bioconda. In theory, Dashing has the largest variance, followed by Mash, and BinDash has the smallest. BinDash also has the locality-sensitive hashing property, which neither of the other two has.

This idea is also implemented in my software gsearch, for search and classification of genomes (to be published soon). See gsearch here (https://gitlab.com/Jianshu_Zhao/gsearch); it can be installed via bioconda. Let me know if you also want to include gsearch in CoverM, to classify genomes with extreme speed (almost in seconds).
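To make the b-bit one-permutation MinHash idea mentioned above more concrete, here is a minimal sketch. It is not BinDash's code: all function names, the BLAKE2b hash, the default parameters, and the simplified re-hashing densification are assumptions chosen for illustration. The Jaccard-to-distance conversion is the Mash distance formula.

```python
import hashlib
import math


def hash64(text: str) -> int:
    """64-bit hash of a string (BLAKE2b is an arbitrary choice here)."""
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), "little")


def densify(sketch):
    """Fill empty bins by copying from a pseudo-randomly probed non-empty bin.

    Simplified densification: the probe sequence depends only on the bin
    index and attempt number, so two genomes probe bins in the same order.
    """
    if all(v is None for v in sketch):
        raise ValueError("no k-mers to sketch (sequence shorter than k?)")
    out = list(sketch)
    for i in range(len(sketch)):
        attempt = 0
        while out[i] is None:
            attempt += 1
            out[i] = sketch[hash64(f"{i}:{attempt}") % len(sketch)]
    return out


def oph_sketch(seq: str, k: int = 16, bins: int = 64, bits: int = 8):
    """One-permutation MinHash: hash each k-mer once, keep the minimum per bin."""
    sketch = [None] * bins
    for i in range(len(seq) - k + 1):
        h = hash64(seq[i:i + k])
        b, v = h % bins, h // bins  # bin index and value within that bin
        if sketch[b] is None or v < sketch[b]:
            sketch[b] = v
    # b-bit step: keep only the lowest `bits` bits of each bin minimum.
    return [v & ((1 << bits) - 1) for v in densify(sketch)]


def jaccard_bbit(a, b, bits: int = 8) -> float:
    """Jaccard estimate, corrected for chance collisions of b-bit values."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of bins")
    p = sum(x == y for x, y in zip(a, b)) / len(a)
    c = 1.0 / (1 << bits)
    return max(0.0, (p - c) / (1.0 - c))


def mash_distance(j: float, k: int = 16) -> float:
    """Mash distance D = -ln(2j / (1 + j)) / k; 1.0 when nothing is shared."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)
```

An approximate ANI in percent would then be `100 * (1 - mash_distance(jaccard_bbit(s1, s2)))` for two sketches `s1` and `s2` built with the same `k`, `bins`, and `bits`. Each k-mer is hashed only once regardless of sketch size, which is where one-permutation schemes save time over classic MinHash with many hash functions.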

Thanks,

Jianshu

wwood / CoverM

pay attention to skani accuracy below 85% ANI #199