Open jianshu93 opened 2 years ago
Good question. I'm not actually sure. Perhaps safer to avoid pre-clustering entirely? I'll try to work on this, but will be some time I imagine.
I will do that until you figure it out. I was asking because in the FastANI paper, Mash distance only approximates ANI/FastANI well when the sketch size is 10^4 or 10^5 (table 2), and it loses accuracy below 90% ANI (figure 1 (a)). But as you can see in figure 1 (a), Mash is still quite good down to 80% ANI with a sketch size of 10^5. So I think that with the default Mash sketch size (1000) there might be some problems for dereplication. We should use at least 10^4 to get a good correlation. Maybe add a parameter so users can provide the sketch size and k-mer size passed to finch.
Thanks,
Jianshu
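For reference, a minimal sketch of why sketch size matters at low ANI. It uses the Mash distance formula from the Mash paper, D = -(1/k) ln(2j / (1 + j)), and its inverse, and treats ANI as roughly 1 - D. The function names are made up for illustration and are not part of finch or galah.

```rust
/// Mash distance from a Jaccard estimate `j` and k-mer length `k`:
/// D = -(1/k) * ln(2j / (1 + j)). Hypothetical helper, not a finch API.
fn mash_distance(j: f64, k: u32) -> f64 {
    if j <= 0.0 {
        return 1.0; // no shared hashes: distance is capped at 1
    }
    let d = -(2.0 * j / (1.0 + j)).ln() / k as f64;
    d.clamp(0.0, 1.0)
}

/// Inverse of the formula above: the Jaccard index expected at distance `d`.
fn expected_jaccard(d: f64, k: u32) -> f64 {
    let e = (-(k as f64) * d).exp();
    e / (2.0 - e)
}

fn main() {
    let k = 21;
    for ani in [0.95, 0.90, 0.80] {
        // Assumption: ANI is approximated as 1 - Mash distance.
        let j = expected_jaccard(1.0 - ani, k);
        for sketch_size in [1_000u32, 10_000, 100_000] {
            println!(
                "ANI {:.2}: Jaccard {:.5}, expected shared hashes with sketch size {}: {:.1}",
                ani,
                j,
                sketch_size,
                j * sketch_size as f64
            );
        }
    }
}
```

With k = 21 and 80% ANI, the expected Jaccard index is below 0.01, so a sketch of 1000 hashes is expected to share fewer than ten of them. The estimate then rests on very few matches, which fits the observation that larger sketches are needed at that range.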
Ah right - thanks for the tips - helpful.
The solution for now is in the finch sketch directory (sketch_schemes), in mod.rs, at:
    impl Default for SketchParams {
        fn default() -> Self {
            SketchParams::Mash {
                kmers_to_sketch: 10000,
                final_size: 1000,
                no_strict: false,
                kmer_length: 21,
                hash_seed: 0,
            }
        }
    }
That is, change the default number of k-mers to sketch from 1000 to 10000. Then change Cargo.toml from "finch = "0.3.*"" to "finch = { path = "../finch"}", after making the change above in the ../finch directory. If we want to add k-mer and sketch size options, that will need some more coding.
Is there anything you think could be added?
Many thanks,
Jianshu
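As a rough idea of what user-provided options could look like instead of editing the default, here is a minimal sketch. The enum below is a local stand-in that copies the field names from the snippet above. The field types and the helper mash_params are assumptions for illustration and are not taken from finch.

```rust
// Local stand-in for the `SketchParams::Mash` variant quoted above.
// Field names come from that snippet; the types are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum SketchParams {
    Mash {
        kmers_to_sketch: usize,
        final_size: usize,
        no_strict: bool,
        kmer_length: u8,
        hash_seed: u64,
    },
}

impl Default for SketchParams {
    fn default() -> Self {
        // Same values as the modified default shown in the thread.
        mash_params(10000, 1000, 21)
    }
}

/// Hypothetical helper: build Mash parameters from user-supplied values
/// (e.g. parsed from command-line flags) instead of hard-coding them.
fn mash_params(kmers_to_sketch: usize, final_size: usize, kmer_length: u8) -> SketchParams {
    SketchParams::Mash {
        kmers_to_sketch,
        final_size,
        no_strict: false,
        kmer_length,
        hash_seed: 0,
    }
}

fn main() {
    // Values a user might pass for distantly related genomes.
    let custom = mash_params(100_000, 10_000, 21);
    println!("default: {:?}", SketchParams::default());
    println!("custom:  {:?}", custom);
}
```

Keeping the values in a constructor like this would let a caller expose them as options, so changing the sketch size would not need a patched local copy of the dependency.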
Hello Ben,
If the pre-cluster ANI estimated by finch is used for distantly related genomes, say 80% ANI, will the default Mash sketch size (1000) be enough? Oversketch is not used in galah, right? It was designed for metagenomes.
Thanks,
Jianshu