wwood / galah

More scalable dereplication for metagenome assembled genomes
GNU General Public License v3.0

Sketch size in pre-clustering for more distantly related genomes #14

Open jianshu93 opened 2 years ago

jianshu93 commented 2 years ago

Hello Ben,

If the pre-cluster ANI computed by finch is used for distantly related genomes, say 80% ANI, will the default Mash sketch size (1000) be enough? Oversketch is not used in galah, right? It was designed for metagenomes.

Thanks,

Jianshu

wwood commented 2 years ago

Good question. I'm not actually sure. Perhaps safer to avoid pre-clustering entirely? I'll try to work on this, but will be some time I imagine.

jianshu93 commented 2 years ago

I will do that until you figure it out. I was asking because in the FastANI paper, Mash distance only approximates ANI/FastANI well when the sketch size is larger than 10^4 or 10^5 (Table 2), and it loses accuracy below 90% ANI (Figure 1(a)). But as you can see in Figure 1(a), down to 80% ANI, Mash is still quite good with a sketch size of 10^5. So I think that with the default Mash sketch size (1000), there might be some problems for dereplication. We should use at least 10^4 to get a good correlation. Maybe add parameters for a user-provided sketch size and k-mer size in finch.
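To make the sketch-size concern concrete, here is a minimal sketch based on the standard Mash distance formula, D = -(1/k) * ln(2j / (1 + j)), where j is the Jaccard index of the two sketches. It assumes ANI is approximately 1 - D and uses k = 21, the k-mer length quoted later in this thread. The function names are illustrative only and are not part of finch or galah.

```rust
/// Mash distance from a Jaccard index `jaccard` and k-mer length `k`:
/// D = -(1/k) * ln(2j / (1 + j)).
fn mash_distance(jaccard: f64, k: u32) -> f64 {
    if jaccard <= 0.0 {
        // No shared hashes: Mash reports the maximum distance of 1.
        return 1.0;
    }
    -(2.0 * jaccard / (1.0 + jaccard)).ln() / f64::from(k)
}

/// Inverse of the formula above: the Jaccard index that yields `distance`.
fn jaccard_for_distance(distance: f64, k: u32) -> f64 {
    // x = 2j / (1 + j), so j = x / (2 - x).
    let x = (-distance * f64::from(k)).exp();
    x / (2.0 - x)
}

/// Expected number of shared hashes between two sketches of `sketch_size`
/// hashes for genomes at the given ANI (assumption: ANI ~= 1 - D).
fn expected_shared_hashes(ani: f64, k: u32, sketch_size: usize) -> f64 {
    jaccard_for_distance(1.0 - ani, k) * sketch_size as f64
}
```

Under these assumptions, `expected_shared_hashes(0.80, 21, 1000)` comes out at about 7.6 shared hashes, while a sketch of 10^4 hashes gives about 76. With so few shared hashes at sketch size 1000, a difference of one or two hashes shifts the estimated distance noticeably, which is consistent with the accuracy loss described for small sketches.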

Thanks,

Jianshu

wwood commented 2 years ago

Ah right - thanks for the tips - helpful.

jianshu93 commented 2 years ago

The workaround for now is to edit the following block in finch (sketch directory, sketch_schem, mod.rs):

    impl Default for SketchParams {
        fn default() -> Self {
            SketchParams::Mash {
                kmers_to_sketch: 10000,
                final_size: 1000,
                no_strict: false,
                kmer_length: 21,
                hash_seed: 0,
            }
        }
    }

This changes the default k-mer sketch value from 1000 to 10000. After making that change in the ../finch directory, change the finch line in galah's Cargo.toml from finch = "0.3.*" to finch = { path = "../finch" } so that galah builds against the local copy. If we want to add k-mer and sketch size options, that will need some more coding.
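As a rough illustration of what user-provided options could look like, the sketch below builds the parameters from optional values instead of editing the hard-coded default. The `SketchParams` enum here is a local stand-in that only mirrors the fields quoted above (the field types are assumptions), and `mash_params_from_options` and its validation rules are hypothetical. None of this is finch's or galah's actual API.

```rust
/// Local stand-in mirroring the fields quoted in this thread.
/// NOT finch's own type; the field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
enum SketchParams {
    Mash {
        kmers_to_sketch: usize,
        final_size: usize,
        no_strict: bool,
        kmer_length: u8,
        hash_seed: u64,
    },
}

impl Default for SketchParams {
    fn default() -> Self {
        // Values copied from the modified default shown above.
        SketchParams::Mash {
            kmers_to_sketch: 10000,
            final_size: 1000,
            no_strict: false,
            kmer_length: 21,
            hash_seed: 0,
        }
    }
}

/// Hypothetical helper: override the defaults with user-supplied values.
/// `None` keeps the default for that field.
fn mash_params_from_options(
    kmers_to_sketch: Option<usize>,
    final_size: Option<usize>,
    kmer_length: Option<u8>,
) -> Result<SketchParams, String> {
    let SketchParams::Mash {
        kmers_to_sketch: default_kmers,
        final_size: default_final,
        no_strict,
        kmer_length: default_k,
        hash_seed,
    } = SketchParams::default();

    let kmers_to_sketch = kmers_to_sketch.unwrap_or(default_kmers);
    let final_size = final_size.unwrap_or(default_final);
    let kmer_length = kmer_length.unwrap_or(default_k);

    // Illustrative sanity checks (assumptions, not finch's rules).
    if final_size == 0 || kmer_length == 0 {
        return Err("sketch size and k-mer length must be non-zero".to_string());
    }
    if final_size > kmers_to_sketch {
        return Err("final_size must not exceed kmers_to_sketch".to_string());
    }

    Ok(SketchParams::Mash {
        kmers_to_sketch,
        final_size,
        no_strict,
        kmer_length,
        hash_seed,
    })
}
```

For example, `mash_params_from_options(Some(10000), Some(10000), None)` would request a final sketch of 10^4 hashes while keeping k = 21. Passing the values through from command-line flags would avoid having to patch a local copy of finch for each setting.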

Is there anything else you think could be added?

Many thanks,

Jianshu