Question about coverM settings (not an issue)

patriciatran commented 2 years ago

Hello,

I am trying to follow the methods in this paper. It looks like they used BamM but on the BamM page it says to refer to CoverM instead.

I am trying to find the settings for this section specifically to reproduce them with my dataset:

Thresholds for defining (≥10 Kbp, ≥95% global identity) and detecting (≥75% of the contig length covered ≥1x by reads recruited at ≥90% average nucleotide identity) viral populations (vOTUs) were implemented in accordance with benchmarking and community consensus recommendations

Would the settings under coverm contig be correct?

≥10 Kbp: --min-read-aligned-length 10000
≥95% global identity: --min-read-percent-identity 95
≥75% of the contig length covered: --min-read-aligned-percent 75
≥1x by reads recruited: ??
≥90% average nucleotide identity: ??

I was wondering whether the appropriate approach would be to run coverm cluster --ani 90 first and then run coverm contig.

Thanks for letting me know if you have any tips.

Best,

Patricia

wwood commented 2 years ago

I think this has to be done in a few steps:

1. Dereplicate at 95%. This is currently a bit awkward with CoverM because it only clusters genomes, so you have to specify each contig as a separate genome.
2. Find contigs with ≥75% of their length covered ≥1x by reads recruited at ≥90% average nucleotide identity. Use --min-covered-fraction 75 --min-read-percent-identity 90. Maybe use --min-read-percent-identity 95 instead; the quoted methods are not 100% clear on this.
3. Do the final mapping with --min-read-percent-identity 95 against the contigs that pass step 2.
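
To make the bookkeeping around the first two steps concrete, here is a minimal Python sketch. It is not part of CoverM. The helper names `split_fasta` and `detected_contigs` are hypothetical, as is the table layout: contig name in the first column, then one covered-fraction value per sample on a 0 to 1 scale. The first helper writes each contig of at least 10 Kbp to its own FASTA file so that each one can be passed to a genome-level clustering tool as a separate genome. The second keeps contigs whose covered fraction reaches 75% in at least one sample. Contig names are assumed to be safe to use as file names.

```python
import csv
import io
import tempfile
from pathlib import Path


def split_fasta(fasta_text, out_dir, min_length=10_000):
    """Write each contig with at least min_length bases to its own FASTA file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Parse the FASTA text into (name, sequence) records.
    records, name, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the contig name.
            name, chunks = line[1:].split()[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    # One file per contig that passes the length threshold.
    paths = []
    for name, seq in records:
        if len(seq) >= min_length:
            path = out_dir / f"{name}.fna"
            path.write_text(f">{name}\n{seq}\n")
            paths.append(path)
    return paths


def detected_contigs(tsv_text, min_covered_fraction=0.75):
    """Return contigs whose covered fraction meets the threshold in any sample."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header row
    passed = []
    for row in reader:
        if not row:
            continue
        # Assumed layout: contig name, then one covered-fraction value per sample.
        if any(float(v) >= min_covered_fraction for v in row[1:]):
            passed.append(row[0])
    return passed


# Toy inputs; the length threshold is lowered so the example stays small.
fasta = ">contig_1 demo\n" + "ACGT" * 3 + "\n>contig_2\nAC\n"
with tempfile.TemporaryDirectory() as tmp:
    written = [p.name for p in split_fasta(fasta, tmp, min_length=10)]
print(written)  # ['contig_1.fna']

table = (
    "Contig\tsampleA Covered Fraction\tsampleB Covered Fraction\n"
    "contig_1\t0.92\t0.10\n"
    "contig_2\t0.40\t0.74\n"
    "contig_3\t0\t0.75\n"
)
kept = detected_contigs(table)
print(kept)  # ['contig_1', 'contig_3']
```

The list returned by `detected_contigs` would then define the reference set for the final mapping in the last step.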

HTH, ben

smdabdoub commented 1 year ago

The dereplication/clustering step can be done with the procedure recommended by the CheckV developers (under the "Rapid genome clustering based on pairwise ANI" section). They recommend clustering at 95% ANI and 85% alignment fraction (AF).
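
As a rough illustration of what clustering at these thresholds means, here is a minimal Python sketch of greedy, longest-first centroid clustering. It is not the CheckV script itself. The function name `greedy_cluster` and the input format are assumptions: a dict of contig lengths, plus pairwise hits given as (contig_a, contig_b, ANI, AF) with both values in percent and AF taken relative to the shorter sequence.

```python
def greedy_cluster(lengths, hits, min_ani=95.0, min_af=85.0):
    """Greedy centroid clustering: the longest unassigned contig seeds each cluster.

    lengths: dict mapping contig name to length in bp.
    hits: iterable of (contig_a, contig_b, ani, af) tuples, values in percent.
    Returns a dict mapping each centroid to its members (centroid listed first).
    """
    # Link every pair of contigs that passes both thresholds.
    linked = {name: set() for name in lengths}
    for a, b, ani, af in hits:
        if a != b and ani >= min_ani and af >= min_af:
            linked[a].add(b)
            linked[b].add(a)

    clusters = {}
    assigned = set()
    # Visit contigs from longest to shortest; ties are broken by name.
    for name in sorted(lengths, key=lambda n: (-lengths[n], n)):
        if name in assigned:
            continue
        members = [name] + sorted(m for m in linked[name] if m not in assigned)
        assigned.update(members)
        clusters[name] = members
    return clusters


# Toy data: c1 and c2 pass both thresholds; c1 and c4 fail on ANI.
lengths = {"c1": 50000, "c2": 30000, "c3": 12000, "c4": 11000}
hits = [
    ("c1", "c2", 97.2, 90.0),
    ("c2", "c3", 96.0, 88.0),
    ("c1", "c4", 93.0, 99.0),
]
clusters = greedy_cluster(lengths, hits)
print(clusters)  # {'c1': ['c1', 'c2'], 'c3': ['c3'], 'c4': ['c4']}
```

In this toy run c3 ends up as its own centroid even though it links to c2, because c2 was already claimed by the longer c1. That is a property of the greedy approach, not an error.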

wwood / CoverM

Question about coverM settings (not an issue) #114