wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
309 stars 31 forks source link

unmapped reads of abundance output and mapped reads of `samtools flagstat` is quite different #60

Closed juyanmei closed 3 years ago

juyanmei commented 3 years ago

Hi, I used this code to calculate abundance of genome "coverm genome -p bwa-mem --coupled demo.1.fq.gz demo.2.fq.gz --genome-fasta-directory ./dereplicated_genomes/ --bam-file-cache-directory ./ -o output.tsv", the final unmapped reads abundance is 85%, but this code "samtools flagstat coverm-genome.demo.1.fq.gz.bam" showed mapped reads is 92%, I am really confused.

head -2 output.tsv Genome demo.1.fq.gz Relative Abundance (%) unmapped 84.77992

samtools flagstat coverm-genome.demo.1.fq.gz.bam 4818489 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 732129 + 0 supplementary 0 + 0 duplicates 4428318 + 0 mapped (91.90% : N/A) 4086360 + 0 paired in sequencing 2043180 + 0 read1 2043180 + 0 read2 2774868 + 0 properly paired (67.91% : N/A) 3657264 + 0 with itself and mate mapped 38925 + 0 singletons (0.95% : N/A) 821120 + 0 with mate mapped to a different chr 362154 + 0 with mate mapped to a different chr (mapQ>=5)

wwood commented 3 years ago

Hi,

That is rather confusing. Would you be able to transfer the BAM file to me (it will include the reads)?

Also, just to check you are using CoverM 0.6.1?

Thanks.

juyanmei commented 3 years ago

Hi, Thanks for your replay. I am using both 0.6.1 and 0.6.0, it's basically equal. I'll email you the BAM file.

wwood commented 3 years ago

Thanks for sending the files.

Looks like a simple case of min-covered-fraction kicking in, in particular RDPYD19080571_A-RDPYD19080571_A.metaspades.maxbin2_dastools.bin.001_sub has a coverage of 29.412193 but only 0.022369059 (ie. 2.2%) of its genome covered. Lots more mappings are observed when you rerun genome with --min-covered-fraction 0.

The cached BAM file is the cache of all the mappings observed - mappings effectively classed as 'spurious' like those above (and those that fail --min-read-aligned-percent etc.) are not removed from the BAM file after the mapping has taken place. This is helpful so rerunning with different thresholds is fast, but admittedly somewhat confusing. I've added some detail about this is the help in response.

Make sense?

juyanmei commented 3 years ago

Sorry for the late reply. I got it. The difference between unmapped reads of abundance output and mapped reads of samtools flagstat because of the defalut threshold --min-covered-fraction.