wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
309 stars 31 forks source link

Trimmed Mean larger than Mean #15

Closed Thexiyang closed 5 years ago

Thexiyang commented 5 years ago

Thanks for this very nice tool. I have a question regarding the output. In my results, there is one bin showing quite different values between the Mean and Trimmed Mean (see below). In my understanding, Tirmmed Mean values should be less than Mean values. Most other bin follow this rule with only a few exceptions. Is it right? Of these two parameters, which one more accurately represenst the relative abundance? Thanks!

Genome | Mean | Relative Abundance (%) | Trimmed Mean Bin1 | 106.6692 | 2.686203 | 2030.007

wwood commented 5 years ago

Hi, possible that trimmed mean > mean, because both the most covered and least covered positions are removed before recalculating the mean from the middle.

That does seem like a big difference though. What do you see when you run it with -m coverage_histogram ?

Thexiyang commented 5 years ago

Here it is. Any idea here after you look at it? Thanks! Bin1_coverage_histogram.txt

wwood commented 5 years ago

Hi again, Thanks for that - calculating the mean and trimmed mean manually in R gives me a different answer, suggesting there might be a bug. Are you able to send the BAM file that you used here please?

Also, how exactly did you run CoverM i.e. what command line arguments? To make reproducing easier.

Thanks.

Thexiyang commented 5 years ago

The file is too large to upload. I plan to send you an email with google-drive link to this email address: b.woodcroft near uq.edu.au. Is the address right? Thanks for checking!

wwood commented 5 years ago

Yes, that address will work. Thanks.

wwood commented 5 years ago

Hi, this turned out to be an unsigned integer overflow bug - I'll push a fix in a second to the master branch. For reference, the output of your command for me is now

$ cargo run --release -- genome  --bam-files combined.S3_1stq_filtered.bam --genome-fasta-directory ./ -x fa -t 12 -m mean relative_abundance trimmed_mean > S3_coverm
   Compiling coverm v0.3.0 (/home/ben/git/coverm-ben2)
    Finished release [optimized] target(s) in 12.04s
     Running `/home/ben/git/coverm-ben2/target/release/coverm genome --bam-files combined.S3_1stq_filtered.bam --genome-fasta-directory ./ -x fa -t 12 -m mean relative_abundance trimmed_mean`
[2019-11-14T23:28:56Z INFO  coverm] CoverM version 0.3.0
[2019-11-14T23:28:56Z INFO  coverm] Using min-covered-fraction 10%
[2019-11-14T23:28:56Z INFO  coverm] Not using directory entry './small.bam' as a genome FASTA file, as it does not end with the extension 'fa'
[2019-11-14T23:28:56Z INFO  coverm] Not using directory entry './combined.S3_1stq_filtered.bam.bai' as a genome FASTA file, as it does not end with the extension 'fa'
[2019-11-14T23:28:56Z INFO  coverm] Not using directory entry './combined.S3_1stq_filtered.bam' as a genome FASTA file, as it does not end with the extension 'fa'
[2019-11-14T23:28:56Z INFO  coverm] Not using directory entry './S3_coverm' as a genome FASTA file
[2019-11-14T23:28:56Z INFO  coverm] Not using directory entry './old.bai' as a genome FASTA file, as it does not end with the extension 'fa'
[2019-11-14T23:28:56Z INFO  coverm] Reading contig names for 1 genomes ..
[2019-11-14T23:28:56Z INFO  coverm::genome] Of 151703 reference IDs, 117 were assigned to a genome and 151586 were not
[2019-11-14T23:30:32Z INFO  coverm::genome] In sample 'combined.S3_1stq_filtered', found 31465837 reads mapped out of 79089524 total (39.79%)
Genome  combined.S3_1stq_filtered Mean  combined.S3_1stq_filtered Relative Abundance (%)    combined.S3_1stq_filtered Trimmed Mean
unmapped    NA  60.214912   NA
S3_bin4 2026.2925   39.78509    2030.0072

Thanks for the clear bug report.

Thexiyang commented 5 years ago

Thanks a lot! Would it possible to also update the bioconda version to 0.3.1?

wwood commented 5 years ago

https://github.com/bioconda/bioconda-recipes/pull/18661