wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
273 stars 30 forks source link

CoverM reports near 0 reads mapped despite almost all reads seemingly being mapped #165

Closed greenmna closed 1 year ago

greenmna commented 1 year ago

Greetings!

I have been using CoverM to calculate coverage from assemblies of simulated nanopore sequence data. These data were generated using the tools NanoSim and have simulated errors in them and variable abundance of the organisms that make it up.

I have also simulated datasets the same way without errors and have had no issue with CoverM reporting significant read mapping and subsequently giving coverage estimations.

Before using CoverM, I aligned my reads back to the assembly with minimap2 version 2.24. I then convert the output to a BAM file and sort that using samtools using 1.16. I check the % reads mapped and it was at 99.986%.

My question, or I guess plea for help, is why would CoverM think that almost no reads have mapped when the results should suggest otherwise?

The code I use for running CoverM is as follows coverm contig -m metabat -b <sorted-alignment>.bam -p minimap2-ont -t 16 --min-covered-fraction 0 -o <coverage-file>.cov

Any help with potentially identifying where the issue might be in this would be much appreciated. I am not sure if all the information given is enough to figure out the problem (if it is with CoverM), but can provide more if needed.

I will also be including the results of samtools stat on the sorted BAM file that was the input to CoverM

samtools_stat.txt

Thank you and best regards!

Noah

wwood commented 1 year ago

Hello!

I think this is because the metabat output defines some alignment thresholds and then all the reads are failing them? These are described in the output, and can be changed by specifying more parameters on the command line.

That it?

greenmna commented 1 year ago

Thank you for such a fast reply!

You are absolutely correct. After I went back to look at how MetaBAT "binned" all my contigs, it named the file as _tooLowDepth.fa.

I am guessing that the errors in my simulated nanopore data may be such that the prevalence of INDELs is ruining everything since MetaBAT doesn't factor those regions into coverage calculation. I'm betting if I change the --percentIdentity argument to something below 97 I'd get my results to look more like what one would expect based on the mapping results prior. I went back and ran CoverM using the default -m mean and it perfectly matched the reads mapped at 99.98%.

If I may ask one more question, is using either mean, metabat, or any other mode more preferable? Or does it ultimately come down to user preference/datasets?

Thank you very much, once again, for such a rapid replay!

wwood commented 1 year ago

No problem.

Try looking at or just using aviary? https://github.com/rhysnewell/aviary

-------------- Ben Woodcroft Group leader, Centre for Microbiome Research, QUT


From: greenmna @.> Sent: Tuesday, May 16, 2023 7:46:26 AM To: wwood/CoverM @.> Cc: Ben J Woodcroft @.>; Comment @.> Subject: Re: [wwood/CoverM] CoverM reports near 0 reads mapped despite almost all reads seemingly being mapped (Issue #165)

Thank you for such a fast reply!

You are absolutely correct. After I went back to look at how MetaBAT "binned" all my contigs, it named the file as _tooLowDepth.fa.

I am guessing that the errors in my simulated nanopore data may be such that the prevalence of INDELs is ruining everything since MetaBAT doesn't factor those regions into coverage calculation. I'm betting if I change the --percentIdentity argument to something below 97 I'd get my results to look more like what one would expect based on the mapping results prior. I went back and ran CoverM using the default -m mean and it perfectly matched the reads mapped at 99.98%.

If I may ask one more question, is using either mean, metabat, or any other mode more preferable? Or does it ultimately come down to user preference/datasets?

Thank you very much, once again, for such a rapid replay!

― Reply to this email directly, view it on GitHubhttps://github.com/wwood/CoverM/issues/165#issuecomment-1548637550, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAADX5H4FJC7EQRUXQ7QGULXGKP3FANCNFSM6AAAAAAYCOP444. You are receiving this because you commented.Message ID: @.***>

greenmna commented 1 year ago

I will look into it!

Thanks again!

rhysnewell commented 1 year ago

Aviary will give you both long and short read coverage files, but if you are only looking to get the coverages and not run the whole pipeline (and are feeling more experimental), you could also use the ISS-36 branch of rosella to just get coverage results and ignore the binning component for now and take coverages.tsv file it produces. You can give it bam files or raw reads as well. The output will be in the metabat format, so it will be compatible with other binners but shouldn't throw away long read alignments