wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
275 stars 30 forks source link

No found reference sequences when headers look the same #86

Closed shafferm closed 2 years ago

shafferm commented 2 years ago

Hello @wwood,

I am using coverm (version coverm 0.6.1 from bioconda) to get average coverage for some metagenome assemblies to deposit to NCBI. The command I am running is coverm genome -v -m mean -t 15 --bam-files /home/projects-wrighton/NIH_Salmonella/Salmonella/Metagenomes/Megahit/LM_megahit/LM.mapped.sorted.bam -f /home/projects-wrighton/NIH_Salmonella/Salmonella/Metagenomes/Megahit/LM_megahit/final.contigs.2500.fa. When I run this error:

[2021-08-30T21:24:43Z INFO  coverm::genome] Of 18649 reference IDs, 0 were assigned to a genome and 18649 were not
[2021-08-30T21:24:43Z ERROR coverm::genome] Error: There are no found reference sequences that are a part of a genome

The mappings were generated from this reference using bbmap. To dig into this I grabbed the headers from the fasta file using grep (uploaded here: LM_headers.txt) and the headers from the bam file using idxstats (uploaded here: LM.mapped.sorted.idxstats.txt). When I look between them it looks like the headers match. Am I missing something here? Is something in the headers making the matching break?

Thanks, Mike

wwood commented 2 years ago

Hi Mike,

Strange. Thanks for the report. Are you able to send a cut down BAM file and cut down fasta file so I can test? I suspect it might have something to do with the spaces in the names, but not sure yet. Maybe just email me. Thanks.

shafferm commented 2 years ago

Hey Ben,

Here is a link to the bam file: https://www.dropbox.com/s/tbuwnzv5ywb7jtn/LM.mapped.sorted.subsample001.bam?dl=0 And here is the fasta file: https://www.dropbox.com/s/qjen3gr04oclcuh/LM.subsample001.contigs.fa.gz?dl=0

Happy to move this to email or share full files somewhere else if that is easier. I subsampled to 1% of reads for the bam file and in the fasta file I filtered out contigs with no hits in the subsampled bam. So there are headers in the bam that aren't in the fasta. Hope that isn't a problem.

These are contigs generated from megahit which always has those spaces. I didn't run into this problem with contigs from idba_ud which don't have spaces.

wwood commented 2 years ago

Hey Mike, it is the end of the week and I haven't gotten to this. A workaround might be putting the sequence names as input to --genome-definition rather than using -f?

shafferm commented 2 years ago

No problem not getting to this yet. I tried using --genome-definition but you can't use -r and --bam-files at the same time and I'd like to be able to use those mappings.

wwood commented 2 years ago

Hmm, sorry, confused. You can never use -r and --bam-files together at the same time, because coverm either does the mapping or uses a bam as input, not both. I'm not understanding something?

wwood commented 2 years ago

Hey sorry about the above comment, not sure what I was thinking. Will get back to you.

wwood commented 2 years ago

Hi Mike,

I "fixed" this by adding a new flag --use-full-contig-names to the dev branch:

$ cargo run -- genome -b LM.mapped.sorted.subsample001.bam --genome-fasta-files LM.subsample001.contigs.fa --use-full-contig-names
   Compiling coverm v0.6.1 (/home/ben/ownCloud/git/coverm)
    Finished dev [unoptimized + debuginfo] target(s) in 3.80s
     Running `/home/ben/ownCloud/git/coverm/target/debug/coverm genome -b LM.mapped.sorted.subsample001.bam --genome-fasta-files LM.subsample001.contigs.fa --use-full-contig-names`
[2021-09-09T06:52:53Z INFO  coverm] CoverM version 0.6.1
[2021-09-09T06:52:53Z INFO  coverm] Using min-covered-fraction 10%
[2021-09-09T06:52:53Z INFO  coverm::genome] Of 18649 reference IDs, 11050 were assigned to a genome and 7599 were not
[2021-09-09T06:53:02Z INFO  coverm::genome] In sample 'LM.mapped.sorted.subsample001', found 375265 reads mapped out of 441326 total (85.03%)
Genome  LM.mapped.sorted.subsample001 Relative Abundance (%)
unmapped    14.968753
LM.subsample001.contigs 85.03125

Hope that settles it? Thanks for the report. I hope to release a new version soon, but not entirely sure when - let me know if you need a static compile in the meantime.