wwood / CoverM

Read coverage calculator for metagenomics
GNU General Public License v3.0
273 stars 30 forks source link

Feature request: Handle secondary alignments and new release #164

Closed jakobnissen closed 1 year ago

jakobnissen commented 1 year ago

Dear CoverM developers,

We're happy users of CoverM, which we use in our tool Vamb. For our application, we need to estimate abundance of contigs from BAM files, where the reference mapped against may contain multiple near-identical contigs (from closely related strains). Hence, it's important for us that CoverM is able to handle situations when a read maps to multiple contigs as they inevitably will.

There have been several issues regarding secondary or supplementary alignments on this repo (#152, #93, #69), but IMO they are not precisely what I'm asking for in this issue here. I'm interested in secondary (flags 0x100) as opposed to supplementary (flags 0x900) alignments. Presumably, if a read maps to 10 contigs above the mapping threshold, we can't accurately distinguish which of the 10 contigs the read really belongs to. In that case, it may be more prudent to either assign it randomly, or, preferably, to add a depth of 0.1 to each covered base (I think? Not sure).

I realize this might be difficult to handle in practice - in order to memory efficiently count coverages across the reference sequences individual base pairs, the BAM file must be sorted by coordinate. However, in order to determine how many alignments a read has, it needs to be sorted by read. It might be necessary to read the file in two passes to do this: Once to get the number of alignments for each read that pass the filtering step, after which the result can be stored in a Vec<u8> for number of reads and a bitvector for whether the alignment passed the filtering criteria.

What are your thoughts on this? Is this feature within the scope of CoverM, or would be it better to go elsewhere for contig abundance estimation under ambiguous read mapping?

wwood commented 1 year ago

Hi,

Thanks for the kind words, glad it is useful.

Without answering your question fully, I'm wondering whether this is a practical or just theoretical issue. The BAM format states that exactly 1 mapping for each (mapped) read has to marked as primary. So mapping algorithms tend to use your option (A) there i.e. assign one as a primary alignment.

Obviously non-primary alignments hold information, but are you seeing some specific case where it is definitely the root cause?

Thanks, ben

jakobnissen commented 1 year ago

Dear @wwood

We have now investigated more closely, and it appears that at least when using minimap2 to map short reads, there is no material difference in the relative abundances between contigs reported by CoverM when including up to 20 secondary hits using CoverM, and having CoverM ignore all the non-primary hits, respectively. The ratio remains more or less the same despite the fact that the presence of secondary reads cause a large difference in the absolute values of the CoverM output abundances. This is probably because, in the face of multiple equally good alignments, minimap2 assigns the primary alignment randomly. Hence, when contigs are long and depth is high, the large number of reads per contig means that the sampling effect from minimap2 randomly selecting a subset of alignments as being primary alignments is small, and the computed depth is proportial to the true depth.

Notably, this might not be the case for other mappers than minimap2, nor necessarily with contigs created by fewer, longer reads, where the sampling effect is significant. It may also not necessarily be the case when you have multiple reads of differing identity - in our case, we had lots of hits with 100% identity. Nonetheless, I'm closing the issue.

wwood commented 1 year ago

Thanks for having a deep look @jakobnissen

When you say no material difference, do you mean specifically for VAMB?

Definitely think that secondary alignments have their uses - would be good to include them for a better relative_abundance calculation for instance.

jakobnissen commented 1 year ago

Yes, to go a little more in detail, Vamb (and other binners) derive much of their signal from co-abundance. I wanted to investigate how multimapping reads might throw off co-abundance e.g. if aligners aggregated multimapping reads to a few select references, skewing the abundance signal. We can compute the co-abundance signal by measuring Pearson correlation of abundances across all samples for two contigs, and then we find that the presence or absence of secondary alignments in the BAM files makes little (essentially, no) difference to the co-abundance signal, when the abundances are computed by CoverM. To trick CoverM to not ignore secondary alignments, we've simply unset the "secondary alignment" flag for all the secondary alignments in the BAM file.

This is despite the actual abundance values changing quite a bit. So what I think is happening is that minimap2, by randomly assigning one of multiple alignments to be primary if they have the same score, and CoverM only measuring the primary alignments, it works effectively like a random sub-sampling of the alignments.

ilnamkang commented 1 year ago

Dear @jakobnissen

I'd also like to trick CoverM not to ignore secondary alignments.

Would you let me know how I can unset the "secondary alignment" flag in the BAM file?

Thanks.