zavolanlab / mirflowz

Snakemake workflow for the mapping and quantification of miRNAs and isomiRs from miRNA-Seq libraries.
MIT License
6 stars 0 forks source link

Filtering multimapper alignments to minimize indels #25

Closed deliaBlue closed 1 year ago

deliaBlue commented 1 year ago

In isomiRs, mismatches are much more likely than indels. For this reason, after mapping the reads, if there are multiple alignments of the same read with the same edit distance, we will keep only the ones with the lower number of indels.

MIND that this does not imply that there will be no indels at all as we may see a read mapping to different regions with only inidels in all of them. In this situation, we will keep the one with the smallest edit distance.

uniqueg commented 1 year ago

Additional details from: https://github.com/zavolanlab/mirflowz/issues/7#issuecomment-1466952252


After mapping, we would indeed first need to have one pass over the alignments and filter them according to the rules you describe, i.e., keep from "multimappers" (same read, multiple alignments with same edit distance) only those alignments with the lowest number of indels (so indels could still remain!). This will address a major source of ambiguities described in #6, based on the assumption (as you correctly said) that mismatches are much more likely in isomiRs than indels.

deliaBlue commented 1 year ago

This issue must be assessed in the map subworkflow after the uncollapse_reads rule. There the file uncollapsedMappings.sam will be scanned for multimappers. It is important to note that this file has been already modified by deleting duplicate entries (i.e. same QNAME. FLAG, RNAME. POS and CIGAR), keeping records with the smallest distance and same QNAME while keeping multimappers.

Whenever a multimapper is found, the CIGAR string will be compared to select the entry with less indels - favoring Ms over Is and Ds. It can occur that a multimapper has two entries with the same amount of indels and edit distance; in this case, both will be kept.

uniqueg commented 1 year ago

Why you wanna do it after uncollapsing?

deliaBlue commented 1 year ago

I thought of doing it after uncollapsing because is the last processing step of the mappings, but it is maybe more reasonable to do it before as this way, identical mappings are processed as one instead of multiple times.

uniqueg commented 1 year ago

Exactly :)

deliaBlue commented 1 year ago

Suppose there's a sequence with the following mapping results:

21-1    0   19  128118  255 9M1D12M *   0   0   CTGACATCAGCGATTCTCCTG   *   MD:Z:3G1T3^A12  NH:i:6  NM:i:3  XA:Z:Q  XI:i:3
21-2    0   19  148680  255 4M1I2M1D2M1D12M *   0   0   CTGACATCAGCGATTCTCCTG   *   MD:Z:6^T2^A12   NH:i:6  NM:i:3  XA:Z:Q  XI:i:1
21-3    0   19  173776  255 4M1I3M1D13M *   0   0   CTGACATCAGCGATTCTCCTG   *   MD:Z:1G5^T13    NH:i:6  NM:i:3  XA:Z:Q  XI:i:2
21-4    16  19  264953  255 13M1D8M *   0   0   CAGGAGAATCGCTGATGTCAG   *   MD:Z:13^T2A0C4  NH:i:6  NM:i:3  XA:Z:Q  XI:i:4
21-5    16  19  272523  255 13M1D2M1I3M1D2M *   0   0   CAGGAGAATCGCTGATGTCAG   *   MD:Z:13^T5^C2   NH:i:6  NM:i:3  XA:Z:Q  XI:i:5
21-6    0   19  442347  255 21M *   0   0   CTGACATCAGCGATTCTCCTG   *   MD:Z:3G0T1C14   NH:i:6  NM:i:3  XA:Z:Q  XI:i:0

The count goes as follows:

seq M D/I
21.1 21 1
21.2 20 3
21.3 20 2
21.4 21 1
21.5 20 3
21.6 21 0

How many alignments should be removed?

We could drop 5 of them and only keep the one with 21M and 0 D/I or allow for one D/I keeping 21.1, 21.4 and 21.6. We could also decide to keep half of the multimappers (in this case, to keep 3 out of 6) and in case of having an odd number of multimappers lower-round the partition (i.e if having 5, keeping only the best 2).

I personally believe that dropping all except for one removes a lot of potential data but I could be wrong. In any case, it would be nice to define the number of multimappers to keep/drop in cases like that either from the amount of alignments we have (i.e keep half of them), by not allowing more than a certain number of indels (i.e no more than one) or a combination of both.

Also, as we do not know which are mapping within a pre-miR loci, would it be better to handle multimappers after the intersection is made so we do not discard possible isomiRs while keeping alignments that do not intersect any pre-miR loci? (Maybe I'm confusing things with this question but I think that anyways alignments outside pre-miRs loci do not count nor fraction the final results while multimappers actually mapping a pre-miR fraction the total count)

uniqueg commented 1 year ago

Any given read (that is not spurious) comes from exactly one genomic location, no more, no less. When we speak of "multimappers", what we mean to say is: "Given all the information we have, we can't say with confidence that this read comes from this location rather than from that location". More concretely, what we typically say is "These alignments have the same edit distance, so the probability that the read originates from any one of them is the same as that it originates from any of the other ones".

But in this view, we are solely basing our information on edit distance alone, assuming that an alignment with edit distance $n$ is infinitely more likely to represent the real origin of the read than any alignment with an edit distance $>n$. But in reality, the probability that an alignment with an edit distance $<n$ is of course not zero, and the probability that a read originates from the locus (or one of multiple loci) corresponding to an alignment (or alignments) with edit distance $n$ is not 1. But since we have no good way of estimating the actual probabilities and since we assume that the probability of alignments with edit distances $<n$ to point us to the correct loci is small compared to those with edit distances of $n$, we make the simplification of completely disregarding the latter, knowing that while in a few cases read assignments to loci will be wrong, but in most cases they will be right. In return, we gain the advantage of the assignment becoming simpler, without us having to assign probabilities to alignments with bigger edit distances, which given the lack of precise information to inform such probabilities, would almost certainly also be wrong in some cases - and without us being able to tell how wrong either of these approaches may be.

Note that this is the case for every read, "multimapper" or not, because we set up the aligners such that only the best alignments (by edit distance are reported). For example, a read may match to a certain genomic locus perfectly, but there may be five alignments to other genomic loci with an edit distance of one. But we will not know that, because our aligner wouldn't report these, and so we'd be comfortable in our "knowledge" that we're not dealing with a "multimapper" and happily assign that one locus to our read. Nevertheless, it is well possible that the actual origin of the read is one of those five other genomic loci, or even one that matches even less well with the read sequence, for example because of sequencing errors (which are frequent enough).

Again, we live with that simplification because in most cases, our assignment will be correct. And for those remaining alignments with the same best edit distance, we apply fractional counting.

Now, any approach to actually improve on this simplification requires additional information or at least informed assumptions. And this is exactly what we are trying to do here: We know from literature that isomiRs are more likely to represent shifted/extended or truncated versions of canonical miRs, but not really species with insertions or deletions. There may be cases of the latter, but they will be rare compared to the former modifications. How much more rare? We don't know for sure. But given that we are not sure, it also doesn't make sense to put a number on that ratio and use it to assign probabilities to alignments representing the real thing.

From that perspective, I don't think that keeping a configurable number $i$ alignments for each multimapper is not informed by evidence. By settingg this to 2, for example, you would be sure to throw half a count for the real origin away, every single time. I think it's better to say in our documentation that we assume indels to be rare in isomiRs, and therefore you'll only find them in the results if an indel-containing alignment is better (by edit distance) than any non-indel-containing alignment. I think that is a simple rule that people understand and can work into their interpretations, e.g., when designing experiments to study individual isomiRs that were identified as being common through our workflow.

Filtering after intersecting alignments with annotations is a better idea, as we could argue that given that we are expecting miRNA-seq libraries, miRNAs are more likely to be the true origins of reads compared to non-miRNA-containing loci. However, again we would do this only for one type of filtering (indel), but not the other (edit distance). Also, the quality of miRNA-seq libraries is variable, and we may well have large proportions of reads in them that do not originate from miRNAs. So how to account for this without introducing an unknown bias towards counting every possible read as originating from miRNAs, even if they do not? This is especially problematic if you compare multiple samples and each of which has a different level of contamination with non-miRNA reads. Again, I think the better solution is to keep it simple, with as little assumptions as possible to yield stable results with inaccuracies that we deem small enough to still be useful.

Now, if you were interested in basing the counting on a probabilistic model that includes multiple parameters that you may be able to estimate through some clever experiments that you design, that would be a totally valid approach that would likely result in an increased accuracy of the pipeline.

However, this would then rather become a project of the volume to write a PhD thesis about (or at least a good chunk of it). Also, to assess whether it's worth the costs and efforts, you would first need to try to answer these questions: How important are isomiRs? How important are isomiRs with indels/deletions? How important are other miRNA species that we may underrepresent (or overrepresent) through our pipeline? And how realistic is it that your approach would significantly improve upon their quantification? In other words, is the expected impact worth the trouble?

I'd say it's hard to say, and even harder, if we don't even have a working workflow, however flawed it may be in some aspects, that we could use as a baseline to measure potential improvements against.

So very long story short: keep only those best alignments with the minimum number of indels, and if there is still more than one, partially count towards the loci represented by those alignments. So in your example, keep only 21.6 with zero indels.

deliaBlue commented 1 year ago

When filtering for mulimappers only keep one, copy that!

Regarding when to perform this filtering, I cannot see why will we be missing the edit distance filtering; the idea is to keep all the filtering processes as they are now (so at the end we will still have the ones with the best edit distance) but before applying this last filter intersect the alignments. For instance, with the library I am working on, there are 4 alignments with multimappers (including the one I provide as an example) and non of them appear in the intersection file. Won't it be more efficient then, to filter after intersecting and before the tabulation?

Of course I understand the bias on doing so but on the other hand, for huge libraries, not having to scan a huge file to filter alignments that we won't be even count at the end isn't better? Maybe I'm wrong, just pointing this out as I think that the trade off is not that bad. Anyways, by the moment I will filter before uncollapsing the reads and if you finally decide to make the filtering after the intersection I will move it.

uniqueg commented 1 year ago

It seems you didn't understand me. It doesn't have anything to do with efficiency. It's about the assumption that a read is somehow a priori more likely to originate from a miRNA than from a non-miRNA loci. If that were always the case, then we wouldn't even need to map reads to the genome/transcriptome, but rather map against the miRNAs only. But our results would be waaay different, and I'm pretty sure they would be less accurate, with considerable overestimation of miRNA counts. This is also why we did not filter for "best" alignments after intersection either.

So yes, please do the filtering before uncollapsing and counting.