williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments

Missing genes in IRFinder-IR #190

Open Nathan-Lecouvreur opened 2 months ago

Nathan-Lecouvreur commented 2 months ago

Hi,

I have been using IRFinder for a while now, and it has been very useful for detecting intron retention events in our sequencing data. Recently, however, while looking at a BAM file in IGV, we noticed a gene with an obvious intron retention visible in the raw reads. When we looked for its IRratio, the gene was missing from the list.

So I dug a little to find when the gene was removed and I came to the conclusion that it was removed according to the exclude.omnidirectional.bed file.

Here is an IGV view of the exclude BED files, Umap mappability, genes, and repeats around the gene that is absent from the IRFinder-IR files (GABBR1):

image

The region is clearly removed because of omnidirectional.bed, which seems to be linked to mapabilityExclusion.bed, so I assume a mappability issue is responsible. However, the mappability track shows no major issue that could explain masking the whole region.

Here is an example of another region where omnidirectional.bed, MapabilityExclusion.bed, and the actual mappability track match almost perfectly.

image

These large masked regions are highly represented on chromosome 6 and mask a lot of genes. Are these discrepancies simply due to the fact that IRFinder computes its own mappability to produce the mask, or is there another reason why such large regions are masked for mappability reasons without any major mappability issue?
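To put a number on how much of a region the mask covers, something like the following could be used. It is only a minimal sketch: the function name `masked_fraction` and the toy intervals are made up for illustration (they are not real IRFinder output or real GABBR1 coordinates), and the only assumption about the files is the standard BED layout (chrom, 0-based start, end-exclusive end).

```python
def masked_fraction(bed_lines, chrom, start, end):
    """Fraction of the half-open region [start, end) on `chrom` covered by BED intervals."""
    clipped = []
    for line in bed_lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] != chrom:
            continue
        # Clip each interval to the region of interest.
        s, e = max(int(fields[1]), start), min(int(fields[2]), end)
        if s < e:
            clipped.append((s, e))

    # Sweep over the sorted intervals so overlapping bases are counted once.
    covered = 0
    cur_end = start
    for s, e in sorted(clipped):
        s = max(s, cur_end)
        if e > s:
            covered += e - s
            cur_end = e
    return covered / (end - start)


# Toy data only: hypothetical intervals, not taken from any real exclusion file.
toy_bed = [
    "chr6\t100\t200",
    "chr6\t150\t300",   # overlaps the previous interval
    "chr6\t900\t1000",
    "chr1\t0\t5000",    # different chromosome, ignored
]
print(masked_fraction(toy_bed, "chr6", 0, 1000))
```

With a real file, the lines could come from `open(path)` (or `gzip.open(path, "rt")` if the BED is compressed), and the region would be the gene or intron coordinates read off IGV.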

Some details on how I used IRFinder :

To sum up, I am trying to understand whether there is a reason behind these huge masked regions that do not have bad mappability, and whether there is a way to make the mappability masking more precise to avoid losing important genes.

Thank you in advance for the time you will dedicate to my question !

dg520 commented 2 months ago

@Nathan-Lecouvreur
Thanks for investigating the source code and digging to the bottom of it. You're right that the intron you are interested in is excluded. The exclusion decision draws on multiple sources, including:

  1. Mappability, as you found
  2. Other expressed features, such as miRNA
  3. Other blacklisted components, if provided
  4. IRFinder takes the original intron length and subtracts the total length covered by 1-3 above. The remaining "clean" intronic length must pass the hard-coded thresholds in bin/util/IntronExclusion.pl: basically, >=40bp and >=70% of the original intron length. Technically, you can try changing those two values and re-running the IRFinder reference preparation to see whether that rescues the intron. A heads-up, though: this might increase the overall false positives in the results, and if you are also doing differential IR analysis, you should think twice about whether the default p-value adjustment approach needs to change. Biologically, is that intronic pattern true intron retention, or is it due to the expression of other features? That is the question to ask before hacking the code or deciding what new cutoffs would be reasonable.
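The length rule described above can be sketched roughly as follows. This is not the actual code in bin/util/IntronExclusion.pl, only an illustration of the two cutoffs as stated here; the function and constant names are hypothetical, and the excluded length is assumed to be already merged, so overlapping excluded features are not counted twice.

```python
# Cutoffs as described in this comment; the names are illustrative only.
MIN_CLEAN_BP = 40
MIN_CLEAN_PERCENT = 70


def intron_is_kept(intron_length, excluded_bp,
                   min_bp=MIN_CLEAN_BP, min_percent=MIN_CLEAN_PERCENT):
    """Return True if the "clean" part of an intron passes both cutoffs.

    intron_length: original intron length in bp
    excluded_bp:   bp of the intron covered by excluded features
                   (assumed already merged, so no base is counted twice)
    """
    clean = intron_length - excluded_bp
    # Integer arithmetic avoids floating-point surprises right at the 70% boundary.
    return clean >= min_bp and clean * 100 >= min_percent * intron_length


print(intron_is_kept(1000, 250))  # 750 bp clean, 75% of the intron
print(intron_is_kept(1000, 350))  # 650 bp clean, 65% of the intron
print(intron_is_kept(50, 12))     # 38 bp clean, below the 40 bp minimum
```

Passing different `min_bp` / `min_percent` values shows which introns would be rescued by relaxed cutoffs before committing to a full rebuild of the reference.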

If you still have follow-up questions, could you please repost them on the new IRFinder maintenance site listed on this repository's front page? This site will no longer be maintained. Thank you!