williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments
53 stars 25 forks source link

missing intron events? #48

Closed hliang closed 6 years ago

hliang commented 6 years ago

Hi, I'm studying the intron retention ratio of a complex gene. It has 40+ transcript isoforms in the ensembl annotation (hg38). IRFinder finished without any error. In the result (IRFinder-IR-dir.txt), some of the introns are present, but a few are missing.

I think IRFinder makes the decision which introns to keep, when it builds the reference. Maybe IRFinder gets confused by so many isoforms. Would you explain why some introns are excluded?

Any suggestion if I really want to get the IR ratio of some missing introns? I'm thinking of building a gtf with a down-sized gtf file with less isoforms, maybe that makes IRFinder less confused. But I'm not sure which isoforms to include, because I don't know how IRFinder picks the good introns.

Thank you.

dg520 commented 6 years ago

Hi @hliang,

The intron exclusion step is carried out by IRFinder/bin/util/Build-BED-refs.sh. Introns are passed through a sequential cutoffs of criteria, for example, if the entire intron is covered by an exon annotation or if the overall mapability across the entire intronic region is too low. In order to trace at which stage your introns of interest are excluded, do the following: 1) Comment out the second last line in the script, like: # rm tmp.50 tmp-dir.IntronCover.bed tmp-nd.IntronCover.bed tmp.read-continues tmp.candidate.introns tmp.exons.exclude tmp.all.annotations tmp.reversed.genes tmp.ROI.rRNA.bed tmp.ROI.combined.bed. 2) Then re-run this script from top to bottom (some basic environment variables setting required) and you should get all the intermediate files during the reference building step. 3) Please investigate each of these files with exclude in its file name, to see which one (can be multiple) contain your introns of interest. 4) You can look back into the script IRFinder/bin/util/Build-BED-refs.sh to figure out which cutoff/stage actually generates the corresponding files. If you are not sure about this or want to know more about the cutoffs leading to your intron missing, please let me know what your found at Step 3 so that we can further nail them down.

Best, Dadi

hliang commented 6 years ago

Hi Dadi,

I ran IRFinder/bin/util/Build-BED-refs.sh as you suggested, and checked the output files. I put them in IGV for visualization. Input transcripts.gtf looks like: transcripts gtf exclude.directional.bed and tmp.exons.exclude look like: exclude The exclude.omnidirectional.bed is empty for these region. I thought introns of interest were excluded because the adjacent exons were excluded, but soon realized that was not the case when I checked other genes (with simpler gene model structure). simple_gene_dsg2 This gene's introns actually show up in IRFinder results IRFinder-IR-dir.txt.

exclude.directional.bed and tmp.exons.exclude seem to always have exons of the gene, but the adjacent introns are not excluded. Is there other files/parameters to check?

I appreciate if you could help me better understand the missing introns.

-Liang

dg520 commented 6 years ago

Hi @hliang ,

Nice update! Now it's much clearer what happened. First of all, the files with exclude mark where the genomic region should not be considered as "bona fide" introns. So exonic regions are surely there in the files. Other criteria such as low-mappability regions are added to the exclusion category as well (so will those regions defined in the black list that is passed to IRFinder). As we can obviously see from your IGV graph, the two introns of your interest have large chunks that would be masked out by IRFinder. The remaining "good" part is around 50% and 60% of the original length respectively.

In the script IRFinder/bin/util/IntronExclusion.pl, which is called by IRFinder/bin/util/Build-BED-refs.sh, there is a default cutoff setting for how long the minimal remaining part should be to let the intron to be kept. It's defined as minimal length of 40 bp and minimal ratio of 70% of the original intronic length, in the line: if ($newlen > 40 && ($newlen/$len) >= 0.7) {

Here is the solution specific to your case: 1) First check if your introns of interest are contained in the file tmp.candidate.introns. They are unlikely to be missing there, otherwise we have to check the GTF/GFF file to begin with. 2) You can change the above cutoffs to fit your intron of interest and re-run IRFinder/bin/util/Build-BED-refs.sh. Now you are supposed to find them in ref-cover.bed and the 4th column is NOT skip .

Best, Dadi

hliang commented 6 years ago

Dadi, After I changed the threshold of ($newlen/$len) in IRFinder/bin/util/IntronExclusion.pl from 0.7 to 0.4, and reran Build-BED-refs.sh, the introns are now present in ref-cover.bed. See the IGV screenshot below, the introns of interest are marked by arrows. I include ref-cover.bed from default setting (upper panel) and modified setting (lower panel) for comparison. ref-cover bed Thanks for your great help!

dg520 commented 6 years ago

No problem. I hope IRFinder is useful to your project. I'll mark this discussion as closed.