Closed hliang closed 6 years ago
Hi @hliang,
The intron exclusion step is carried out by IRFinder/bin/util/Build-BED-refs.sh
. Introns are passed through a sequential cutoffs of criteria, for example, if the entire intron is covered by an exon annotation or if the overall mapability across the entire intronic region is too low. In order to trace at which stage your introns of interest are excluded, do the following:
1) Comment out the second last line in the script, like:
# rm tmp.50 tmp-dir.IntronCover.bed tmp-nd.IntronCover.bed tmp.read-continues tmp.candidate.introns tmp.exons.exclude tmp.all.annotations tmp.reversed.genes tmp.ROI.rRNA.bed tmp.ROI.combined.bed
.
2) Then re-run this script from top to bottom (some basic environment variables setting required) and you should get all the intermediate files during the reference building step.
3) Please investigate each of these files with exclude
in its file name, to see which one (can be multiple) contain your introns of interest.
4) You can look back into the script IRFinder/bin/util/Build-BED-refs.sh
to figure out which cutoff/stage actually generates the corresponding files. If you are not sure about this or want to know more about the cutoffs leading to your intron missing, please let me know what your found at Step 3 so that we can further nail them down.
Best, Dadi
Hi Dadi,
I ran IRFinder/bin/util/Build-BED-refs.sh
as you suggested, and checked the output files. I put them in IGV for visualization.
Input transcripts.gtf
looks like:
exclude.directional.bed
and tmp.exons.exclude
look like:
The exclude.omnidirectional.bed
is empty for these region.
I thought introns of interest were excluded because the adjacent exons were excluded, but soon realized that was not the case when I checked other genes (with simpler gene model structure).
This gene's introns actually show up in IRFinder results IRFinder-IR-dir.txt
.
exclude.directional.bed
and tmp.exons.exclude
seem to always have exons of the gene, but the adjacent introns are not excluded. Is there other files/parameters to check?
I appreciate if you could help me better understand the missing introns.
-Liang
Hi @hliang ,
Nice update! Now it's much clearer what happened. First of all, the files with exclude
mark where the genomic region should not be considered as "bona fide" introns. So exonic regions are surely there in the files. Other criteria such as low-mappability regions are added to the exclusion category as well (so will those regions defined in the black list that is passed to IRFinder). As we can obviously see from your IGV graph, the two introns of your interest have large chunks that would be masked out by IRFinder. The remaining "good" part is around 50% and 60% of the original length respectively.
In the script IRFinder/bin/util/IntronExclusion.pl
, which is called by IRFinder/bin/util/Build-BED-refs.sh
, there is a default cutoff setting for how long the minimal remaining part should be to let the intron to be kept. It's defined as minimal length of 40 bp and minimal ratio of 70% of the original intronic length, in the line:
if ($newlen > 40 && ($newlen/$len) >= 0.7) {
Here is the solution specific to your case:
1) First check if your introns of interest are contained in the file tmp.candidate.introns
. They are unlikely to be missing there, otherwise we have to check the GTF/GFF file to begin with.
2) You can change the above cutoffs to fit your intron of interest and re-run IRFinder/bin/util/Build-BED-refs.sh
. Now you are supposed to find them in ref-cover.bed
and the 4th column is NOT skip
.
Best, Dadi
Dadi,
After I changed the threshold of ($newlen/$len)
in IRFinder/bin/util/IntronExclusion.pl
from 0.7 to 0.4, and reran Build-BED-refs.sh
, the introns are now present in ref-cover.bed
. See the IGV screenshot below, the introns of interest are marked by arrows. I include ref-cover.bed
from default setting (upper panel) and modified setting (lower panel) for comparison.
Thanks for your great help!
No problem. I hope IRFinder is useful to your project. I'll mark this discussion as closed
.
Hi, I'm studying the intron retention ratio of a complex gene. It has 40+ transcript isoforms in the ensembl annotation (hg38). IRFinder finished without any error. In the result (IRFinder-IR-dir.txt), some of the introns are present, but a few are missing.
I think IRFinder makes the decision which introns to keep, when it builds the reference. Maybe IRFinder gets confused by so many isoforms. Would you explain why some introns are excluded?
Any suggestion if I really want to get the IR ratio of some missing introns? I'm thinking of building a gtf with a down-sized gtf file with less isoforms, maybe that makes IRFinder less confused. But I'm not sure which isoforms to include, because I don't know how IRFinder picks the good introns.
Thank you.