Columns on which to filter

andreismol commented 4 years ago

Hi,

Thanks for the tool. It seems to be working excellently. I've got a few questions that I'd like to clarify.

Could you provide the definitions of the four Warnings in the last column of the IRFinder-IR-nondir output file (LowCover, LowSplicing, MinorIsoform, NonUniformIntronCover)? Specifically, at what cutoff do LowCover and LowSplicing get triggered?
I was wondering whether you might be able to provide some advice on which columns to filter to get a set of high quality introns in each sample. In a response you gave to a previous issue, you mentioned that you filter on:

Column 8 (fraction of the intron region covered by RNAseq reads): >=0.7;
Column 19 (number of "correct" splicing reads that splice out the intron): >=10 or >=5 depending on RNAseq depth;
Column 21 (quality control of the RNAseq reads that support the current intron): keep the ones marked - or NonUniformIntronCover

What's the logic behind these particular columns? Obviously you would want to exclude introns which are poorly covered, but why exclude on the basis of low numbers of "correct" splicing? Why include introns with NonUniformIntronCover? And are there any other columns which one should filter on? (at the moment I'm filtering on IntronDepth>3 and IRratio>0.1)

How important are the "static warnings" in the Name column (i.e. clean, anti-over, anti-near, etc)? As I understand it these indicate whether or not the intron overlaps with other features, but aren't these regions already excluded during the genome preparation stage? Would there be any reason to filter on these static warnings too?

Thanks again for all the effort involved in producing and documenting IRFinder!

-Andrei

dg520 commented 4 years ago

Hi @andreismol,

Q1:

LowCover: correct splicing (column 19) + intron average depth (column 9) < 10, meaning the overall sequencing depth for this event is low.
LowSplicing: correct splicing (column 19) < 4, meaning there are not enough reads supporting the correct splicing.
MinorIsoform: correct splicing (column 19) * 1.33333333 < max(spliceLeft at column 17, spliceRight at column 18), meaning the event is not the main/most common splicing outcome among all transcripts of this gene.
NonUniformIntronCover: this one is a bit more complicated, as follows:

(max(SPleft, SPright) > intronTrimmedMean+2 && max(SPleft, SPright) > intronTrimmedMean*1.5)

where SPleft, SPright and intronTrimmedMean are columns 13, 14 and 9, respectively. This tag examines whether read coverage is evenly distributed across the exonic and intronic regions of an event. All of the above is defined in ReadBlockProcessor_CoverageBlocks.cpp under IRFinder/src/irfinder.
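
For reference, the four rules described above can be sketched in Python roughly as follows. This is only an illustration of the thresholds as stated in this comment, not the C++ code in ReadBlockProcessor_CoverageBlocks.cpp. The function name `irfinder_warning` and its argument names are made up for this example. Two things are assumptions: that the checks are evaluated in the order listed with only the first match reported, and that intronTrimmedMean is the same intron depth value used in the LowCover rule. Only the part of the NonUniformIntronCover condition quoted above is included.

```python
def irfinder_warning(splice_exact, intron_depth, splice_left, splice_right,
                     sp_left, sp_right):
    """Return the first warning whose rule matches, or "-" if none does.

    splice_exact              -- reads supporting the "correct" splicing
    intron_depth              -- intron average depth (used as intronTrimmedMean)
    splice_left, splice_right -- spliceLeft / spliceRight counts
    sp_left, sp_right         -- SPleft / SPright counts
    """
    # LowCover: overall depth for the event is low.
    if splice_exact + intron_depth < 10:
        return "LowCover"
    # LowSplicing: too few reads support the correct splicing.
    if splice_exact < 4:
        return "LowSplicing"
    # MinorIsoform: another splice from one of the ends is clearly more common.
    if splice_exact * 1.33333333 < max(splice_left, splice_right):
        return "MinorIsoform"
    # NonUniformIntronCover: the boundary counts exceed the intron depth
    # by both an absolute (+2) and a relative (x1.5) margin.
    sp_max = max(sp_left, sp_right)
    if sp_max > intron_depth + 2 and sp_max > intron_depth * 1.5:
        return "NonUniformIntronCover"
    return "-"
```

Since every rule is a plain threshold comparison, an event sitting just on either side of a cutoff can change its warning with a difference of one or two reads.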

Q2: We filter for sufficient correct splicing for two reasons:

We want to be sure that splicing happens, indicating that the annotated event is a true intron.
This value makes up a large part of the denominator of the IR ratio. When it is small, the IR ratio becomes too sensitive to noisy intronic reads and tends to be overestimated.

As you can see, all the cutoffs in your Q1 are arbitrary, including NonUniformIntronCover. For NonUniformIntronCover specifically, the cutoff can sometimes be too stringent because of its complicated criteria. That said, the cutoffs you listed in your Q2 are general guidance. I recommend investigating your own data to figure out which combination suits you best.
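
As a starting point for that kind of exploration, the general-guidance cutoffs quoted in Q2 could be applied with a small script like the one below. It is a minimal sketch, assuming the IRFinder-IR-nondir file is tab-delimited with one header line and that the column positions are the ones quoted in this thread (8, 19 and 21, counted from 1). The names `keep_intron` and `filter_introns` are made up for this example and are not part of IRFinder.

```python
import csv

# 1-based column positions as quoted in this thread.
COVERAGE_COL = 8       # fraction of the intron covered by reads
SPLICE_EXACT_COL = 19  # reads supporting the "correct" splicing
WARNINGS_COL = 21      # warning column
ACCEPTED_WARNINGS = {"-", "NonUniformIntronCover"}


def keep_intron(fields, min_coverage=0.7, min_splice=10):
    """Return True if one row (a list of column values) passes all cutoffs."""
    coverage = float(fields[COVERAGE_COL - 1])
    splice_exact = float(fields[SPLICE_EXACT_COL - 1])
    warning = fields[WARNINGS_COL - 1]
    return (coverage >= min_coverage
            and splice_exact >= min_splice
            and warning in ACCEPTED_WARNINGS)


def filter_introns(lines, **cutoffs):
    """Filter an iterable of tab-delimited lines (header first)."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader)  # skip the header line (assumed to be present)
    return [row for row in reader if keep_intron(row, **cutoffs)]


# Example use (the path is a placeholder):
# with open("IRFinder-IR-nondir.txt", newline="") as handle:
#     kept = filter_introns(handle, min_splice=5)  # lower cutoff for shallow data
```

Keeping the cutoffs as keyword arguments makes it easy to rerun the filter with different combinations and compare how many introns survive in each sample.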

Q3: Your understanding of the static warnings is right. We do exclude those contaminated regions from the calculation, but not the entire events. Thus, users are free to choose whether or not to exclude the whole event if it is not clean.
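
If you do decide to drop non-clean events, one way is to read the static warning out of the Name column. The sketch below assumes that the Name field has the form gene name / gene ID / static warning, separated by slashes, with the warning as the last part. The helper names and the example identifiers are made up for illustration.

```python
def static_warning(name):
    """Return the last slash-separated part of the Name field.

    Assumes a layout like "GeneName/GeneID/clean"; if the field contains
    no slash, the whole string is returned unchanged.
    """
    return name.rsplit("/", 1)[-1]


def is_clean(name):
    """True if the static warning of this event is exactly "clean"."""
    return static_warning(name) == "clean"


# Hypothetical Name values, for illustration only.
names = ["GENE1/ENSG000001/clean", "GENE2/ENSG000002/anti-near"]
clean_only = [n for n in names if is_clean(n)]
```

Whether this extra filter is worthwhile depends on how much of each intron was excluded, which you can check against the excluded-bases information in your own output.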

Best, Dadi

andreismol commented 4 years ago

Thank you, Dadi! Much appreciated.

williamritchie / IRFinder

Columns on which to filter #85