williamritchie / IRFinder

Detecting intron retention from RNA-Seq experiments
53 stars 25 forks source link

Questions regarding DESeQ2 analysis of intronic reads #140

Closed me37uday closed 3 years ago

me37uday commented 3 years ago

With respect to the sample code in the manual :

  1. Why does one have to analyse the counts in terms of IR and splice in the design formula? Why couldn't it simply be condition (WT vs KO)?
  2. The IRrtio.WT and IRratio.KO, is it a way of normalisation that factors for variable sequencing depths?
  3. What does the variable IR.change ultimately represent?

I'd be glad if you could walk me through these questions in the simplest way possible.

Many thanks!!

dg520 commented 3 years ago

@me37uday

  1. The direct counts, or intron expression comparison, will be confounded by the difference in overall host gene expression. Consider this: Intron_1 expressed at 10 reads and 1 read in WT and KO respectively, while its host genes expressed at 1000 reads and 100 reads in WT and KO respectively. The intron retention level doesn't increase actually. This emphasizes the essential of measuring IR as a ratio, for example, IR = (average intron depth/normal splicing).

  2. Yes. but more than that. The ratio form not only controls the difference in sequencing depth, but also cancels out the difference in gene expression.

  3. It tells you the effect size of intron retention change.

me37uday commented 3 years ago

Okay. Thanks for that.

"Consider this: Intron_1 expressed at 10 reads and 1 read in WT and KO respectively, while its host genes expressed at 1000 reads and 100 reads in WT and KO respectively. The intron retention level doesn't increase actually. "

So if I understood it right and for the sake of completeness, the IR ratio in this scenario would be :

IRratio.WT = 1000/10 = 100 IRratio.KO = 100/1 = 100 IR.change = 0, the effect size of IR change.

Is that right?

dg520 commented 3 years ago

@me37uday Correct, except I would say 10/1000 and 1/100 respectively.

kikegoni commented 3 years ago

Hey, First thanks a lot for implementing IRFinder, it it super helpful.

Regarding the first question in this Issue, I understand the reasoning for using the intron ratio. My concerning is that why are you using the intron depth and the splicing instead of the Intron ration (column 20)?. Wouldn't that lead to the same result but simplifying the design formula?

Thanks for any help you can provide here.

Best,

Kike

dg520 commented 3 years ago

@kikegoni Are you using the DESeq2-derived statistical module? If so, integers are required to fit GLM using a negative binomial distribution. If you want to use ratios to fit a model, you might want to transformed them first (e.g. log) followed by checking the normality of the distribution. If the transformed ratios are normally distributed, you might be able to fit a LM instead of GLM. When you have a small sample size like 3 vs 3, it will be hard to verify the normality though.

kikegoni commented 3 years ago

Hey @dg520 ,

Thanks a lot for your fast reply. Yes I am using DESeq2. I know it only accepts integer numbers. My point is that why are you suggesting using in your "DESeqDataSetFromIRFinder" the intron depth (column 9 of IRFinder-IR-dir.txt) and the splice info (mas value from columns 17 and 18 from IRFinder-IR-dir.txt) to generate the count matrix and the DESeq2 object, instead of you populating it from the Intron Ratio (column 20 in IRFinder-IR-dir.txt)?

Also, why do you used DESeq2 instead of edgeR which accepts decimal numbers?

Thanks a lot in advance,

Kike

dg520 commented 3 years ago

@kikegoni Column 20 are ratios between [0, 1], which is calculated from column 9, 17 and 18. Using Column 20 alone doesn't make any sense to fit a GLM.

I'm not very familiar with edgeR, but I believe edgeR uses GLM under the assumption of negative binomial distribution as well, which requires its directly input to be integers. Not sure where you see decimal numbers as input. Are you sure decimal numbers are directly used to fit a model? Is it a GLM or LM?

Nevertheless, there are many ways to do differential analysis. The key point is to find a statistics that best fit the distribution of your data. GLM is commonly used for RNASeq studies, because counts are the direct measurement in RNASeq. The assumption is that counts data should fit well with Poisson or Poisson-related distributions.

Like I said, if you want to use ratios instead of counts, it is totally fine, You just need to figure out what is the best statistics for comparing ratios in a differential analysis and if your sample size and distribution meet that criteria.

kikegoni commented 3 years ago

Hey @dg520 ,

Understood, my mistake. edgeR used also a GLM and it needs integer counts. So I suppose that if I have two conditions and I am interesting in finding the introns which are differentially retained between the two conditions I have to just compare those two contrasts, right? As you're doing in your DE tutorial here:

# Finally we can test the difference of (intronic reads/normal spliced reads) ratio between WT and KO
res.diff = results(dds, contrast=list("ConditionKO.IRFinderIR","ConditionWT.IRFinderIR"))  

Or what is the way to assess the IR in with the "results" command from DESeq2?

Thanks a lot for all your help and for the fast replies.

Kike

dg520 commented 3 years ago

@kikegoni In the command:

res.diff = results(dds, contrast=list("ConditionKO.IRFinderIR","ConditionWT.IRFinderIR")) 

In the above results function, dds is the variable name of the DESeq2Object data slot generated by the IRFinder built-in function DESeqDataSetFromIRFinder. For the contrast parameter, the part of ".IRFinderIR" is always there, as it's assigned by DESeqDataSetFromIRFinder and is not configurable for users. The parts of ConditionKO and ConditionWT really depend on your design matrix. In the IRFinder tutorial, we want to apply differential analysis on the column named Condition in the design matrix. This column has two levels. namely KO and WT. So ConditionKO and ConditionWT tells the results function the two stuff you want to compare. You can also use the DESeq2 function resultsNames to see which comparisons are available.

Overall, the commands follows the exact logic of DESeq2, as results is a DESeq2 function. I highly recommend you to read DESeq2 manual first to better understand how it works. And here is a useful link of a similar use case to IR detection created by the DESeq2 author.

kikegoni commented 3 years ago

Hey @dg520

Thanks for the answer and for the fast response. Also for the link you pointed me, really helpful.

Yes, I have experience working with DESeq2. My question was more in the direction of: If I have two conditions (let's say A and with several replicates) and I want to compare to IR between them.

This will be the colData from the experiment: condition IRFinder A IR B IR A Splice B Splice

So with the following design matrix:

design(dds) = ~condition + condition:IRFinder

So, if we do (as pointed in your tutorial):

res.diff = results(dds, contrast=list("conditionA.IRFinderIR","conditionB.IRFinderIR"))

What we are actually doing is comparing the ration between Intron reads to Splice Reads (IR/SR) whereas the Intron ratio is the Intron reads divided by the total reads (IR/(IR+SR)). Both of these situation should give equivalent log2FC but my question is if in terms of p-value both would be also equivalent. So can we assess the significances of differences in Intron retention ratio between our two conditions A and B by comparing their IR/SR ratios?

Again, thanks for all your help and sorry for so many questions. Just want to have a proper understanding.

Kike

dg520 commented 3 years ago

@kikegoni This is a good question. First of all, what we really want to test is the ABSOLUTE change of IR ratio between two conditions, NOT the FOLD change. Unfortunately, this is not do-able via GLM because of its property (i.e. log is used as the link function, so only fold change is testable).

As we can only ask for the significant of fold change, we turn the question into: whether the change in the intronic reads is stronger than the change in the splicing. Either the fold change of IR/SR or the folder change of (IR/(IR+SR)) will do that for you, while the latter one is more conservative in the fold change estimation.

So the log2FC will NOT be equivalent between the use of IR/SR and (IR/(IR+SR)). P value will be surely inequivalent as well.

There is nothing wrong to choose either way. If you prefer the conservative approach, you can change the line in DESeqDataSetFromIRFinder from

spl=cbind(spl,tmp6)

to

spl=cbind(spl,tmp1+tmp6)

where tmp1 represents intronic reads and tmp6 represents SR reads.

kikegoni commented 3 years ago

Hey @dg520 ,

Thanks a lot! Understood! I will change that line!

Best,

Kike