TOAST clarifications - GitHub Issues

sam-israel commented 3 years ago

1. The CARSeq article states the following about TOAST:

TOAST defines the effect size as β/(µ + β/2), where µ is base-line expression in one group, and β is the gene expression difference between two groups.

Can you confirm that? If so, how can µ be negative? This happens even in the vignette:

    res_table <- csTest(fitted_model, 
                    coef = "disease", 
                    cell_type = "Bcell")

    #    Test the effect of disease in Bcell.

    head(res_table, 3)

                     beta   beta_var          mu effect_size f_statistics
    cg07075387 -0.7321689 0.03946395 -0.06071048    1.715505     13.58382
    cg13293535 -0.4776350 0.01825102  0.03636267    2.359218     12.49986
    cg15300101 -0.4525076 0.01625961 -0.13260057    1.260978     11.37536

What procedure would you suggest for genes with a negative base-line expression?

2. If the contrast is not specified for csTest, in which direction is the comparison made? It is crucial to get this right. For example, in
```
res_table <- csTest(fitted_model, coef = "Condition")
head(Design_out$design$Condition)
[1] No  No  Yes No  No  No 
Levels: No Yes  
```
is the comparison Yes - No, or No - Yes?
3. If I want to keep only genes with a decent effect size, what filter would you suggest? Would you recommend calculating the fold change as (μ+β)/μ and keeping genes with fold change > 2?

4. Could you explain what "testing the joint effect in all cell types" means? For example:

    res_table <- csTest(fitted_model,
                    coef = "disease",
                    cell_type = "joint")

What does it mean when a high number of genes (almost all of them) are significant in the "joint" test?

5. Does TOAST expect the input proportions to sum to 1 for each sample? If the proportions are absolute proportions from ABIS (which do not necessarily sum to 1), would you recommend rescaling?
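Rescaling in the sense of question 5 would amount to dividing each sample's proportions by that sample's total. A minimal base R sketch follows; the matrix `prop`, its sample names, and its values are hypothetical, and the sketch assumes the fraction not covered by the listed cell types can simply be ignored.

```r
# Hypothetical proportion matrix: rows are samples, columns are cell types.
# The rows deliberately do not sum to 1, as with absolute proportions.
prop <- matrix(c(0.30, 0.20, 0.10,
                 0.50, 0.25, 0.15),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("s1", "s2"), c("Bcell", "Tcell", "Mono")))

# Divide every row by its own total so that each sample sums to 1.
prop_rescaled <- sweep(prop, 1, rowSums(prop), "/")

rowSums(prop_rescaled)
```

The relative abundances within each sample are unchanged by this step; only the overall scale per sample is.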

ziyili20 commented 3 years ago

Hi sam-israel, thanks for the questions.

1. The csDE function in TOAST uses a linear model without additional constraints on the parameters, so negative mu estimates are unfortunately unavoidable. I don't currently have a procedure for correcting these negative base-line expressions. Sorry about that.
2. I recommend changing the Condition variable to a factor with levels 0 and 1 (e.g. 1 for Yes and 0 for No). For example, Design_out$design$Condition_num <- factor(ifelse(Design_out$design$Condition == "Yes", 1, 0)). TOAST will then compare 1 versus 0, i.e. Yes versus No.
3. This depends heavily on the data. You can try different thresholds and see which results make more sense.
4. For each cell type i, there is a beta_i. The baseline cell type has mu. The joint test uses an F test to jointly test mu = beta_1 = beta_2 = ... = 0. A significant joint test indicates that there is a change in at least one cell type. We sometimes observe a high rate of false positives with the joint test. You could consider permuting your phenotype variable and obtaining permutation p-values to check whether the signals from the joint test are real.
5. Yes, we expect the input proportions to sum to 1. Please rescale the proportion matrix if the proportions do not sum to 1 for every subject.
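The permutation check suggested above for the joint test can be sketched in base R for a single feature. This sketch does not call TOAST: the data are simulated, `joint_f` is a hypothetical stand-in for the statistic one would obtain by refitting the model on the permuted phenotype, and the joint null is assumed here to be "no phenotype effect in any cell type", which may differ from how TOAST defines its joint test.

```r
set.seed(1)
n <- 60

# Simulated proportions for two cell types (rows sum to 1) and a binary phenotype.
p1 <- runif(n, 0.2, 0.8)
prop <- cbind(A = p1, B = 1 - p1)
disease <- rep(c(0, 1), each = n / 2)

# One simulated feature with no true disease effect in either cell type.
y <- 5 * prop[, "A"] + 2 * prop[, "B"] + rnorm(n, sd = 0.5)

# Hypothetical stand-in statistic: F test comparing a model with
# cell-type-specific disease effects against a model without them.
joint_f <- function(y, prop, disease) {
  d <- data.frame(y = y, A = prop[, "A"], B = prop[, "B"],
                  A_dis = prop[, "A"] * disease,
                  B_dis = prop[, "B"] * disease)
  reduced <- lm(y ~ 0 + A + B, data = d)
  full <- lm(y ~ 0 + A + B + A_dis + B_dis, data = d)
  anova(reduced, full)$F[2]
}

f_obs <- joint_f(y, prop, disease)

# Shuffle the phenotype labels and recompute the statistic each time.
n_perm <- 200
f_perm <- replicate(n_perm, joint_f(y, prop, sample(disease)))

# Permutation p-value with the usual +1 correction.
p_perm <- (1 + sum(f_perm >= f_obs)) / (n_perm + 1)
p_perm
```

If the permutation p-values stay large for features that the parametric joint test calls significant, that would be consistent with the false-positive concern raised above.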

sam-israel commented 3 years ago

Hi, Thank you for the answers.

One possible way to make the multiple-comparison correction (FDR) less strict is to pre-filter the genes TOAST is applied to. If I pre-filter the genes (based on their average TPM), there will be fewer genes in total, and the FDR correction will adjust the p-values less strictly. My question is whether this is advisable from the deconvolution point of view. Will TOAST operate with maximum efficiency if it receives all (human) genes as input? Is pre-filtering recommended? Is there a minimum recommended number of genes?
Is calculating fold change as (μ+β)/μ correct? For what purposes should effect_size be used rather than the fold change?
What would be a good way of calculating some measure of fold change when μ is negative? I want to see whether there is an enrichment in the results by feeding the genes into software such as GSEA. For that I need to order by p-value and by a measure of change.
Would (|μ|+β)/|μ| work?

In my dataset

summary(myres$MAIT$mu)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-12505.83      0.04      5.05     65.54     24.45 131194.07

Hence both fold change (calculated manually) and effect_size can give negative values.

summary(myres$MAIT$effect_size)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-4454.514     0.042     1.031     0.698     1.788  2235.051

myres$MAIT$foldchange <- (myres$MAIT$mu+myres$MAIT$beta)/myres$MAIT$mu

summary(myres$MAIT$foldchange)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-99479.19     -0.26      1.30     -5.00      3.81   9113.20

ziyili20 commented 3 years ago

Hi sam-israel,

Regarding "My question is whether this is advisable from the deconvolution point of view. Will TOAST operate with maximum efficiency if it receives all (human) genes as input? Is pre-filtering recommended? Is there a minimum recommended number of genes?": yes, we recommend pre-filtering the data to remove genes with low expression. We haven't explored how that affects the DE results, so no minimum recommended number is available. Previously we have used ad hoc approaches, such as filtering out genes with mean expression < 2. You can explore different filtering thresholds.
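The ad hoc mean-expression filter mentioned here could look like the following base R sketch. The matrix `Y`, its gene names, and its values are hypothetical; the threshold of 2 is the one quoted above and is meant to be varied.

```r
# Hypothetical expression matrix: rows are genes, columns are samples.
Y <- rbind(geneA = c(0.5, 1.0, 0.2, 0.8),
           geneB = c(12, 15, 9, 11),
           geneC = c(2.5, 1.5, 3.0, 2.0))

# Drop genes whose mean expression across samples is below the threshold.
min_mean <- 2
keep <- rowMeans(Y) >= min_mean
Y_filtered <- Y[keep, , drop = FALSE]

rownames(Y_filtered)
```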

For the question about fold change: the purpose of our effect size calculation is to provide a measure similar to fold change. Let me explain why we chose β/(µ + β/2). Consider two genes, A and B. In the two conditions (non-diseased and diseased), their expression levels in the same cell type are 10 and 110 (gene A) and 10000 and 10100 (gene B). Then β is 100 for both genes, but µ is 10 for A and 10000 for B. (μ+β)/μ gives 11 and 1.01; β/(µ + β/2) gives 1.67 and 0.00995. Both rank gene A higher than gene B. However, when μ is very small, e.g. μ = 0.001, (μ+β)/μ is not as stable as β/(µ + β/2). I hesitate to interpret the negative values, as they may result from an improper model fit... Maybe this is something we could work on in the future.
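The numbers in this explanation can be reproduced with a few lines of base R. The helper names `effect_size` and `fold_change` are chosen here for illustration and are not TOAST functions; the last part recomputes the effect sizes of the three vignette rows quoted in the first post from their beta and mu columns.

```r
# Illustrative helpers implementing the two formulas discussed above.
effect_size <- function(mu, beta) beta / (mu + beta / 2)
fold_change <- function(mu, beta) (mu + beta) / mu

# Genes A and B from the explanation: same beta, very different baseline.
mu <- c(A = 10, B = 10000)
beta <- c(A = 100, B = 100)
es_ab <- effect_size(mu, beta)   # about 1.67 and 0.00995
fc_ab <- fold_change(mu, beta)   # 11 and 1.01

# Very small baseline: the fold change explodes, the effect size stays below 2.
es_small <- effect_size(0.001, 100)
fc_small <- fold_change(0.001, 100)

# Vignette rows: beta and mu as printed by head(res_table, 3).
vig_beta <- c(-0.7321689, -0.4776350, -0.4525076)
vig_mu <- c(-0.06071048, 0.03636267, -0.13260057)
vig_es <- effect_size(vig_mu, vig_beta)
round(vig_es, 4)
```

For positive µ and β the ratio β/(µ + β/2) cannot exceed 2, which is why it behaves more stably than (μ+β)/μ as µ approaches zero. Once µ + β/2 is close to zero or changes sign, as can happen with negative µ estimates, that bound no longer holds.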

Hope this helps.

sam-israel commented 3 years ago

Thank you.

How can QC be applied to the list of significant genes? For example, would you recommend filtering out those genes for which the mu is negative, and performing enrichment tests only on the significant genes with positive mu?
What sanity checks would you recommend to look at for TOAST results (if any)?
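The filter described in the first question could be written as below. This is only a sketch of the idea being asked about, not a procedure the thread endorses. The `beta`, `mu`, and `effect_size` values are copied from the vignette output quoted in the first post; the `fdr` column, its values, and the 0.05 cutoff are made up for illustration.

```r
# beta, mu and effect_size come from the vignette rows in the first post;
# the fdr column is hypothetical and only serves to illustrate the filter.
res_table <- data.frame(
  beta = c(-0.7321689, -0.4776350, -0.4525076),
  mu = c(-0.06071048, 0.03636267, -0.13260057),
  effect_size = c(1.715505, 2.359218, 1.260978),
  fdr = c(0.01, 0.03, 0.20),
  row.names = c("cg07075387", "cg13293535", "cg15300101")
)

fdr_cutoff <- 0.05
significant <- res_table$fdr < fdr_cutoff
positive_mu <- res_table$mu > 0

# Significant features with a positive baseline estimate.
kept <- res_table[significant & positive_mu, , drop = FALSE]

# Significant features set aside because their baseline estimate is negative.
flagged <- res_table[significant & !positive_mu, , drop = FALSE]
```

Keeping the flagged rows in a separate table rather than discarding them makes it possible to see how many significant calls depend on a negative baseline estimate.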

ziyili20 commented 3 years ago

Hi sam-israel,

These are great questions. Honestly, more research and evaluation are needed in these directions. We currently do not use formal QC approaches; instead, we consult our biological collaborators to judge whether the results make sense. We have also found that the quality of the findings is highly correlated with cell type abundance. The cell-type-specific DE results for cell types with proportions ~ 0.4 are much more reliable than those for a rare cell type with proportion ~ 0.05. For a very rare cell type with proportion < 0.05, the majority of csDE findings are likely to be false positives. However, we have not evaluated this quantitatively so far. Hope this helps.
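One way to act on the abundance remark is to look at the mean proportion of each cell type before interpreting its results. The sketch below uses a hypothetical proportion matrix; the 0.05 threshold is the one mentioned above, and using the mean across samples as the summary is an assumption of this example.

```r
# Hypothetical proportion matrix: rows are samples, columns are cell types.
prop <- rbind(s1 = c(Bcell = 0.42, Tcell = 0.54, MAIT = 0.04),
              s2 = c(Bcell = 0.38, Tcell = 0.59, MAIT = 0.03),
              s3 = c(Bcell = 0.40, Tcell = 0.55, MAIT = 0.05))

# Average abundance of each cell type across samples.
mean_prop <- colMeans(prop)

# Cell types whose results deserve extra caution because they are rare.
rare <- names(mean_prop)[mean_prop < 0.05]
rare
```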

ziyili20 / TOAST

TOAST clarifications #6