xzhoulab / iDEA

Differential expression (DE); gene set Enrichment Analysis (GSEA); single cell RNAseq studies (scRNAseq)
GNU General Public License v3.0
32 stars 11 forks source link

NES from GSEA #18

Closed RawanSO closed 2 years ago

RawanSO commented 2 years ago

Hi, What is the equivalent NES score from GSEA in iDEA? How to get the enrichment score in iDEA? Thanks

MardahlM commented 2 years ago

Hi, I am also interested in the answer to this question:-). Cheers, Maibritt

YingMa0107 commented 2 years ago

Hi@RawanSO and @MardahlM,

For the enrichment score (ES) in GSEA, it is calculated through walking down the raked list of genes for each specific gene set. It aims to determine whether the genes in a specific gene set tend to occur toward the top (or the bottom) of the ranked list. The enrichment score reflects the degree to which the genes in a gene set are overrepresented at the top or bottom of the entire ranked list of genes. And the NSE is just the normalized ES which accounts for differences in gene set size and in correlations between gene sets and the expression dataset.

Different from GSEA, iDEA is a model-based framework which performs joint DE and GSE analysis through a hierarchical Bayesian framework. We don't have the exact equivalent enrichment score in our model, instead the anno_coef in the output, which is denoted by the tau parameter in iDEA's model (Details see Methods section), is defined as a gene set enrichment parameter that determines the odds ratio of DE for genes inside the gene set versus genes outside the gene set. And the sign of annot_coef determines whether DE genes are enriched or depleted in the gene set.

Hope this helps! Let me know if you have further questions!

aguang commented 2 years ago

I have an additional question to this. Since the summary statistics are from comparisons between 2 groups, say group A vs group B where positive logfoldchange means A has higher expression than B, then if annot_coef for a given gene set is positive does that mean the pathway is enriched (roughly speaking, I understand it is an odds ratio) for A relative to B? If the summary statistics I provided were the opposite signs (B vs A) would I get the opposite signs for annot_coef as well?

YingMa0107 commented 2 years ago

Hi @aguang,

Thank you for your interest in our method and this is a good point. So my initial answer for your question is that the annot_coef sign will not change because it is defined as the odds ratio of being DE genes for genes inside the gene set versus the genes outside the gene set. So as long as the DE genes are the same, the sign of the annot_coef will not change.

Also I tried a toy example to explore the changes and I used the summary statistics from Seurat tutorial: https://satijalab.org/seurat/articles/de_vignette.html

library(iDEA)
data(annotation_data)##### First make a R package:
library(Seurat)
library(SeuratData)
SeuratData::InstallData("pbmc3k")
pbmc <- LoadData("pbmc3k", type = "pbmc3k.final")
#### test  1 compare ident.1 versus ident.2
monocyte.de.markers <- FindMarkers(pbmc, ident.1 = "CD14+ Mono", ident.2 = "FCGR3A+ Mono")
pvalue <- monocyte.de.markers$p_val#### the pvalue column
zscore <- qnorm(pvalue/2.0, lower.tail=FALSE) #### convert the pvalue to z-score
beta <- monocyte.de.markers$avg_log2FC## effect size
se_beta <- abs(beta/zscore) ## to approximate the standard error of beta
beta_var = se_beta^2  ### square
summary = data.frame(beta = beta,beta_var = beta_var)
rownames(summary) = rownames(monocyte.de.markers) ### or the gene id column in the res_DE results
idea <- CreateiDEAObject(summary, annotation_data[,1:10], max_var_beta = 100, min_precent_annot = 0.0025, num_core=10)
set.seed(1)
idea <- iDEA.fit(idea, fit_noGS=FALSE, init_beta=NULL, init_tau=c(-2,0.5), min_degene=5, em_iter=15, mcmc_iter=1000, fit.tol=1e-5, modelVariant = F, verbose=TRUE)
idea <- iDEA.louis(idea) ## 
head(idea@gsea[,1:2])
                                          annot_id annot_coef
1                    GO_CELLULAR_RESPONSE_TO_LIPID  0.3766660
2                             GO_SECRETION_BY_CELL  0.1015651
3 GO_REGULATION_OF_CANONICAL_WNT_SIGNALING_PATHWAY  0.4479713
4            GO_REGULATION_OF_DEVELOPMENTAL_GROWTH  0.4328906
5        GO_CELLULAR_RESPONSE_TO_EXTERNAL_STIMULUS -0.1649228
6                  GO_ACTIN_FILAMENT_BASED_PROCESS  0.4510983

##### change the relationship 
#### test  1 compare ident.2 versus ident.1
monocyte.de.markers2 <- FindMarkers(pbmc, ident.2 = "CD14+ Mono", ident.1 = "FCGR3A+ Mono")
pvalue2 <- monocyte.de.markers2$p_val#### the pvalue column
zscore2 <- qnorm(pvalue2/2.0, lower.tail=FALSE) #### convert the pvalue to z-score
beta2 <- monocyte.de.markers2$avg_log2FC## effect size
se_beta2 <- abs(beta2/zscore2) ## to approximate the standard error of beta
beta_var2 = se_beta2^2  ### square
summary2 = data.frame(beta = beta2,beta_var = beta_var2)
rownames(summary2) = rownames(monocyte.de.markers2) ### or the gene id column in the res_DE results
idea2 <- CreateiDEAObject(summary2, annotation_data[,1:10], max_var_beta = 100, min_precent_annot = 0.0025, num_core=10)
set.seed(1)
idea2 <- iDEA.fit(idea2, fit_noGS=FALSE, init_beta=NULL, init_tau=c(-2,0.5), min_degene=5, em_iter=15, mcmc_iter=1000, fit.tol=1e-5, modelVariant = F, verbose=TRUE)
idea2 <- iDEA.louis(idea2) ## 
head(idea2@gsea[,1:2])
                                          annot_id  annot_coef
1                    GO_CELLULAR_RESPONSE_TO_LIPID  0.30098040
2                             GO_SECRETION_BY_CELL  0.09674772
3 GO_REGULATION_OF_CANONICAL_WNT_SIGNALING_PATHWAY  0.49213259
4            GO_REGULATION_OF_DEVELOPMENTAL_GROWTH  0.43674055
5        GO_CELLULAR_RESPONSE_TO_EXTERNAL_STIMULUS -0.16788215
6                  GO_ACTIN_FILAMENT_BASED_PROCESS  0.50108307

So the sign does not change and the little variability here might be due to the estimation of posterior mean of the DE effect size in the MCMC sampling, but it should not affect the rank of the gene sets from top enriched gene sets to the bottom ones. Hope this is helpful for you to understand this question!

aguang commented 2 years ago

I see, that is helpful thank you. But then the sign of annot_coef does not have anything to do with whether the DE genes in the gene set are enriched or not, correct? Which is what you commented earlier:

Different from GSEA, iDEA is a model-based framework which performs joint DE and GSE analysis through a hierarchical Bayesian framework. We don't have the exact equivalent enrichment score in our model, instead the anno_coef in the output, which is denoted by the tau parameter in iDEA's model (Details see Methods section), is defined as a gene set enrichment parameter that determines the odds ratio of DE for genes inside the gene set versus genes outside the gene set. And the sign of annot_coef determines whether DE genes are enriched or depleted in the gene set.

denvercal1234GitHub commented 1 month ago

@aguang - Did you find answers to your second follow-up question?