xzhoulab / iDEA

Differential expression (DE); gene set Enrichment Analysis (GSEA); single cell RNAseq studies (scRNAseq)
GNU General Public License v3.0
30 stars 11 forks source link

pvalue_louis from idea@gsea #8

Closed tinggithub33 closed 2 years ago

tinggithub33 commented 3 years ago
I have a question about pvalue_louis as below. When p value =2.12E-28, pvalue_louis = 7.20E-26 (second row from bottom). this make sense to me. But when pvalue is 0, pvalue_louis is 0.99999XXX. it seems weird to me. do you think it is right? any comments will be greatly appreciated. Best reargds, Tingfen annot_id pvalue_louis pvalue
KEGG_CITRATE_CYCLE_TCA_CYCLE 0.99999838 0
GO_ER_TO_GOLGI_TRANSPORT_VESICLE_MEMBRANE 0.99999838 0
GO_NEGATIVE_REGULATION_OF_MUSCLE_CELL_DIFFERENTIATION 0.99999839 0
MARIADASON_REGULATED_BY_HISTONE_ACETYLATION_DN 0.9999984 0
NIKOLSKY_BREAST_CANCER_12Q13_Q21_AMPLICON 0.9999984 0
FONTAINE_THYROID_TUMOR_UNCERTAIN_MALIGNANCY_UP 0.9999984 0
GO_SMALL_SUBUNIT_PROCESSOME 0.99999841 0
DUTTA_APOPTOSIS_VIA_NFKB 0.99999842 0
GO_RESPONSE_TO_ZINC_ION 7.20E-26 2.12E-28
GO_CELLULAR_RESPONSE_TO_INORGANIC_SUBSTANCE 9.86E-21 2.63E-23
tinggithub33 commented 3 years ago

below is the script I ran. and I found a message from standard output as:

===== iDEA INPUT SUMMARY ====

number of annotations: 9759

number of genes: 10374

number of cores: 10

fitting the model with gene sets information...

[1] "try-error!" . . what does "try-error!" tell? Thank you very much for any comments! Tingfen

#######################scripts###############

output from MAST

d <- read.delim(sprintf("%s/%s",indir,input_file))

using coef and z score to calculate se

d$lfcSE=abs(d$coef/d$z) rownames(d) <- d$primerid summary_data=na.omit(d[,c("coef","lfcSE")]) colnames(summary_data) <- c("logFoldChange","lfcSE")

Create an iDEA object

data(humanGeneSets) annotation_data=humanGeneSets idea <- CreateiDEAObject(summary_data, annotation_data, max_var_beta = 100, min_precent_annot = 0.0025, num_core=10)

fit model

idea <- iDEA.fit(idea, fit_noGS=FALSE, init_beta=NULL, init_tau=c(-2,0.5), min_degene=5, em_iter=15, mcmc_iter=1000, fit.tol=1e-5, modelVariant = F, verbose=TRUE)

correct p value

idea <- iDEA.louis(idea) gsea <- idea@gsea

write.table(gsea, sprintf("%s/iDEA/%s/gsea_%s_z.txt",indir,cellType,out_file),row.names = F, quote=F, sep="\t")

ZHIDIHUAYUAN commented 3 years ago

Hi, I also encountered the same problem. ################# idea.null <- iDEA.louis(idea.null) [1] "try-error!" [1] "try-error!" [1] "try-error!" [1] "try-error!" [1] "try-error!"

mcclo commented 3 years ago

Hi, I've been having a similar issue implementing and understanding pvalue_louis. Running DE of 10,000 genes (logFC from -4 to 4) from edgeR and using variance (back-calculated from edgeR p-value -> zscore), I get the following plot with little significance. Very excited to figure this out and implement iDEA for our studies!

test

YingMa0107 commented 3 years ago

Hi, @tinggithub33 and @ZHIDIHUAYUAN,

The inconsistency between pvalue_louis and pvalue might because of the estimation of the annotation coefficient estimation and the variance estimation of the coefficient. I suspect that the annot_coef estimate for the gene sets with the pvalue_louis near 1 are quite small, i.e. < -10. This extremely small estimate will lead to the probability of the gene in this gene set being the DE gene quite small. Then the variance estimates in the EM algorithms will be super large. When we use Louis method to correct for the information matrix which was need to estimate the variance of the gene set enrichment coefficient, with those small extremely parameters, the variance estimates will usually to be very large. I suspect the number of genes from your summary statistics as well as in these questionable gene sets are very small, which might cause this issue.

For the "try error" message, it's just the reminder that the variance estimates by Louis method are negative, this might also due to the extremely small gene set size. These should not be a big issue and I expect this should be happening among only a few gene sets with extremely small gene set size.

YingMa0107 commented 3 years ago

Hi @mcclo,

pvalue_louis here is calculated when we use Louis Method to correct for the information matrix calculated in the EM algorithm. Based on your plot, what does the distribution of your input variance and input p-value of the genes look like? Perhaps more details would be helpful to understand the little significance here.

mcclo commented 3 years ago

Hi @YingMa1993 ,

Below are histograms of the input variance and p-value distributions. Thanks in advance for taking the time to help out!

To calculate beta_var from pvalue: zscore <- qnorm(pvalue/2.0, lower.tail=FALSE) beta <- markers$log2FC beta_var <- beta/zscore

Beta_var

Pval

YingMa0107 commented 3 years ago

Hi @mcclo,

Thanks for providing those details! I think here the beta_var should be beta_var <- (beta/zscore)^2 since it is the variance so it should be the square of the standard error when SE = abs(beta / zscore). I've add more details about how to calculate the SE and beta_var in our tutorial.