xzhoulab / iDEA

Differential expression (DE); gene set Enrichment Analysis (GSEA); single cell RNAseq studies (scRNAseq)
GNU General Public License v3.0
32 stars 11 forks source link

Encountering Error when Correcting p-values with iDEA.louis() method #22

Closed PottsC closed 2 years ago

PottsC commented 2 years ago

Hi there, I'm encountering the following error when executing the Louis method p-value correction method as demonstrated in your tutorial.

Error in if (!is.na(res$annot_coef[2])) { : argument is of length zero

In trying to disagnose whats wrong, I notice that the de slot conatins annotations of Type NULL image

For the summary statistics, I calculated (using your example from the tutorial) the beta and beta_var vlues from p-value and log2FC from the FindConservedMarkers function from within Seurat.

> head(cluster0_DEstats)
             beta    beta_var
ABCA10  0.4089271 0.015240638
ABCA6   0.5550807 0.017470111
ABCA8   0.3439423 0.024959189
ABCA9   0.4771893 0.013693380
ABLIM1 -0.5137350 0.006145706
ACACB  -0.6458339 0.019331025

For the gene specific annotations, I selected the rows from within the humanGeneSets data included with iDEA that corresponded with the genes returned from the DE analysis.

> cluster0_annotations[1:3,1:3]
       NAKAMURA_CANCER_MICROENVIRONMENT_DN WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP WINTER_HYPOXIA_UP
ABCA10                                   0                                    0                 0
ABCA6                                    0                                    0                 0
ABCA8                                    0                                    0                 0

This is the command I used to create the iDEA object:

cluster0_idea <- CreateiDEAObject(cluster0_DEstats, cluster0_annotations, max_var_beta = 100, min_precent_annot = 0.0025, num_core=1)

And this is the command I used to fit the model:

cluster0_idea <- iDEA.fit(cluster0_idea,
                 fit_noGS=FALSE,
                   init_beta=NULL, 
                   init_tau=c(-2,0.5),
                   min_degene=5,
                   em_iter=15,
                   mcmc_iter=1000, 
                   fit.tol=1e-5,
                       modelVariant = F,
                   verbose=TRUE)

For information, here is my session info:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /shared/centos7/r-project/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /shared/centos7/r-project/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] devtools_2.4.3              usethis_2.1.5               forcats_0.5.1               stringr_1.4.0               purrr_0.3.4                
 [6] readr_2.1.2                 tidyr_1.2.0                 tibble_3.1.7                tidyverse_1.3.1             iDEA_1.0.1                 
[11] HGNChelper_0.8.1            ggsignif_0.6.3              metap_1.8                   DropletUtils_1.10.3         SingleCellExperiment_1.12.0
[16] SummarizedExperiment_1.20.0 Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.7         IRanges_2.28.0             
[21] S4Vectors_0.32.3            BiocGenerics_0.40.0         MatrixGenerics_1.6.0        matrixStats_0.62.0          glue_1.6.2                 
[26] patchwork_1.1.1             cowplot_1.1.1               sctransform_0.3.3           ggplot2_3.3.6               SeuratObject_4.0.4         
[31] Seurat_4.1.0                dplyr_1.0.8                

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                reticulate_1.24           R.utils_2.11.0            tidyselect_1.1.2          htmlwidgets_1.5.4        
  [6] grid_4.0.2                BiocParallel_1.24.1       Rtsne_0.15                munsell_0.5.0             codetools_0.2-16         
 [11] mutoss_0.1-12             ica_1.0-2                 future_1.24.0             miniUI_0.1.1.1            withr_2.5.0              
 [16] spatstat.random_2.1-0     colorspace_2.0-3          knitr_1.39                rstudioapi_0.13           ROCR_1.0-11              
 [21] tensor_1.5                pbmcapply_1.5.1           listenv_0.8.0             Rdpack_2.1.4              GenomeInfoDbData_1.2.4   
 [26] mnormt_2.0.2              polyclip_1.10-0           rhdf5_2.34.0              rprojroot_2.0.2           parallelly_1.30.0        
 [31] vctrs_0.4.1               generics_0.1.2            TH.data_1.1-0             xfun_0.30                 R6_2.5.1                 
 [36] doParallel_1.0.17         locfit_1.5-9.4            cachem_1.0.6              bitops_1.0-7              rhdf5filters_1.2.1       
 [41] spatstat.utils_2.3-0      DelayedArray_0.20.0       assertthat_0.2.1          promises_1.2.0.1          scales_1.2.0             
 [46] multcomp_1.4-18           gtable_0.3.0              beachmat_2.6.4            globals_0.14.0            processx_3.5.3           
 [51] goftest_1.2-3             sandwich_3.0-1            rlang_1.0.2               splines_4.0.2             lazyeval_0.2.2           
 [56] spatstat.geom_2.3-2       broom_0.8.0               modelr_0.1.8              reshape2_1.4.4            abind_1.4-5              
 [61] backports_1.4.1           httpuv_1.6.5              tools_4.0.2               ellipsis_0.3.2            spatstat.core_2.4-0      
 [66] RColorBrewer_1.1-3        sessioninfo_1.2.2         ggridges_0.5.3            TFisher_0.2.0             Rcpp_1.0.8.3             
 [71] plyr_1.8.6                sparseMatrixStats_1.6.0   zlibbioc_1.36.0           RCurl_1.98-1.6            prettyunits_1.1.1        
 [76] ps_1.7.0                  rpart_4.1-15              deldir_1.0-6              pbapply_1.5-0             zoo_1.8-9                
 [81] haven_2.5.0               ggrepel_0.9.1             cluster_2.1.0             fs_1.5.2                  magrittr_2.0.3           
 [86] data.table_1.14.2         scattermore_0.8           openxlsx_4.2.5            reprex_2.0.1              lmtest_0.9-39            
 [91] RANN_2.6.1                tmvnsim_1.0-2             mvtnorm_1.1-3             fitdistrplus_1.1-6        pkgload_1.2.4            
 [96] hms_1.1.1                 mime_0.12                 xtable_1.8-4              readxl_1.3.1              gridExtra_2.3            
[101] testthat_3.1.4            compiler_4.0.2            KernSmooth_2.23-17        crayon_1.5.1              R.oo_1.24.0              
[106] htmltools_0.5.2           tzdb_0.3.0                mgcv_1.8-31               later_1.3.0               snow_0.4-4               
[111] lubridate_1.8.0           DBI_1.1.2                 dbplyr_2.1.1              MASS_7.3-51.6             Matrix_1.4-0             
[116] brio_1.1.3                cli_3.3.0                 R.methodsS3_1.8.1         rbibutils_2.2.7           parallel_4.0.2           
[121] qqconf_1.2.1              igraph_1.2.11             pkgconfig_2.0.3           sn_2.0.1                  numDeriv_2016.8-1.1      
[126] plotly_4.10.0             scuttle_1.0.4             spatstat.sparse_2.1-0     xml2_1.3.3                foreach_1.5.2            
[131] dqrng_0.3.0               multtest_2.50.0           XVector_0.30.0            rvest_1.0.2               callr_3.7.0              
[136] digest_0.6.29             RcppAnnoy_0.0.19          spatstat.data_2.1-2       cellranger_1.1.0          leiden_0.3.9             
[141] uwot_0.1.11               edgeR_3.32.1              DelayedMatrixStats_1.16.0 shiny_1.7.1               lifecycle_1.0.1          
[146] nlme_3.1-148              jsonlite_1.8.0            Rhdf5lib_1.12.1           desc_1.4.1                viridisLite_0.4.0        
[151] limma_3.46.0              fansi_1.0.3               pillar_1.7.0              lattice_0.20-41           pkgbuild_1.3.1           
[156] fastmap_1.1.0             httr_1.4.3                plotrix_3.8-2             survival_3.1-12           remotes_2.4.2            
[161] zip_2.2.0                 png_0.1-7                 iterators_1.0.14          stringi_1.7.6             HDF5Array_1.18.1         
[166] memoise_2.0.1             doSNOW_1.0.20             mathjaxr_1.6-0            irlba_2.3.5               future.apply_1.8.1       

Could I have some assistance?

The only thing I can think may be problematic is that I have provided a fairly small set of genes, totaling at 68 for this group, as that was the list returned by the Seurat FindConservedMarkers() method. The only other thing I can think of is that when I was calculating the beta and beta_var, my input data as log2FC not logFC. Should I adjust the calculation accoridngly to address this?

Thank you in advance, I am excited to use this tool as I analyze these data! Best, Christian

PottsC commented 2 years ago

Maybe the full traceback for the error will be helpful:

> cluster0_idea <- iDEA.louis(cluster0_idea)
  |                                                                                                      |   0%, ETA 00:09
Error in if (!is.na(res$annot_coef[2])) { : argument is of length zero
> traceback()
3: stop(results)
2: pbmclapply(1:num_annot, FUN = function(i) {
       res <- object@de[[i]]
       Annot <- rep(0, object@num_gene)
       Annotind = which(names(object@annotation) == names(object@de)[i])
       Annot[object@annotation[[Annotind]]] <- 1
       Annot <- Annot - mean(Annot)
       Annot <- as.matrix(data.frame(rep(1, object@num_gene), Annot))
       LouisMethod <- function(res, A) {
           numer <- (1 - res$pip) * res$pip * pnorm(res$beta, mean = 0, 
               sd = sqrt(res$sigma2_beta * res$sigma2_e)) * pnorm(res$beta, 
               mean = 0, sd = sqrt(res$sigma2_beta2 * res$sigma2_e))
           denom <- res$pip * pnorm(res$beta, mean = 0, sd = sqrt(res$sigma2_beta * 
               res$sigma2_e)) + (1 - res$pip) * pnorm(res$beta, 
               mean = 0, sd = sqrt(res$sigma2_beta2 * res$sigma2_e))
           sel_index <- which(!is.na(numer/denom^2))
           numer <- numer[sel_index]
           denom <- denom[sel_index]
           A <- A[sel_index, ]
           T <- matrix(0, ncol = 2, nrow = 2)
           T[1, 1] <- sum(as.numeric((numer/denom^2) * A[, 1] * 
               A[, 1]))
           T[1, 2] <- sum(as.numeric((numer/denom^2) * A[, 1] * 
               A[, 2]))
           T[2, 1] <- sum(as.numeric((numer/denom^2) * A[, 1] * 
               A[, 2]))
           T[2, 2] <- sum(as.numeric((numer/denom^2) * A[, 2] * 
               A[, 2]))
           annot_var <- diag(solve(res$info_mat - T))
           return(annot_var)
       }
       if (!is.na(res$annot_coef[2])) {
           try_test <- try(louis_var <- LouisMethod(res, Annot), 
               silent = T)
           if (class(try_test) == "try-error") {
               resTemp = NULL
           }
           else {
               resTemp = data.frame(annot_id = object@annot_id[Annotind], 
                   annot_coef = res$annot_coef[2], annot_var = res$annot_var[2], 
                   annot_var_louis = louis_var[2], sigma2_b = res$sigma2_beta)
           }
       }
       else {
           resTemp = NULL
       }
   }, mc.cores = getOption("mc.cores", num_core))
1: iDEA.louis(cluster0_idea)
YingMa0107 commented 2 years ago

Hi @PottsC,

Thank you for your interest in our package! And thank you for providing such a detailed description of your issue!

I agree with your thought that the total number of genes you provided in the summary stats is too small. iDEA requires all gene level summary statistics not just DE genes. When you used the FindConservedMarkers() or FindMarkers() function within Suerat, the default parameters will only output the significant genes. But you can actually modify the arguments in the FindConservedMarkers() function to output more genes.

There is a similar issue that we discussed before on the FindMarkers() function, https://github.com/xzhoulab/iDEA/issues/14. Based on the description of FindConservedMarkers(), it also contains the parameters to pass to FindMarkers, so I presume you should be able to pass some parameters in the FindConservedMarkers(), i.e. min.pct, logfc.threshold, only.pos, to make the function return more genes.

For example, you should be able to modify it as:

markers = FindConservedMarkers(... min.pct = 0,logfc.threshold = 0,only.pos = FALSE)`

Hope this helps!

PottsC commented 2 years ago

Hi there, I think that line of reasoning was correct, thanks so much for your help. I have resolved this issue by removing the FindMarkers restrictions and including more genes.

PottsC commented 2 years ago

Hello again! I am quickly reopening this issue to address a comment I made. This is no longer in response to the Louis p-value problems I was having before. My comment was as follows:

"The only other thing I can think of is that when I was calculating the beta and beta_var, my input data as log2FC not logFC. Should I adjust the calculation accoridingly to address this?"

Could you quickly let me know your thoughts on this? Seurat defaults to a base of 2 when calculating DE. I can change this and am wondering if I should or should not. In your tutorial you use LogFC (assuming that is base 10) to calculate the summary statistics. I'm thinking I should adjust this parameter in Seurat, am I correct? Apologies for the fairly basic statistics and math question, I am just trying to be thorough.

Thank you! Christian Potts

YingMa0107 commented 2 years ago

Hi @PottsC,

Thank you for posting your question here! When you calculate beta and beta_var correspondingly from the Seurat output, the p-value does not change, so the z-score will not change, which is invariant. In the methods section of iDEA paper, our model is actually equivalent to modeling using marginal z-statistics, where zj is the marginal z-score on the DE evidence for the j-th gene. As long as the p-value is consistent and then the z-score is consistent, the result is also consistent.

Also here I run a quick toy example for you to better understand this point with publicly available data (Note that this analysis might not have some biological meanings, just for evaluation purposes):

library(iDEA)
data(annotation_data)##### First make a R package:
library(Seurat)
library(SeuratData)
SeuratData::InstallData("pbmc3k")
#### let's use this data
pbmc <- LoadData("pbmc3k", type = "pbmc3k.final")
#### let's just randomly select two cell clusters and perform the DE
monocyte.de.markers <- FindMarkers(pbmc, ident.1 = "CD14+ Mono", ident.2 = "FCGR3A+ Mono")
pvalue <- monocyte.de.markers$p_val#### the pvalue column
zscore <- qnorm(pvalue/2.0, lower.tail=FALSE) #### convert the pvalue to z-score
#### for the original summary statistics
beta <- monocyte.de.markers$avg_log2FC## effect size
se_beta <- abs(beta/zscore) ## to approximate the standard error of beta
beta_var = se_beta^2  ### square
summary = data.frame(beta = beta,beta_var = beta_var)
rownames(summary) = rownames(monocyte.de.markers) ### or the gene id column in the res_DE results
idea <- CreateiDEAObject(summary, annotation_data[,1:10], max_var_beta = 100, min_precent_annot = 0.0025, num_core=10)
set.seed(1)
idea <- iDEA.fit(idea, fit_noGS=FALSE, init_beta=NULL, init_tau=c(-2,0.5), min_degene=5, em_iter=15, mcmc_iter=1000, fit.tol=1e-5, modelVariant = F, verbose=TRUE)
idea <- iDEA.louis(idea) ## 
> head(idea@gsea[,1:2])
                                          annot_id annot_coef
1                    GO_CELLULAR_RESPONSE_TO_LIPID  0.3546823
2                             GO_SECRETION_BY_CELL  0.1117387
3 GO_REGULATION_OF_CANONICAL_WNT_SIGNALING_PATHWAY  0.4934008
4            GO_REGULATION_OF_DEVELOPMENTAL_GROWTH  0.3888290
5        GO_CELLULAR_RESPONSE_TO_EXTERNAL_STIMULUS -0.1486579
6                  GO_ACTIN_FILAMENT_BASED_PROCESS  0.4416084
##### Let's modify the beta to be under the log base
pvalue2 <- monocyte.de.markers$p_val#### the pvalue column
zscore2 <- qnorm(pvalue2/2.0, lower.tail=FALSE) #### convert the pvalue to z-score
beta2 <- log(2^(monocyte.de.markers$avg_log2FC))## effect size under natural logarithm
se_beta2 <- abs(beta2/zscore2) ## to approximate the standard error of beta
beta_var2 = se_beta2^2  ### square
summary2 = data.frame(beta = beta2,beta_var = beta_var2)
rownames(summary2) = rownames(monocyte.de.markers) ### or the gene id column in the res_DE results
idea2 <- CreateiDEAObject(summary2, annotation_data[,1:10], max_var_beta = 100, min_precent_annot = 0.0025, num_core=10)
set.seed(1)
idea2 <- iDEA.fit(idea2, fit_noGS=FALSE, init_beta=NULL, init_tau=c(-2,0.5), min_degene=5, em_iter=15, mcmc_iter=1000, fit.tol=1e-5, modelVariant = F, verbose=TRUE)
idea2 <- iDEA.louis(idea2) ## 
> head(idea2@gsea[,1:2])
                                          annot_id annot_coef
1                    GO_CELLULAR_RESPONSE_TO_LIPID  0.3710055
2                             GO_SECRETION_BY_CELL  0.1170953
3 GO_REGULATION_OF_CANONICAL_WNT_SIGNALING_PATHWAY  0.5483001
4            GO_REGULATION_OF_DEVELOPMENTAL_GROWTH  0.4773585
5        GO_CELLULAR_RESPONSE_TO_EXTERNAL_STIMULUS -0.1952890
6                  GO_ACTIN_FILAMENT_BASED_PROCESS  0.5636364

#### Let's check the estimated posterior inclusion probability (PIP) estimates inferred by the Bayesian model averaging (BMA) approach (details in the paper) across all available gene sets

idea <- iDEA.BMA(idea) ##
idea2<- iDEA.BMA(idea2) ##
> cor(idea@BMA_pip,idea2@BMA_pip)
         BMA_pip
BMA_pip 0.996566
####  so the posterior inclusion probability (PIP)  estimates are quite consistent

As you see, the annotation_coefficient estimates and PIP estimates are consistent, the little variability here might be due to the parameter estimation in the MCMC sampling procedure, but it will not affect the rank of the gene sets from the top enriched gene sets to the bottom ones. I hope this is helpful for you to understand this question! Let me know if you have further questions!