xzhoulab / iDEA

Differential expression (DE); gene set Enrichment Analysis (GSEA); single cell RNAseq studies (scRNAseq)
GNU General Public License v3.0
32 stars 11 forks source link

No significant pathways potentially due to p_values of 0? #14

Closed Tommy0398 closed 3 years ago

Tommy0398 commented 3 years ago

Hi, I've performed DE analysis within seurat on a specific cluster across two conditions and used the DE summary table within iDEA to try and identify certain pathways that are significant using the inbuilt mouse gene sets.

After fitting the model with the R data my results had no significant pathways and I'm worried that this is due to my DE data having p-values that were 0 which then makes the beta_variance for the summary_data also 0.

There were quite alot of DE genes in my summary data so there being no significant pathways or any that were near significant is unexpected.
Do you know if that does cause problems with the analysis?

YingMa0107 commented 3 years ago

Hi @BiostormSilver,

Thanks for your attention to iDEA. If the beta_variance = 0, it will be filtered in the CreateiDEAObject() function when you create an iDEA object.

Could you please provide me more details for me to figure out the issue? For example, which DE software you used for your DE analysis? What does the distribution of your p-value as well as some output of iDEA look like? (If it is allowed to be shared)

## ===== iDEA INPUT SUMMARY ==== ##
## number of annotations:  100 
## number of genes:  15280 
## number of cores:  10 
## fitting the model with gene sets information... 
  |======================================                                       |  50%, ETA 6:30
Tommy0398 commented 3 years ago

Hi, thanks for your reply.

I used the findMarkers function within Seurat to produce the DE table using default parameters. The p values range from 1e-19 to 0 or 1e-301 ignoring the 0's and there are around 1600 DE genes.

The dataset is from mice so I used ideas mice gene set and just followed the tutorial to produce the idea object.

This is the input output:

## ===== iDEA INPUT SUMMARY ==== ##
## number of annotations:  5534
## number of genes:  1198
## number of cores:  2
## fitting the model with gene sets information...

The idea output looks after fitting the model and performing p value correction looks like :

> head(idea@gsea)
    annot_id  annot_coef  annot_var annot_var_louis sigma2_b pvalue_louis
1 GO:0000003 -0.01140289 0.04957866        2.450662 197.5817    0.9941882
2 GO:0000018 -0.04679229 1.33740694       61.282368 198.6804    0.9952308
3 GO:0000041  0.01284539 0.36702093       44.798565 196.8325    0.9984687
4 GO:0000045 -0.02457407 0.28913481       12.753729 196.9754    0.9945097
5 GO:0000049  0.10041847 1.00585228      106.405566 198.1000    0.9922328
6 GO:0000060 -0.02687010 0.44788764       22.693382 198.4677    0.9954995
     pvalue
1 0.9591570
2 0.9677252
3 0.9830835
4 0.9635485
5 0.9202444
6 0.9679736

All the p values are around the same value

YingMa0107 commented 3 years ago

Hi @BiostormSilver, Thank you for providing the details! I was wondering why the number of genes is only 1198 (based on the iDEA input summary). When you used the findMarkers function within Seurat, did you only output the significant genes? Maybe try to output all genes (please refer to this post and set the corresponding parameters in Seurat (https://github.com/satijalab/seurat/issues/2154). iDEA requires all gene level summary statistics not just DE genes'

Tommy0398 commented 3 years ago

Thanks, that does seem to be the problem.

I managed to increase the number of genes to 6000, and it found a few pathways so with more genes it'd probably find more pathways. The majority of genes were still removed in the iDEA object because they had original p_values of 1 that became an Inf beta_var during the zscore <- qnorm(pvalue/2.0, lower.tail=FALSE) (0) se_beta <- abs(beta/zscore) (Inf) calculation.

Is this the right behaviour for calculating beta_var or should I try a workaround to keep those genes in the iDEA object for the calculation?

YingMa0107 commented 3 years ago

Hi @BiostormSilver,

I'm not sure if I understand your case correctly. Did you mean that when you include all genes (DE + non-DE), the total number of gene is 6000? And How many genes have p-vaue = 1 (Based on my previous experience with MAST, there are only few genes with P-value = 1.0)? What is the original dimension of your raw count data?

And which pathway data did you use? (I saw from your previous output that there are 5534 annotations included in the analysis).

If you follow the tutorial to calculate the variance beta, that should be correct. If there are many genes that have p-value = 1.0 and you can get the appropriate standard error for those genes, you can include it., otherwise when you create iDEA object, it will filter out the inf beta_var values.

Tommy0398 commented 3 years ago

Hi @YingMa0107 ,

6000 was the total number of genes that were used within the iDEA object and that was including DE and non-DE genes, although there is only ~600 genes that are not significantly DE that don't have a p_value of 1. By default the seurat DE test uses a Wilcox test so that may give the difference in expected results from MAST.

The original total number I inputted to iDEA was around 19000 and 11000 of those had a p_value of 1 (which then became Inf beta_var following the tutorial).

The original dimensions of my raw count data are : 19000 features across 58219 cells (both conditions combined).

I used the mouseGeneSets as the annotation but I have also tested a gene set from GSEA and that gives a similar result.

YingMa0107 commented 3 years ago

Hi @BiostormSilver,

Based on your description, the summary statistics with a large proportion of p-value = 1.0 (57.8%) does not seem to be normal. I would recommend you to try a different DE method implemented in Seurat such as setting the test.use = "MAST" or test.use = "DESeq2" if time is allowed.

When you input Seurat, the raw dimension is 19000 and 58219, and then when you input iDEA, the number of genes in the summary statistics is still 19000, did you follow Seurat to perform any quality control (i.e. filtered out low quality cells, low quality genes)?

And Did any of the DE genes from Seurat result or the significant gene sets from iDEA make sense to you?

Tommy0398 commented 3 years ago

Hi @YingMa0107

I have tried out using MAST for DE now and it seems to give a similar result to the Wilcox test.

Quality control was performed in Seurat (Unique genes, MT% and totalRNA as well as extra doublet removal), and all the visualizations looked reasonable so the data should be fine. I believe the DE genes make sense for the comparison across WT and KO, the enriched gene sets were fairly generic so I can't draw many conclusions from them.

This is a comparison of the same cell type across two conditions so that may explain the similarity of gene expression for most of those genes for them to give a p_value of 1, however I don't have a lot of experience comparing single cell data so I can't say anything for certain.