xzhoulab / iDEA

Differential expression (DE); gene set Enrichment Analysis (GSEA); single cell RNAseq studies (scRNAseq)
GNU General Public License v3.0
32 stars 11 forks source link

Question about creating a gene set database #16

Closed stanaka6 closed 3 years ago

stanaka6 commented 3 years ago

Hi iDEA team,

Thank you for the great package. Could you please tell me or provide a sample code for how to create a gene set database (gene specific annotation files in the tutorial) from gaf or gmt file, or simply from a set of gene lists, from databases (GO, KEGG, etc.)? I want to create a zebrafish gene set database to perform iDEA. I understand that the row names are genes and column names are annotation names/gene set names. I might miss some info, but I would be grateful if you could explain the meaning of the number (0 or 1?) in each row.

Best,

YingMa0107 commented 3 years ago

Hi @stanaka6,

Thanks for your attention to iDEA. Here the 0/1 means that if the gene is included in the corresponding gene set, it is coded as 1, otherwise coded as 0. Here are some example code for your to convert the gmt format to annotation matrix that can be used for iDEA. I downloaded the gmt example data from [http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp]:

#### please install this library if you haven't installed yet
#### load the library
library(qusage)
#### read in the gmt for mated gene list, e.g., c2.cp.v7.4.symbols.gmt
exampleSet = read.gmt("./c2.cp.v7.4.symbols.gmt")
#### All genes you want to analyzed, here I use all genes that exist in the c2 pathway as an example. I recommend you to use the union of the genes in your summary statistics and the gene sets you want to analyzed. 
AllGenes = unique(unlist(exampleSet))
print(length(AllGenes))
## 13351
#### make the annotation
Annotation = sapply(1:length(exampleSet),function(iSet){
##### if the gene is included in the ith set 
as.numeric(AllGenes %in% exampleSet[[iSet]])})
#### give the column names and row names of the annotation
rownames(Annotation) = AllGenes
colnames(Annotation) = names(exampleSet)
print(Annotation[1:5,1:5])
      BIOCARTA_GRANULOCYTES_PATHWAY BIOCARTA_LYM_PATHWAY BIOCARTA_BLYMPHOCYTE_PATHWAY BIOCARTA_CARM_ER_PATHWAY BIOCARTA_LAIR_PATHWAY
CXCL8                             1                    1                            0                        0                     1
IFNG                              1                    0                            0                        0                     0
IL1A                              1                    1                            0                        0                     1
CSF3                              1                    0                            0                        0                     0
SELP                              1                    0                            0                        0                     1

Hope this is helpful! Let me know if you have further questions!

stanaka6 commented 3 years ago

Hi @YingMa0107,

Ah, that is a binary file. Thank you so much for sharing the code! I was able to convert my gene sets to that style.

I have a question regarding your comment in the R script:

here I use all genes that exist in the c2 pathway as an example. I recommend you to use the union of the genes in your summary statistics and the gene sets you want to analyzed.

Did you mean that it is better to filter the genes in the annotation file (like, removing the genes that are NOT in the gene set from my summary stat)? If so, could you please explain the reason? For the efficiency of analysis?

Thanks again!

YingMa0107 commented 3 years ago

Hi @stanaka6,

Here in the comment I mean when you create the binary annotation matrix from yourself, you should use the union of the genes in your summary statistics and the gene sets you want to analyzed (here the union is not intersection, it means the genes that include both the genes existing in the summary statistics and in the annotation file). Thus, in this way, the initial input of annotation file contains all genes information. I did not mean to filter any genes.

I added this comment because I could not have access to your summary statistics, so I can only use all genes in the C2 pathway as an example to show you how to create the annotation matrix. But the C2 pathway only contains ~13k genes, less than the number of genes in a common summary statistics file.

Let me know if this answers your question!

stanaka6 commented 3 years ago

Hi @YingMa0107,

Thank you so much for your answer. If I understand correctly... if there are genes in my summary statistics, but not in the annotation matrix, I need to add those genes to the annotation matrix. Am I understanding right?

For example: stat: my DE summary statistics GO.annotation: my binary annotation matrix

length(rownames(GO.annotation)) 
## [1] 17799
# my GO annotation matrix has 17799 genes

length(rownames(stat))
## [1] 16075
# my stat summary contains 16075 genes

# Extract genes
GO_genes <- rownames(GO.annotation) 
stat.genes <- rownames(stat)

# Find genes that exist only in summary stat, but NOT in the annotation file 
GO.diff <- setdiff(stat.genes, GO.genes)
length(GO.diff)
## [1] 4399
# There are 4399 genes, which are not in the annotation file

# Create a binary matrix for those 4399 genes 
GO.mat <- matrix(0, length(GO.diff),length(colnames(GO.annotation))) 
rownames(GO.mat) <- GO.diff

# Add created matrix to the original annotation matrix 
GO.annotation_new <- rbind(GO.annotation, GO.mat)
length(rownames(GO.annotation_new))
# [1] 22198

So now, is this "GO.anotation_new" ready for iDEA analysis?

Noted that I generated my summary statistics by following Seurat FindMarkers function (I haven't calculated standard errors for running iDEA):

stat <- FindMarkers(se,
                    ident.1 = "1_conditionA",
                    ident.2 = "1_conditionB",
                    test.use = "MAST",
                    min.cells.group = 1,
                    min.cells.feature = 1,
                    min.pct = 0,
                    logfc.threshold = 0,
                    only.pos = FALSE)

I really appreciate your help!

YingMa0107 commented 3 years ago

Hi @stanaka6,

Yes, now your GO.anotation_new should be correct for iDEA analysis!

For your convenience, I copied the code that was used to calculate the beta_var here:

# Assume you have obtained the MAST's results from Seurat, "stat", which contains the column Pr(>Chisq) or LogFC. 
res_mast = stat
pvalue <- res_mast$`Pr(>Chisq)` #### the pvalue column
zscore <- qnorm(pvalue/2.0, lower.tail=FALSE) #### convert the pvalue to z-score
beta <- res_mast$LogFC ## effect size
se_beta <- abs(beta/zscore) ## to approximate the standard error of beta
beta_var = se_beta^2  ### beta variance 
summary = data.frame(beta = beta,beta_var = beta_var) ## the first column should be beta and the second should be beta_var
## add the gene names as the rownames of summary
rownames(summary) = rownames(res_mast) ### or the gene id column in the res_mast results

Then you can use the summary matrix and the binary GO.annotation_new matrix as the input for CreateiDEAObject() function and follow the tutorial to run iDEA analysis.

stanaka6 commented 3 years ago

Hi @YingMa0107,

I just want to make sure the one more thing about the calculation. Should I use unadjusted p-values (p_val)?

My summary statistics from Seurat FindMarkers function with MAST looks like this:

head(stat)
#        X        p_val  avg_log2FC pct.1 pct.2    p_val_adj
# 1   ARF5 3.336741e-67 -1.50927984 0.973 0.979 5.363812e-63
# 2    gh1 6.909086e-55 -2.65588280 0.960 0.995 1.110636e-50
# 3  rpl26 8.810527e-49 -0.75162853 0.848 0.843 1.416292e-44
# 4  tshba 3.537008e-36 -1.01662695 0.473 0.784 5.685740e-32
# 5 ahcyl1 1.014933e-29  0.07827821 0.670 0.352 1.631506e-25
# 6   rbmx 1.270280e-29 -0.07157491 0.540 0.311 2.041975e-25

p_val_adj is adjusted p-values based on Bonferroni correction. If I understand correctly, in MAST's result, Pr(>Chisq) represents unadjusted p-values.

Thanks!

YingMa0107 commented 3 years ago

Hi @stanaka6,

Yes, you should calculate the beta_var based on the unadjusted p-value since that is related to the original test statistics.

stanaka6 commented 3 years ago

I see. Thank you so much for all your help, @YingMa0107 . I really appreciate it!

stanaka6 commented 3 years ago

Hi @YingMa0107,

I am sorry for spamming you, but I have questions about data interpretation. My purpose for using iDEA is to see which pathways are enriched in condition A when comparing condition B. So, I would like to illustrate a bubble plot (like in your paper) for showing this result. My questions are:

  1. For the purpose above, do I need to select "positively upregulated" pathways? If so, should I use the pathways with positive anno_coef (coefficient) in the output from iDEA.louis?
  2. Is it better to select positively changed genes when I make a summary statistics from Seurat FindMarkers? In this post, there is a suggestion that we should get kind of all genes' summary statistics. So, in this time, I used all genes including both up and down-regulated in condition A compared to condition B (with setting only.pos = FALSE when I ran FindMarkers).

Any thought would be appreciated!

vravik commented 3 years ago

Hi @YingMa0107 , I have a similar question. I am having trouble interpreting the results of iDEA. It is unclear what annot_coef for the genesets refers to. I am also interested in knowing pathways that are significantly up or downregulated between conditions (enriched in the upreg- or downreg genes), and at the moment it is not evident how I could do this. Any guidance is much appreciated.

YingMa0107 commented 3 years ago

Hi @stanaka6 and @vravik,

From iDEA's model perspective, the anno_coef which is the tau parameter in iDEA's model (Details see Methods section if you are interested in) is defined as a gene set enrichment parameter that determines the odds ratio of DE for genes inside the gene set versus genes outside the gene set. That is, iDEA cannot tell if the gene set is upregulated/downregulated in a specific condition, the sign of annot_coef only determines whether DE genes are enriched or depleted in the gene set.

And for @stanaka6 question 1, you can use select pathways with positive anno_coef and significant pvalue_louis. For your question 2, you still need to use all genes because iDEA models all genes jointly, but you need to pay attention to the definition of the annot_coef that the sign of it only determines whether the DE genes are enriched or not.

stanaka6 commented 3 years ago

I see. Very clear. Thank you very much for your help, @YingMa0107 !

Best,