xzhoulab / iDEA

Differential expression (DE); gene set Enrichment Analysis (GSEA); single cell RNAseq studies (scRNAseq)
GNU General Public License v3.0
32 stars 11 forks source link

How can we draw the result into a figure? #1

Closed zh-Bian closed 3 years ago

zh-Bian commented 4 years ago

After we bulit the iDEA model uesd the data, there is gsea result in the iDEA@gsea. Could we draw them into a figure but not write in a table? I tried to used the other function but the structure of the object is different. So I download the program in the https://github.com/xzhoulab/iDEA-Analysis , but it did not tell us how to draw the gsea result into a figrue. So I turn to you for help, could you give us an example ahout how to make the result into the figure? Thank you.! @sqsun @YingMa1993

YingMa0107 commented 4 years ago

Hi,

Thanks for your interest in our package iDEA! Did you mean that you want to draw the idea@gsea results into a figure like bubble plot in the iDEA paper i.e. Figure 4e,5e,6e? If so, I've uploaded one example plot data here https://github.com/xzhoulab/iDEA-Analysis/tree/master/figures. You can follow the example code bubble_plot.R to generate the bubble plot.

Besides that, here is the example code for generating the plot data frame

-------------------------------------------------------------

create the plot data frame for bubble plot

-------------------------------------------------------------

Extract the category information for the human gene sets

library(iDEA)

you can download the data here https://github.com/xzhoulab/iDEA/tree/master/data

or extract it from the updated package.

data(humanGeneSetsInfo) Category = humanGeneSetsInfo[,c(4:5)] colnames(Category) = c("Category","annot_id") count = data.frame(annot_id = names(idea@annotation),Count = unlist(lapply(1:length(idea@annotation),function(i){length(idea@annotation [[i]])})))

merge the data with gsea results of iDEA

df = merge(idea@gsea,count,by = "annot_id") df = merge(df,Category,by = "annot_id")

create plotdata

plotdata = data.frame(IDNum = c(1:nrow(df)), ID = df$annot_id, Term = df$annot_id, Category = df$Category, Count = df$Count, Pvalue_Louis = idea@gsea$pvalue_louis, Log10_Pvalue_Louis = -log(idea@gsea$pvalue_louis,base = 10)) CateoryLevels = c("IMMUNOLOGIC SIGNATURES", "CHEMICAL AND GENETIC PERTURBATIONS", "GO BIOLOGICAL PROCESS", "GO MOLECULAR FUNCTION","GO CELLULAR COMPONENT","ONCOGENIC SIGNATURES","REACTOME","KEGG","PID","BIOCARTA") plotdata = plotdata[order(match(plotdata$Category, CateoryLevels)),]

The following code are just the same as the code in https://github.com/xzhoulab/iDEA-Analysis/blob/master/figures/bubble_plot.R

select the top biologically significant gene set in the bubbleplot to show

SigSelectTem = plotdata$Term[order(plotdata$Pvalue_Louis,decreasing = F)[1:5]] Sig = plotdata[which(plotdata$Term %in% SigSelectTem),]

you can modify your text label here .i.e. not capitalize the gene set term

Sig$Term = tolower(Sig$Term)

-------------------------------------------------------------

ggplot function for bubble plot

-------------------------------------------------------------

library(ggplot2) library("ggrepel") p4 = ggplot(plotdata, aes(x = IDNum, y = Log10_Pvalue_Louis,color =Category))+ geom_point(shape = 19,aes(fill = Category, size = Count,shape =19),alpha=0.8)+ scale_radius()+ labs(x = "", y =expression(paste(bold(-log[10]),bold("("),bolditalic(p),bold("-value)"))))+ theme(plot.margin = margin(1, 1, 1, 1, "cm"), axis.text.x=element_blank(), plot.title = element_text(lineheight=.8, face="bold"), axis.text = element_text(size = 30), axis.line = element_line(colour = 'black'), axis.ticks = element_line(colour = 'grey80'), axis.title = element_text(size = 40, face = 'bold'), axis.title.y = element_text(margin = margin(t = 0, r = 20, b = 0, l = 0)), legend.title=element_text(size=24,face = 'bold'), legend.text=element_text(size=24), panel.background = element_blank(), panel.grid.minor = element_blank(), panel.grid.major = element_line(colour = 'white'), plot.background = element_blank(), legend.key = element_rect(color = "transparent", fill = "transparent"))+ geom_hline(yintercept = 1.82, col = 'black',linetype = 2,size=2)+ guides(color = guide_legend(order = 1,override.aes = list(alpha =1,size=7)), size = guide_legend(order = 2,override.aes = list(alpha = 1,shape=21)),fill = FALSE)+ labs(size = "Gene set size")+ scale_color_manual(values=c("salmon","gold2","#42d4f4","#3cb44b","chocolate2","#4363d8","#bfef45","#911eb4","#f032e6","#a9a9a9"))+ scale_fill_manual(values = c("salmon","gold2","#42d4f4","#3cb44b","chocolate2","#4363d8","#bfef45","#911eb4","#f032e6","#a9a9a9"))+ theme(legend.direction = "vertical")+ theme(legend.position = c(0.30, 0.85))+ theme(legend.box = "horizontal")+ theme(legend.title.align = 0)+ geom_text_repel( data = Sig, aes(label = Term), force=1.0, point.padding=unit(1.1,'lines'), box.padding = unit(0.1, "lines"), vjust=-1.4, hjust = 0.2, size = 8, direction='y', nudge_x=0.2, nudge_y = 0.2, segment.size=0.2, col = "black")

-------------------------------------------------------------

Write out the .png file

-------------------------------------------------------------

png("~/Figure5e.png",width = 12500, height = 7500, res = 600) p4 dev.off()

You may have to change the legend.position according to the number of gene sets you tested and the width and height of the figure you set. And all the parameters in the theme can be modified by yourself. Hope it helps!

Best, Ying

On Sun, May 3, 2020 at 8:59 PM zh-Bian notifications@github.com wrote:

After we bulit the iDEA model uesd the data, there is gsea result in the iDEA@gsea. Could we draw them into a figure but not write in a table? I tried to used the other function but the structure of the object is different. So I download the program in the https://github.com/xzhoulab/iDEA-Analysis , but it did not tell us how to draw the gsea result into a figrue. So I turn to you for help, could you give us an example ahout how to make the result into the figure? Thank you.! @sqsun https://github.com/sqsun @YingMa1993 https://github.com/YingMa1993

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/xzhoulab/iDEA/issues/1, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QA2ISK4NL3M45DPDRYE3RPYHPVANCNFSM4MYLXUZQ .

-- Ying Ma Ph.D. Candidate Department of Biostatistics,School of Public Health. University of Michigan, Ann Arbor.

zh-Bian commented 4 years ago

Thank you very much that I can draw my gsea result into a figure in your paper.Thank you again. And I also notice that in the 'mouseGeneSets', the rowname is MGI, but not the GO or SYMBOL. I'm a little curious that why not use the SYMBOL but MGI. I download the GAF 2.0 format form the website you had provided(http://www.informatics.jax.org/downloads/reports/index.html#go). I found the relationship between the MGI and GO is not one-to-one. One SYMBOL could correspond to more than one GO, however these GO were not correspond to one MGI. 微信截图_20200505110528

Whether such a transformation relationship will produce errors ? So I am a little curious why not use the SYMBOL name in the mouseGeneSet like the humanGeneSet ? Thank you so much for teaching me how to draw the result into a figure. Thank you! @YingMa1993

Best, Bian

YingMa0107 commented 4 years ago

Hi,

You are vey welcome! The MGI ID here is just from the mouse gene ontology terms we downloaded. Basically, you can convert them to gene symbol by either using the R package "biomaRt" or the website here http://www.informatics.jax.org/batch/summary. GO is short for gene ontology annotations here, one of its aim is to annotate genes and gene products. Thus, genes and gene products can be annotated to one GO term or multiple GO terms. The MGI ID here is the ID for the gene level, the Gene Ontology terms are for the gene set level.

Best, Ying

princy149 commented 4 years ago

hi, could you tell me about annotation data, used in iDEA, how you prepare this, actually I found in this 100 columns for GO, how can i prepare this type of annotation data for my summary data ? pleases suggest me.

YingMa0107 commented 4 years ago

Hi, We have provided the gene sets data in our package. i.e. library(iDEA) data(humanGeneSets) data(mouseGeneSets) More details are in the tutorial https://xzhoulab.github.io/iDEA/, in the section Required input data.

princy149 commented 4 years ago

Thank you very much for your prompt reply and proving us iDEA, can I run iDEA for my summary data for Cancer single cell RNAseq data using your annotation_data ? or I need to prepare new annotation data using your given human gene sets?(In this case, how you prepare annotation_data, like a matrix ?

YingMa0107 commented 4 years ago

Hi, You are very welcome! You can run iDEA for your summary statistics for your own data, as long as your summary statistics satisfy the required format of input as the tutorial. And the gene names of your summary statistics should be matched with the gene name in the human gene sets. It also really depends on which gene sets you want to analyze, here, in our human gene sets, we have provided gene sets come from databases i.e. BioCarta, KEGG, GO, Reactome... There are also other sources of gene annotation data. As for your second question, when you load the human gene sets, i.e. data(humanGeneSets), it is already in the matrix format.

princy149 commented 4 years ago

I really appreciate all your help. Actually we are in middle of revising the manuscript for which I want to use iDEA, and your prompt reply help us lot.now I successfully run my summary statistics data to with given G0 annotation data, now i m trying for BIOCARTA, KEGG and PID but during running of this i got some errors, 1)I checked the genesets as:

unique(sapply(X=strsplit(colnames(humanGeneSets), split = "_"), FUN = "[", 1))    [1] "NAKAMURA"             "WEST"                 "WINTER"               "PARENT"               "PYEON"                "PICCALUGA"              [7] "LU"                   "KORKOLA"              "WATANABE"             "HOLLMANN"             "LIU"                  "ONKEN"                 [13] "BERTUCCI"             "SCHUETZ"              "DAVICIONI"            "FOURNIER"             "FRASOR"               "KIM"                   [19] "SENGUPTA"             "SOTIRIOU"             "GAZDA"                "CHEMNITZ"             "IGARASHI"             "ZHONG"                 [25] "TURASHVILI"           "CHANDRAN"             "WILCOX"               "FULCHER"              "GARY"                 "ZHOU"                   [31] "SAMOLS"               "HOOI"                 "PRAMOONJAGO"          "PUIFFE"               "THUM"                 "CASORELLI"             [37] "CORRE"                "HUTTMANN"             "DIAZ"                 "DEURIG"               "CHARAFE"              "DOANE"                 [43] "WANG"                 "LAIHO"                "BORCZUK"              "ROY"                  "GRABARCZYK"           "NEWMAN"                 [49] "HORIUCHI"             "GAL"                  "BASAKI"               "NOJIMA"               "WIKMAN"               "RODRIGUES"             [55] "VECCHI"               "BARRIER"              "SMIRNOV"              "SLEBOS"               "JAEGER"               "OSMAN"                 [61] "GINESTIER"            "OSWALD"               "GARGALOVIC"           "TAKEDA"               "BOGNI"                "BILBAN"                 [67] "HOEBEKE"              "PEPPER"               "AKL"                  "RHEIN"                "MULLIGHAN"            "TONKS"                 [73] "PAPASPYRIDONOS"       "DITTMER"              "UDAYAKUMAR"           "ODONNELL"             "SENESE"               "LEE"  .......

after this I select the KEGG related pathway in the given humanGenesets

as: sampleFiles.KEGG <- grep("KEGG",colnames(humanGeneSets),value=TRUE)

the prepare sub matrix for KEGG:

matrix.KEGG <-humanGeneSets[ , sampleFiles.KEGG]

idea.BIOCARTA <- CreateiDEAObject(summary, matrix.BIOCARTA, max_var_beta = 100, min_precent_annot = 0.0025, num_core=10) idea.BIOCARTA <- iDEA.fit(idea.BIOCARTA,

  • fit_noGS=FALSE,
  • init_beta=NULL,
  • init_tau=c(-2,0.5),
  • min_degene=5,
  • em_iter=15,
  • mcmc_iter=1000,
  • fit.tol=1e-5,
  • modelVariant = F,
  • verbose=TRUE)

    ===== iDEA INPUT SUMMARY ====

    number of annotations:  39

    number of genes:  1885

    number of cores:  1

    fitting the model with gene sets information...

      |=================================================================================  |  97%, ETA 00:12 warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

warning: inv_sympd(): given matrix is not symmetric

error: inv_sympd(): matrix is singular or not positive definite Error in EMMCMCStepSummary(object@summary[, 1], object@summary[, 2], as.matrix(Annot),  :   inv_sympd(): matrix is singular or not positive definite   |===============================================================================| 100%, Elapsed 07:49

how can i remove this errors.....

Thank you.

YingMa0107 commented 4 years ago

Hi,

This is caused by the problem of calculating the matrix inverse in the algorithm, normally when the summary statistics are appropriate and the coverage rate of the gene set is not extremely small, it is stable. We mentioned this in the manuscript discussion part. Based on your error, I just substitute the inverse function with generalized inverse function. And also tested the example summary statistics with the same number of genes (eg. 1885 genes just liker yours, or less 500 genes) using BIOCARTA gene sets. It can run smoothly.

Please update the package and re-run it again to see if it helps. If it does not help, could you please tell me more details about your summary statistics input and then we can try to figure out why you have this issue.

Best, Ying

princy149 commented 4 years ago

Hi,

Thank you for your help.