Closed blawney closed 2 years ago
Update:
The most important data is the p-value and the fraction of enriched genes. So let's change the arrangement of the plot to the following:
- The x-axis is the p-value (`elim`). Here, we want to show more significant values off to the right. Hence, instead of plotting `elim` directly, plot -log10(`elim`). This transformation takes really small values to large values (e.g. 0.00001 goes to 5 and 0.01 goes to 2).
- The size of the points is `significant/annotated`.
- The color of the points is the size of the gene set (`annotated`). Warmer/hotter colors are for larger sets.

A couple of other notes:
- Near the "Annotated" legend, let's put a tooltip (with the "info" symbol) that says: "The number of annotated genes refers to the total number of genes associated with a particular GO term. Note that the colors are log-scaled for improved dynamic range."
- Near the "significant/annotated" legend, the tooltip should say: "This provides the fraction of annotated genes which are deemed significant by the selected threshold for differential expression."
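The two transformations mentioned in the update (the -log10 of the p-value for the x-axis, and the log-scaled color for `annotated`) could be sketched roughly as follows. The function names and the floor value are assumptions made for illustration; nothing here is taken from the WebMeV code.

```javascript
// Hypothetical helpers; names are illustrative only.

// Map a p-value to -log10(p). The floor avoids Infinity if a p-value
// is ever reported as exactly 0 (the floor value is an assumption).
function negLog10(p, floor = 1e-300) {
  return -Math.log10(Math.max(p, floor));
}

// Place a gene count on [0, 1] using a log scale, given the smallest and
// largest counts in the data. The result can be fed to any color ramp.
function logNormalize(count, minCount, maxCount) {
  if (minCount === maxCount) return 0.5; // degenerate case: a single value
  const t =
    (Math.log10(count) - Math.log10(minCount)) /
    (Math.log10(maxCount) - Math.log10(minCount));
  return Math.min(1, Math.max(0, t)); // clamp values outside the domain
}
```

With these, a term with `elim` of 0.01 lands at 2 on the x-axis, and a term annotated with 10 genes sits halfway along a color ramp that spans 1 to 100 genes.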
Added in 975d8b700e2b5444423dfed3ae48ee0ef0411083
topGO visualization
The goal is to create an interactive plot as an alternative to the table-based view. This type of plot is common in academic publications, so we are following convention here.
Short background
"GO" stands for gene ontology. It's a collection of terms that can be represented as a tree. For instance:
Each node in that tree is a GO "term". The terms are arranged in a tree-like hierarchy (strictly a directed acyclic graph, since a term can have more than one parent) and are connected through relationships (e.g. "is a" or "has a"). There's a huge online database of all the terms with explanations, etc.
Associated with each GO term is a list of genes/proteins; that is, we know from experiments that certain genes perform certain functions. The GO databases basically link each term (e.g. cell death) to a list of associated genes/proteins.
The analysis we chose (topGO) attempts to find GO terms which might be relevant for a particular dataset. The input is a list of genes, which perhaps originates from a "differential gene expression" analysis examining genes that are different between healthy and cancerous tissue. With that list of genes, we can look through the GO terms and run statistical tests to see if it's probable that the GO term is related. For example, consider a term (e.g. cell death) that has 100 genes associated with it. Out of those 100 genes, if we see 80 that are in our list (from the differential expression analysis), that's probably a sign that something is happening with cell death processes.
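To give a feel for the kind of statistical test involved, here is a minimal sketch of a generic hypergeometric over-representation test. This is an assumption chosen for illustration and is not necessarily what topGO computes (its `elim` method in particular does more than this); the function names and the idea of a gene "universe" size are introduced here and do not come from the issue.

```javascript
// Generic hypergeometric over-representation test (illustrative only;
// topGO's own algorithms, e.g. elim, do more than this).

// log(n!) for n = 0..max, built cumulatively so large counts don't overflow.
function logFactorials(max) {
  const table = new Array(max + 1).fill(0);
  for (let i = 1; i <= max; i++) table[i] = table[i - 1] + Math.log(i);
  return table;
}

// P(X >= observed), where X counts how many of the `listSize` selected genes
// fall in a term annotated with `annotated` genes, out of `universe` genes.
function enrichmentPValue(universe, annotated, listSize, observed) {
  const lf = logFactorials(universe);
  const logChoose = (n, k) => lf[n] - lf[k] - lf[n - k];
  // Only overlaps that are actually possible contribute to the tail sum.
  const lo = Math.max(observed, listSize - (universe - annotated), 0);
  const hi = Math.min(annotated, listSize);
  let p = 0;
  for (let k = lo; k <= hi; k++) {
    p += Math.exp(
      logChoose(annotated, k) +
        logChoose(universe - annotated, listSize - k) -
        logChoose(universe, listSize)
    );
  }
  return Math.min(1, p); // guard against floating-point drift above 1
}
```

For the example above, one would call something like `enrichmentPValue(universe, 100, listSize, 80)`, where `universe` (all genes tested) and `listSize` (genes in the input list) are not given in the text and would come from the analysis itself. A small p-value says that an overlap this large is unlikely by chance.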
A basic visualization
We see plots that typically look like this:
Note that the plot shows BP, CC, and MF (biological process, cellular component, and molecular function), which are broad "categories" within GO analysis. We only allow one of these to be run at a time, so we won't need that colored background. If it helps, just concentrate on the tan portion (for BP).
The vertical axis is the GO term. For D3, you'll see things like `d3.scale.ordinal`, which will take a list of categories/GO terms and make axes and labels for them.
You can see both "number of genes" (the x-axis) and "gene number" (the size of the points). These refer to the number of genes associated with a GO term and the number of genes in your list. For example, assume cell death is associated with 100 genes. In the gene list used as input, 80 of those are "significantly" different between normal and cancer tissue.
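As a rough picture of what an ordinal scale does for the vertical axis, here is a dependency-free stand-in that spreads a list of category labels evenly over a pixel range. It is only a sketch of the idea and not D3's implementation; `makePointScale` is a hypothetical name, and in the real component the D3 scale would be used instead.

```javascript
// Hypothetical stand-in for an ordinal "point" scale: each category gets an
// evenly spaced position between rangeStart and rangeEnd (in pixels).
function makePointScale(categories, rangeStart, rangeEnd) {
  const index = new Map(categories.map((c, i) => [c, i]));
  const step =
    categories.length > 1
      ? (rangeEnd - rangeStart) / (categories.length - 1)
      : 0;
  return (category) => {
    const i = index.get(category);
    if (i === undefined) return undefined; // unknown category
    // A single category is centered in the range.
    if (categories.length === 1) return (rangeStart + rangeEnd) / 2;
    return rangeStart + i * step;
  };
}

// Usage: three GO terms laid out over a 0..100 pixel range.
const yScale = makePointScale(["term A", "term B", "term C"], 0, 100);
```

Here `yScale("term B")` falls halfway down the axis, and the same list of labels can be reused for the axis tick text.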
The color of the points ("q-value") reflects how statistically significant that fraction of genes is. It can be a bit technical, but in addition to knowing the fraction (e.g. 80/100), we have a statistical test that is performed as part of topGO. That number will be presented via color.
Data output
Part of the output passed to the topGO component is a UUID which references a JSON-format file on the backend server. Currently, I use this data to populate the table. The data is formatted as a list of objects where each object looks like:
Note that there are two p-values since there are multiple ways to compute the statistics. More below.
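For orientation, a single record might be handled along these lines. Only `term`, `annotated`, `significant`, `elim_pval`, and `genelist` are field names that appear in this issue; `go_id` and `classic_pval`, and all of the values, are placeholders assumed for this sketch and may not match the actual file.

```javascript
// Assumed record shape. `go_id` and `classic_pval` are hypothetical names;
// every value below is a made-up placeholder.
const exampleRecord = {
  go_id: "GO:XXXXXXX",
  term: "cell death",
  annotated: 100, // genes associated with the term
  significant: 80, // of those, genes that are in the input list
  elim_pval: 0.00001,
  classic_pval: 0.0001,
  genelist: ["geneA", "geneB"], // truncated placeholder list
};

// Fraction of annotated genes that are significant, a number on [0, 1].
// Returns 0 when a term has no annotated genes, to avoid dividing by zero.
function enrichedFraction(record) {
  if (!record.annotated) return 0;
  return record.significant / record.annotated;
}
```

For the placeholder record above, `enrichedFraction(exampleRecord)` gives the 80/100 ratio used in the background example.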
Our plot
Referring to the data fields in the previous section and the plot example above:
- For the y-axis, we should show the `term` (since no one has things like "GO:0002250" memorized).
- The x-axis can be populated by `annotated`, the size of the points can be set based on `significant/annotated` (a number on $[0,1]$), and the color can be set by `elim_pval`, where smaller values should be "hotter" (e.g. red/yellow since that draws the eye).
- When the user hovers over the point, we can show more details. For instance, we could show a tooltip with:
As stated, we'll use the `elim` p-value since it tends to separate the points (and for other technical reasons). However, the users may also want to see the classic p-value.
We won't show the gene list, but we will have a button (just like the one currently in the table) which will allow users to take that list of genes (`genelist`) and create a "feature/gene set" which they can use elsewhere in WebMeV.
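The hover details and the gene-set button described in this section could be sketched roughly as below. The tooltip wording, the `classic_pval` field name, and the shape of the object returned by `toGeneSet` are all assumptions for illustration; the actual WebMeV feature-set format is not described in this issue.

```javascript
// Build the lines of text for a hover tooltip. The gene list is left out
// on purpose, as described above. `classic_pval` is a hypothetical field name.
function tooltipLines(record) {
  return [
    record.term,
    `elim p-value: ${record.elim_pval}`,
    `classic p-value: ${record.classic_pval}`,
    `significant/annotated: ${record.significant}/${record.annotated}`,
  ];
}

// Turn a record's gene list into a hypothetical "feature/gene set" payload.
// The { name, genes } shape is an assumption, not the WebMeV format.
function toGeneSet(record) {
  return { name: record.term, genes: [...record.genelist] };
}
```

Copying `genelist` in `toGeneSet` keeps later edits to the gene set from changing the plotted data.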