Create bubble plot for topGO output

blawney commented 2 years ago

topGO visualization

The goal is to create an interactive plot as an alternative to the table-based view. This type of plot is common in academic publications, so we are following convention here.

Short background

"GO" stands for gene ontology. It's a collection of terms that can be represented as a tree. For instance:

[image: go_tree]

Each node in that tree is a GO "term". The terms are arranged in a hierarchy and are connected by typed relationships (e.g. "is a" or "has a"). There's a huge online database of all the terms with explanations, etc.

Associated with each GO term is a list of genes/proteins; that is, we know from experiments that certain genes perform certain functions. The GO databases basically link each term (e.g. cell death) to a list of associated genes/proteins.

The analysis we chose (topGO) attempts to find GO terms which might be relevant for a particular dataset. The input is a list of genes, which perhaps originates from a "differential gene expression" analysis examining genes that are different between healthy and cancerous tissue. With that list of genes, we can look through the GO terms and run statistical tests to see if it's probable that the GO term is related. For example, consider a term (e.g. cell death) that has 100 genes associated with it. Out of those 100 genes, if we see 80 that are in our list (from the differential expression analysis), that's probably a sign that something is happening with cell death processes.
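To make the "is it probable?" step concrete, here is a minimal sketch of an over-representation test, assuming a one-sided Fisher's exact test (a hypergeometric tail). topGO's real algorithms (especially elim) do more than this. The function names, the universe size (20000) and the input list size (2000) are all hypothetical; none of them come from the analysis itself.

```typescript
// Natural log of n!, by direct summation (fine for a sketch, slow for very large n).
function logFactorial(n: number): number {
  let sum = 0;
  for (let i = 2; i <= n; i++) sum += Math.log(i);
  return sum;
}

// Natural log of the binomial coefficient C(n, k).
function logChoose(n: number, k: number): number {
  return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}

// Number of list genes we would expect in the term by chance alone.
function expectedCount(universe: number, annotated: number, listSize: number): number {
  return (annotated * listSize) / universe;
}

// P(overlap >= observed) when `listSize` genes are drawn from `universe` genes,
// `annotated` of which belong to the GO term.
function enrichmentPValue(
  universe: number,
  annotated: number,
  listSize: number,
  observed: number
): number {
  const maxOverlap = Math.min(annotated, listSize);
  const logDenominator = logChoose(universe, listSize);
  let p = 0;
  for (let k = observed; k <= maxOverlap; k++) {
    // Skip overlaps that are impossible given the number of non-annotated genes.
    if (listSize - k > universe - annotated) continue;
    p += Math.exp(
      logChoose(annotated, k) +
        logChoose(universe - annotated, listSize - k) -
        logDenominator
    );
  }
  return Math.min(1, p);
}

// The cell death example: 100 annotated genes, 80 of them in our list.
// Universe and list size are made-up numbers for illustration only.
const expectedCellDeath = expectedCount(20000, 100, 2000); // 10 genes by chance
const pCellDeath = enrichmentPValue(20000, 100, 2000, 80);
```

Seeing 80 genes where about 10 would be expected by chance is what drives the p-value toward zero.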

A basic visualization

We see plots that typically look like this:

[image: example]

- Note that the plot shows BP, CC, and MF, which are broad "categories" within GO analysis. We only allow one of these to be run at a time, so we won't need that colored background. If it helps, just concentrate on the tan portion (for BP).
- The vertical axis is the GO term. For D3, you'll see things like d3.scale.ordinal (renamed to d3.scaleOrdinal, d3.scaleBand and d3.scalePoint in D3 v4 and later), which take a list of categories/GO terms and map them to positions so that axes and labels can be drawn for them.
- You can see both "number of genes" (the x-axis) and "gene number" (the size of the points). These refer, respectively, to the number of genes associated with a GO term and the number of those genes that are in your list. For example, assume cell death is associated with 100 genes. In the gene list used as input, 80 of those are "significantly" different between normal and cancer tissue.
  - For our plot, we can put the "total" genes on the x-axis (e.g. the 100 for cell death).
  - Instead of raw counts for "significant" genes in that GO term (e.g. 80), we can size the points by the fraction of the total (e.g. 0.8). This makes things more comparable since each GO term can have very different numbers of total genes. This way, the user's eye will be drawn to larger points, which represent a larger fraction of "significant" genes.
- The color of the points ("q-value") reflects how statistically significant the enrichment is. It can be a bit technical, but in addition to knowing the fraction (e.g. 80/100), we have a statistical test that is performed as part of topGO. The result of that test will be presented via color.
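For anyone unfamiliar with ordinal scales, the idea behind the vertical axis can be sketched in plain TypeScript. This is only an illustration of the concept, not D3's implementation; `makePointScale` is a hypothetical name, and in the real component we would use D3's own scale.

```typescript
// Builds a function that maps each category (GO term) to an evenly spaced
// pixel position between rangeStart and rangeEnd.
function makePointScale(
  categories: string[],
  rangeStart: number,
  rangeEnd: number
): (category: string) => number | undefined {
  // Remember the position index of each distinct category, in first-seen order.
  const index = new Map<string, number>();
  for (const category of categories) {
    if (!index.has(category)) index.set(category, index.size);
  }
  const n = index.size;
  // A single category sits in the middle of the range.
  const step = n > 1 ? (rangeEnd - rangeStart) / (n - 1) : 0;
  const start = n > 1 ? rangeStart : (rangeStart + rangeEnd) / 2;
  return (category: string) => {
    const i = index.get(category);
    return i === undefined ? undefined : start + i * step;
  };
}

// Three terms spread over a 200px tall plot area.
const yScale = makePointScale(
  ["adaptive immune response", "cell death", "cell cycle"],
  0,
  200
);
const yCellDeath = yScale("cell death"); // middle of the range
```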

Data output

Part of the output passed to the topGO component is a UUID which references a JSON-format file on the backend server. Currently, I use this data to populate the table. The data is formatted as a list of objects where each object looks like:

{
    "go_id":"GO:0002250",
    "term":"adaptive immune response",
    "annotated":604,
    "significant":297,
    "expected":196.3,
    "fisher_rank":1,
    "classic_pval":3.7e-18,
    "elim_pval":7.4e-14,
    "genelist":["ENSG00000170017",...,"ENSG00000187997"]
}

Note that there are two p-values since there are multiple ways to compute the statistics. More below.
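For the frontend, the object above could be typed roughly as follows. This is a sketch based only on the one example record; the names `TopGoRecord`, `isTopGoRecord` and `parseTopGoRecords` are hypothetical, and dropping malformed entries (rather than raising an error) is an assumption about what we want.

```typescript
// Shape of one entry in the list, taken from the example record.
interface TopGoRecord {
  go_id: string;
  term: string;
  annotated: number;
  significant: number;
  expected: number;
  fisher_rank: number;
  classic_pval: number;
  elim_pval: number;
  genelist: string[];
}

// Runtime check that an arbitrary parsed JSON value has that shape.
function isTopGoRecord(value: unknown): value is TopGoRecord {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const stringFields = ["go_id", "term"];
  const numberFields = [
    "annotated", "significant", "expected",
    "fisher_rank", "classic_pval", "elim_pval",
  ];
  const genelist = v["genelist"];
  return (
    stringFields.every((f) => typeof v[f] === "string") &&
    numberFields.every((f) => typeof v[f] === "number" && isFinite(v[f] as number)) &&
    Array.isArray(genelist) &&
    genelist.every((g: unknown) => typeof g === "string")
  );
}

// Parses the file contents and keeps only well-formed records.
function parseTopGoRecords(json: string): TopGoRecord[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) throw new Error("expected a JSON list of records");
  return parsed.filter(isTopGoRecord);
}
```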

Our plot

Referring to the data fields in the previous section and the plot example above:

For the y-axis, we should show the term (since no one has things like "GO:0002250" memorized).

The x-axis can be populated by annotated, the size of the points can be set based on significant/annotated (a number on $[0,1]$), and the color can be set by elim_pval where smaller values should be "hotter" (e.g. red/yellow since that draws the eye).
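A rough sketch of that mapping is shown here. The names `BubblePoint`, `toBubblePoint` and `fractionToRadius` are hypothetical, and so are the pixel radii. Scaling the circle's area (rather than its radius) with the fraction is an assumption on my part; it is a common choice for bubble plots, but either could work.

```typescript
// What each plotted point needs, derived from one record.
interface BubblePoint {
  label: string;      // y-axis category (the term)
  x: number;          // annotated
  fraction: number;   // significant / annotated, between 0 and 1
  colorValue: number; // elim_pval (smaller should be drawn "hotter")
}

function toBubblePoint(r: {
  term: string;
  annotated: number;
  significant: number;
  elim_pval: number;
}): BubblePoint {
  // Guard against division by zero for a term with no annotated genes.
  const fraction = r.annotated > 0 ? r.significant / r.annotated : 0;
  return { label: r.term, x: r.annotated, fraction, colorValue: r.elim_pval };
}

// Converts a fraction to a circle radius so that the circle's AREA grows
// linearly with the fraction, from minRadius (fraction 0) to maxRadius (fraction 1).
function fractionToRadius(fraction: number, minRadius = 3, maxRadius = 20): number {
  const f = Math.min(1, Math.max(0, fraction));
  const minArea = minRadius * minRadius;
  const maxArea = maxRadius * maxRadius;
  return Math.sqrt(minArea + f * (maxArea - minArea));
}

const point = toBubblePoint({
  term: "adaptive immune response",
  annotated: 604,
  significant: 297,
  elim_pval: 7.4e-14,
});
const radius = fractionToRadius(point.fraction);
```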

When the user hovers over the point, we can show more details. For instance, we could show a tooltip with:

- GO ID
- term (again)
- annotated
- significant (and show them the number significant/annotated)
- Classic p-val
- Elim method p-val

As stated, we'll use the elim p-value since it tends to separate the points (and for other technical reasons). However, the users may also want to see the classic p-value.
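The tooltip content could be assembled along these lines. The function name `tooltipLines`, the label wording and the number formatting (two decimals for the fraction, one-digit scientific notation for p-values) are all just suggestions.

```typescript
// Builds the lines of text shown when hovering over a point.
function tooltipLines(r: {
  go_id: string;
  term: string;
  annotated: number;
  significant: number;
  classic_pval: number;
  elim_pval: number;
}): string[] {
  const fraction = r.annotated > 0 ? r.significant / r.annotated : 0;
  return [
    `GO ID: ${r.go_id}`,
    `Term: ${r.term}`,
    `Annotated: ${r.annotated}`,
    `Significant: ${r.significant} (${fraction.toFixed(2)} of annotated)`,
    `Classic p-value: ${r.classic_pval.toExponential(1)}`,
    `Elim p-value: ${r.elim_pval.toExponential(1)}`,
  ];
}

const lines = tooltipLines({
  go_id: "GO:0002250",
  term: "adaptive immune response",
  annotated: 604,
  significant: 297,
  classic_pval: 3.7e-18,
  elim_pval: 7.4e-14,
});
```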

We won't show the gene list, but we will have a button (just like the one currently in the table) which will allow users to take that list of genes (genelist) and create a "feature/gene set" which they can use elsewhere in WebMeV.

blawney commented 2 years ago

Update:

The most important data is the p-value and the fraction of enriched genes. So let's change the arrangement of the plot to the following:

- On the horizontal/x-axis, let's plot the p-value (elim). Here, we want to show more significant values off to the right. Hence, instead of plotting elim directly, plot -log10(elim). This transformation takes really small values to large values (e.g. 0.00001 goes to 5 and 0.01 goes to 2).
- For the size of the points, let's use the fraction of enriched genes, significant/annotated.
- Make the color based on the total size of the set (annotated). Warmer/hotter colors are for larger sets.
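The updated arrangement can be sketched as follows, using a base-10 logarithm so the numbers match the examples above. The names `negLog10` and `toUpdatedPoint` are hypothetical. The floor for a p-value of exactly 0 is also an assumption, added so the point stays at a finite position on the axis.

```typescript
// -log10(p), with p clamped to a tiny floor so that p = 0 does not give Infinity.
function negLog10(p: number, floor = 1e-300): number {
  if (Number.isNaN(p) || p < 0 || p > 1) {
    throw new RangeError(`p-value out of range: ${p}`);
  }
  return -Math.log10(Math.max(p, floor));
}

interface UpdatedPoint {
  label: string;        // term, on the y-axis
  x: number;            // -log10(elim_pval): more significant is further right
  sizeFraction: number; // significant / annotated
  colorValue: number;   // annotated: larger sets get warmer colors
}

function toUpdatedPoint(r: {
  term: string;
  annotated: number;
  significant: number;
  elim_pval: number;
}): UpdatedPoint {
  return {
    label: r.term,
    x: negLog10(r.elim_pval),
    sizeFraction: r.annotated > 0 ? r.significant / r.annotated : 0,
    colorValue: r.annotated,
  };
}

const updated = toUpdatedPoint({
  term: "adaptive immune response",
  annotated: 604,
  significant: 297,
  elim_pval: 7.4e-14,
});
```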

blawney commented 2 years ago

Couple other notes:

- Near the "Annotated" legend, let's put a tooltip (with the "info" symbol) that says: "The number of annotated genes refers to the total number of genes associated with a particular GO term. Note that the colors are log-scaled for improved dynamic range."
- Near the "significant/annotated" legend, the tooltip should say: "This provides the fraction of annotated genes which are deemed significant by the selected threshold for differential expression."
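For reference, "log-scaled" colors could work roughly like the sketch below: normalize log(annotated) to the range 0 to 1, then interpolate between a cool and a warm color. The names `logNormalize`, `warmColor` and `annotatedColors` and the two endpoint colors are hypothetical, and a plain linear RGB blend is a simplification of whatever color scale we end up using.

```typescript
// Maps value to [0, 1] on a log scale between min and max (all must be > 0).
function logNormalize(value: number, min: number, max: number): number {
  if (min <= 0 || value <= 0 || max < min) {
    throw new RangeError("log scaling needs positive values and min <= max");
  }
  if (max === min) return 0.5; // a single distinct value sits mid-scale
  const t = (Math.log(value) - Math.log(min)) / (Math.log(max) - Math.log(min));
  return Math.min(1, Math.max(0, t));
}

// Hypothetical endpoint colors: pale yellow for small sets, dark red for large ones.
const COOL_RGB: [number, number, number] = [255, 255, 178];
const WARM_RGB: [number, number, number] = [189, 0, 38];

// Linear blend between the two endpoint colors; t = 0 is cool, t = 1 is warm.
function warmColor(t: number): string {
  const c = COOL_RGB.map((lo, i) => Math.round(lo + t * (WARM_RGB[i] - lo)));
  return `rgb(${c[0]}, ${c[1]}, ${c[2]})`;
}

// One color per term, scaled against the smallest and largest set in the data.
function annotatedColors(annotated: number[]): string[] {
  const min = Math.min(...annotated);
  const max = Math.max(...annotated);
  return annotated.map((a) => warmColor(logNormalize(a, min, max)));
}

const colors = annotatedColors([10, 100, 1000]);
```

Without the log step, a single term with thousands of annotated genes would push nearly every other term to the cool end of the scale.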

blawney commented 2 years ago

Added in 975d8b700e2b5444423dfed3ae48ee0ef0411083

web-mev / mev-frontend