web-mev / mev-frontend

The front-end Angular 12 application for the MEV web application
MIT License

Create bubble plot for topGO output #30

Closed blawney closed 2 years ago

blawney commented 2 years ago

topGO visualization

The goal is to create an interactive plot as an alternative to the table-based view. This type of plot is common in academic publications, so we are following convention here.

Short background

"GO" stands for gene ontology. It's a collection of terms that can be represented as a tree. For instance:

[Image: go_tree]

Each node in that diagram is a GO "term". The terms are arranged in a hierarchy and linked by typed relationships (e.g. "is a" or "has a"). There's a huge online database of all the terms, with explanations, etc.

Associated with each GO term is a list of genes/proteins: we know from experiments that certain genes perform certain functions. The GO databases basically link each term (e.g. cell death) to a list of associated genes/proteins.

The analysis we chose (topGO) attempts to find GO terms which might be relevant for a particular dataset. The input is a list of genes, which perhaps originates from a "differential gene expression" analysis examining genes that differ between healthy and cancerous tissue. With that list of genes, we can look through the GO terms and run statistical tests to see whether each GO term is likely to be related. For example, consider a term (e.g. cell death) that has 100 genes associated with it. If 80 of those 100 genes are in our list (from the differential expression analysis), that's probably a sign that something is happening with cell death processes.
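To make the "80 out of 100" intuition concrete, here is a minimal sketch of a one-sided hypergeometric (Fisher-style) enrichment test. It illustrates the general idea only and is not topGO's actual implementation. The function names, the universe size (20000 genes) and the list size (2000 genes) are assumptions made up for the example.

```typescript
// Natural log of n!, computed by summing logs (fast enough for gene-count sizes).
function logFactorial(n: number): number {
  let total = 0;
  for (let i = 2; i <= n; i++) {
    total += Math.log(i);
  }
  return total;
}

// Natural log of the binomial coefficient C(n, k).
function logChoose(n: number, k: number): number {
  return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
}

// Probability of seeing at least `significant` genes from our list among the
// `annotated` genes of a term, if the `listSize` genes had been drawn at random
// from a `universe` of genes (upper tail of the hypergeometric distribution).
function enrichmentPValue(
  universe: number,
  annotated: number,
  listSize: number,
  significant: number
): number {
  const maxOverlap = Math.min(annotated, listSize);
  const logDenominator = logChoose(universe, listSize);
  let p = 0;
  for (let k = significant; k <= maxOverlap; k++) {
    // Skip overlaps that are impossible given the number of non-annotated genes.
    if (listSize - k > universe - annotated) {
      continue;
    }
    p += Math.exp(
      logChoose(annotated, k) +
        logChoose(universe - annotated, listSize - k) -
        logDenominator
    );
  }
  return Math.min(1, p);
}

// Overlap we would expect by chance alone.
function expectedOverlap(universe: number, annotated: number, listSize: number): number {
  return (annotated * listSize) / universe;
}

// Hypothetical numbers: 20000 genes overall, 2000 in our list, term has 100 genes.
console.log(expectedOverlap(20000, 100, 2000)); // 10 genes expected by chance
console.log(enrichmentPValue(20000, 100, 2000, 80)); // observing 80 gives a tiny p-value
```

Under those assumed numbers, chance alone would put about 10 of the term's 100 genes in our list, so seeing 80 is strong evidence of enrichment.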

A basic visualization

We see plots that typically look like this:

[Image: example bubble plot]

Data output

Part of the output passed to the topGO component is a UUID which references a JSON-format file on the backend server. Currently, I use this data to populate the table. The data is formatted as a list of objects where each object looks like:

{
    "go_id":"GO:0002250",
    "term":"adaptive immune response",
    "annotated":604,
    "significant":297,
    "expected":196.3,
    "fisher_rank":1,
    "classic_pval":3.7e-18,
    "elim_pval":7.4e-14,
    "genelist":["ENSG00000170017",...,"ENSG00000187997"]
}
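On the Angular side, a record like the one above could be typed roughly as follows. This is a sketch: the field names come from the example object, while the names `TopGoResult`, `isTopGoResult` and `parseTopGoResults` are hypothetical and not taken from the existing code base.

```typescript
// Shape of one topGO result, mirroring the fields of the example record.
interface TopGoResult {
  go_id: string;
  term: string;
  annotated: number;
  significant: number;
  expected: number;
  fisher_rank: number;
  classic_pval: number;
  elim_pval: number;
  genelist: string[];
}

// Runtime check that an arbitrary parsed JSON value has the expected shape.
function isTopGoResult(value: unknown): value is TopGoResult {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const record = value as Record<string, unknown>;
  const numericFields = [
    'annotated',
    'significant',
    'expected',
    'fisher_rank',
    'classic_pval',
    'elim_pval',
  ];
  const genelist = record['genelist'];
  return (
    typeof record['go_id'] === 'string' &&
    typeof record['term'] === 'string' &&
    numericFields.every((field) => typeof record[field] === 'number') &&
    Array.isArray(genelist) &&
    genelist.every((gene) => typeof gene === 'string')
  );
}

// Parse the file contents, keeping only well-formed records.
function parseTopGoResults(json: string): TopGoResult[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error('Expected a JSON list of topGO results');
  }
  return parsed.filter(isTopGoResult);
}
```

Silently dropping malformed records is a design choice made for brevity here; surfacing them to the user may be preferable in the real component.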

Note that there are two p-values since there are multiple ways to compute the statistics. More below.

Our plot

Referring to the data fields in the previous section and the plot example above:

For the y-axis, we should show the term (since no one has things like "GO:0002250" memorized).

The x-axis can be populated by annotated, the size of the points can be set based on significant/annotated (a number in $[0,1]$), and the color can be set by elim_pval, where smaller values should be "hotter" (e.g. red/yellow, since that draws the eye).
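The mapping described above could be sketched as a pure function that turns result rows into plot-ready points, independent of any charting library. The names `TopGoRow`, `BubblePoint` and `toBubblePoints`, the pixel radius range, and the blue-to-red HSL color ramp on a $-\log_{10}(p)$ scale are all assumptions chosen for illustration.

```typescript
// Only the fields the plot needs from each topGO record.
interface TopGoRow {
  term: string;
  annotated: number;
  significant: number;
  elim_pval: number;
}

interface BubblePoint {
  y: string; // GO term label
  x: number; // number of annotated genes
  radius: number; // pixels, scaled by significant/annotated
  color: string; // CSS color, hotter for smaller elim p-values
}

// -log10(p), guarded so that p = 0 does not produce Infinity.
function negLog10(p: number): number {
  return -Math.log10(Math.max(p, Number.MIN_VALUE));
}

function toBubblePoints(
  rows: TopGoRow[],
  minRadius: number = 4,
  maxRadius: number = 20
): BubblePoint[] {
  const scores = rows.map((row) => negLog10(row.elim_pval));
  const lowest = Math.min(...scores);
  const highest = Math.max(...scores);
  return rows.map((row) => {
    // Fraction of the term's genes that are in our list, a number in [0, 1].
    const fraction = row.annotated > 0 ? row.significant / row.annotated : 0;
    // 0 for the least significant term shown, 1 for the most significant.
    const heat =
      highest > lowest ? (negLog10(row.elim_pval) - lowest) / (highest - lowest) : 1;
    // Hue 240 is blue (cool) and hue 0 is red (hot).
    const hue = Math.round(240 * (1 - heat));
    return {
      y: row.term,
      x: row.annotated,
      radius: minRadius + fraction * (maxRadius - minRadius),
      color: `hsl(${hue}, 100%, 50%)`,
    };
  });
}
```

The color is scaled relative to the terms currently displayed, so the most significant term is always fully red. A fixed scale may be preferable if plots need to be compared across analyses.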

When the user hovers over the point, we can show more details. For instance, we could show a tooltip with:

As stated, we'll use the elim p-value since it tends to separate the points (and for other technical reasons). However, users may also want to see the classic p-value.
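A tooltip along these lines could be assembled by a small formatting helper. The choice of fields and their wording below is an assumption for illustration, as are the names `TooltipRow` and `formatTooltip`. The sketch shows both p-values, as suggested above.

```typescript
// Fields a tooltip might draw on (a subset of one topGO record).
interface TooltipRow {
  go_id: string;
  term: string;
  annotated: number;
  significant: number;
  expected: number;
  classic_pval: number;
  elim_pval: number;
}

// Build the tooltip as a list of text lines; the caller decides how to render them.
function formatTooltip(row: TooltipRow): string[] {
  const percent =
    row.annotated > 0 ? ((100 * row.significant) / row.annotated).toFixed(1) : '0.0';
  return [
    `${row.term} (${row.go_id})`,
    `Significant / annotated: ${row.significant} / ${row.annotated} (${percent}%)`,
    `Expected: ${row.expected}`,
    `elim p-value: ${row.elim_pval.toExponential(1)}`,
    `classic p-value: ${row.classic_pval.toExponential(1)}`,
  ];
}
```

Returning plain strings keeps the helper easy to unit test and leaves the markup to the Angular template.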

We won't show the gene list, but we will have a button (just like the one currently in the table) which will allow users to take that list of genes (genelist) and create a "feature/gene set" which they can use elsewhere in WebMeV.

blawney commented 2 years ago

Update:

The most important data is the p-value and the fraction of enriched genes. So let's change the arrangement of the plot to the following:

blawney commented 2 years ago

Couple other notes:

blawney commented 2 years ago

Added in 975d8b700e2b5444423dfed3ae48ee0ef0411083