x-atlas-consortia / ubkg-api

A web service for the Unified Biomedical Knowledge Graph system
MIT License

Add secondary assay types for SenNet #26

Closed maxsibilla closed 1 year ago

maxsibilla commented 1 year ago

Currently in SenNet we only support primary data types, but we need to add secondary data types that are supported by the dataset pipeline processing. Some of these secondary data types include "snRNAseq-10xGenomics-v3" and "salmon_sn_rnaseq_10x".

My proposal is to support all HuBMAP data types for which primary is false, but this needs further confirmation.

AlanSimmons commented 1 year ago

@maxsibilla @shirey

A number of the HuBMAP secondary assay types in the assay_types.yaml file are artifacts that have been deprecated (by which I mean "no longer present when you search the provenance database").

I recommend confirming which HuBMAP pipelines SenNet has explicitly stated it will support, and adding data on only those assays to the SenNet ontology. Otherwise, we may just be copying over old HuBMAP cruft.

maxsibilla commented 1 year ago

@AlanSimmons Thank you for pointing that out. Yes, we should definitely get a more fine-grained list then.

AlanSimmons commented 1 year ago

I asked Brendan to identify the set of pipelines that are common to HuBMAP and SenNet.

AlanSimmons commented 1 year ago

After the April 26 CODCC meeting, we agreed to add all current pipeline information from the HuBMAP application ontology to the SenNet application ontology.

AlanSimmons commented 1 year ago

I added to the SENNET ontology information on all datasets (assay types), both primary and derived, from HuBMAP.

AlanSimmons commented 1 year ago

Validated additions using local instance of neo4j. CSVs submitted for update to prod instances.

maxsibilla commented 1 year ago

> I added to the SENNET ontology information on all datasets (assay types), both primary and derived, from HuBMAP.

@AlanSimmons I don't think all primary assay types should be added from HuBMAP since we have already gone through the process of curating which ones SenNet will support. The current ~17 primary assay types are all we need. Adding additional ones will modify the UI in the portal.

AlanSimmons commented 1 year ago

@maxsibilla I removed from the SenNet ontology all primary and derived assay types except for the following:

| Primary | Secondary |
| --- | --- |
| Bulk RNA-seq | salmon_rnaseq_bulk |
| CITE-seq | |
| CODEX | codex_cytokit |
| CODEX | codex_cytokit_v1 |
| CosMX (RNA) | |
| DBiT-seq | |
| FACS - Fluorescence-activated Cell Sorting | |
| LC-MS | |
| Lightsheet | |
| Mint-ChIP | |
| SASP | |
| Stained Slides | |
| Visium | |
| bulk-RNA | salmon_rnaseq_bulk |
| scRNA-seq | salmon_rnaseq_10x |
| snATAC-seq | sn_atac_seq |
| snRNA-seq | |

bhonick commented 1 year ago

Mappings from assay_types.yaml in HuBMAP's search-api repo: https://docs.google.com/spreadsheets/d/1sEfBm5P4VaNujVLxGHvOpy0qaccqfEwQ1PbDgc4TtwU/edit#gid=0

bhonick commented 1 year ago

An idea for Alan's table. The precise syntax is up to him.

Primary: snATACseq -> Secondary: snATAC-seq [SnapATAC] -- snapatac_atacseq?
Primary: snRNAseq -> Secondary: snRNAseq [Salmon] -- salmon_rnaseq?

maxsibilla commented 1 year ago

@AlanSimmons did salmon_rnaseq_10x make it into the update CSVs?

AlanSimmons commented 1 year ago

@maxsibilla @shirey @bhonick

Not yet. I need an answer to a question, first.

Statement of problem

salmon_rnaseq_10x is the assay_type for datasets that display in Data Portal as scRNA-seq (10x Genomics) [Salmon]. In HuBMAP, this dataset type is derived from one of the following primary assay data types:

| assay_type | Display name in Portal | Example |
| --- | --- | --- |
| scRNAseq-10xGenomics-v2 | scRNA-seq (10x Genomics v2) | HBM493.GFHZ.686 |
| scRNAseq-10xGenomics-v3 | scRNA-seq (10x Genomics v3) | HBM233.CCCX.767 |

You can see full results by running a search with the request body at the bottom of this message. In the HuBMAP ontology, the salmon_rnaseq_10x assay_type has the relationship _is_derivedfrom with the primary assay types listed above. This means that the application ontology associates a derived dataset type with a primary dataset type.

Currently, the two primary assay types from HuBMAP listed above are not in the SenNet ontology. The closest assay_type is scRNA-seq. This was in the original assay_types.yaml file for SenNet but not in the HuBMAP assay_types.yaml, not even as an alt-name.

It looks like scRNA-seq is a new assay type for SenNet, different from the scRNA-seq data types in HuBMAP.

We can associate salmon_rnaseq_10x with scRNA-seq in the SenNet ontology and remove the associations with the scRNAseq-10xGenomics-v2 and scRNAseq-10xGenomics-v3 from HuBMAP. However, if any SenNet primary assays are associated with scRNAseq-10xGenomics-v2 or scRNAseq-10xGenomics-v3, we would need to include both of these in SenNet, too.
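The two options above can be sketched in Python. This is only an illustration, not the UBKG implementation: the dictionary and set are hypothetical in-memory stand-ins for the application ontologies, populated only with assay types named in this thread, and the function names are made up for the example.

```python
# Hypothetical stand-in for the derived-from relationships in the HuBMAP ontology.
HUBMAP_DERIVED_FROM = {
    "salmon_rnaseq_10x": {"scRNAseq-10xGenomics-v2", "scRNAseq-10xGenomics-v3"},
}

# Hypothetical subset of the SenNet primary assay types (not the full list).
SENNET_PRIMARY = {"scRNA-seq", "snRNA-seq", "snATAC-seq"}


def missing_primaries(derived_from, primary_types):
    """Map each derived type to the primary types it needs that are absent."""
    missing = {}
    for derived, primaries in derived_from.items():
        absent = sorted(primaries - primary_types)
        if absent:
            missing[derived] = absent
    return missing


def reassociate(derived_from, derived, new_primary):
    """Return a copy in which `derived` is derived only from `new_primary`."""
    updated = {key: set(value) for key, value in derived_from.items()}
    updated[derived] = {new_primary}
    return updated
```

Copying the HuBMAP relationships unchanged leaves salmon_rnaseq_10x pointing at two primary types that SenNet lacks, while `reassociate(HUBMAP_DERIVED_FROM, "salmon_rnaseq_10x", "scRNA-seq")` yields a mapping with nothing missing.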

The question for somebody (PSC? CMU?) is:

Will the scRNAseq-10xGenomics-v2 or scRNAseq-10xGenomics-v3 assay types be used as data_type for primary datasets in SenNet?


https://search.api.hubmapconsortium.org/v3/search

{
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase": {
                        "data_types.keyword":"salmon_rnaseq_10x"
                    }
                }
            ]
        }
    },
    "aggs": {
        "fieldvals": {
            "terms": {
                "field": "ancestors.data_types.keyword",
                "size": 60
            }
        }
    }
}
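
A minimal sketch of running this search from Python with only the standard library. It assumes the endpoint accepts a JSON POST and returns a standard Elasticsearch response with the aggregation under `aggregations.fieldvals.buckets`; whether an auth token is required is not covered here. The function names are hypothetical.

```python
import json
import urllib.request

SEARCH_URL = "https://search.api.hubmapconsortium.org/v3/search"


def build_query(data_type, size=60):
    """Build the request body shown above for a given derived data type."""
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"data_types.keyword": data_type}}]
            }
        },
        "aggs": {
            "fieldvals": {
                "terms": {"field": "ancestors.data_types.keyword", "size": size}
            }
        },
    }


def ancestor_counts(response):
    """Extract {ancestor data type: document count} from the parsed response."""
    buckets = response.get("aggregations", {}).get("fieldvals", {}).get("buckets", [])
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def run_search(data_type, url=SEARCH_URL):
    """POST the query and return the ancestor data type counts (needs network)."""
    body = json.dumps(build_query(data_type)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return ancestor_counts(json.load(reply))
```

Calling `run_search("salmon_rnaseq_10x")` would then list the ancestor data types of the matching datasets, assuming the response has the shape described.
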
AlanSimmons commented 1 year ago

@SamSedivy FYI

AlanSimmons commented 1 year ago

@maxsibilla For now, the salmon_rnaseq_10x derived dataset type will be associated with the scRNA-seq primary dataset type. There is a potential issue with the display name for the dataset in SenNet.

In HuBMAP, the display name of the dataset with data type salmon_rnaseq_10x is scRNA-seq (10x Genomics) [Salmon]. The display name implicitly indicates that the associated primary dataset is a 10x Genomics dataset--i.e., a dataset with one of the following display names:

scRNA-seq (10x Genomics v2)
scRNA-seq (10x Genomics v3)

The application ontology for SenNet currently does not include a primary dataset with one of these display names. Instead, SenNet contains a primary dataset with data type scRNA-seq and display name scRNA-seq. In other words, the primary dataset will not have a display name that includes the term '10x'.

It is, of course, possible to change the display names of items in the SenNet ontology--i.e., to change the display name of the SenNet dataset with type salmon_rnaseq_10x to just scRNA-seq [Salmon]. If we assume that all scRNA-seq assays are based on 10x, then this would not be a problem.
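
The naming pattern in question can be illustrated with a small Python sketch. It assumes that a derived dataset's display name is the primary display name followed by the pipeline name in square brackets, which is inferred from the examples in this thread rather than taken from the ontology; the function name is hypothetical.

```python
def derived_display_name(primary_display_name, pipeline):
    """Compose a derived display name in the form 'Primary [Pipeline]'."""
    return f"{primary_display_name} [{pipeline}]"
```

With the HuBMAP-style primary name, `derived_display_name("scRNA-seq (10x Genomics)", "Salmon")` gives `scRNA-seq (10x Genomics) [Salmon]`; with the SenNet primary name, `derived_display_name("scRNA-seq", "Salmon")` gives `scRNA-seq [Salmon]`.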

AlanSimmons commented 1 year ago

Changes published to prod.