x-atlas-consortia / ubkg-neo4j

A container implementation to serve the Unified Biomedical Knowledge Graph in Neo4j
MIT License

Enhancements to support "soft assay type": "Dataset Type" hierarchy #40

Closed: AlanSimmons closed this issue 1 month ago

AlanSimmons commented 8 months ago

Statement of problem

Datasets are being reorganized to support dynamic pipeline assignment, also known as "soft assay type".


In the old infrastructure ("hard assay types", I guess), a static data_type (also called assay_type) was created for every new combination of primary dataset and pipeline. For example, salmon_rnaseq_10x_v2 is the data_type for the output of a pipeline that processes primary datasets with data_type scRNAseq-genomics-v2.

In the new infrastructure,

  1. Datasets will be grouped into "Dataset Types", as defined in Column D of the Reference document.
  2. Instead of static values of data_type, a Rules Engine will dynamically characterize a dataset based on the Dataset Type and other ingest metadata.

Proposed solution

  1. Create a new Dataset Type hierarchy, with a parent node named Dataset Type and child nodes for each of the types defined in Column D of the Reference.
  2. Add all existing datasets to the corresponding Dataset Type node.
  3. Define a new data_type node that will indicate that the corresponding dataset has a soft assay type. Assign this data_type to all new dataset types.
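The first two steps of the proposal can be sketched as plain parent/child triples. This is only an illustration in Python, not how UBKG is actually built: the function names are hypothetical, and the child names are a handful of Soft Assay Dataset Type values taken from the mapping tables in this issue rather than from Column D of the Reference document itself.

```python
# Hypothetical sketch: these strings are display names, not UBKG codes.
PARENT = "Dataset Type"

# A few Dataset Types that appear in the mapping tables of this issue.
DATASET_TYPES = ["RNASeq", "ATACSeq", "CODEX", "LC-MS", "Histology"]


def build_hierarchy(parent, children):
    """Return (child, 'isa', parent) triples for a one-level hierarchy."""
    return [(child, "isa", parent) for child in children]


def assign_datasets(assignments):
    """Group v1 data_type values under their Dataset Type.

    assignments: iterable of (data_type, dataset_type) pairs.
    """
    grouped = {}
    for data_type, dataset_type in assignments:
        grouped.setdefault(dataset_type, []).append(data_type)
    return grouped


edges = build_hierarchy(PARENT, DATASET_TYPES)
grouped = assign_datasets([
    ("salmon_rnaseq_bulk", "RNASeq"),
    ("bulk-RNA", "RNASeq"),
    ("codex_cytokit", "CODEX"),
])
```

Here `edges` holds one triple per Dataset Type, and `grouped` shows which existing data_type values would hang off each Dataset Type node.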
AlanSimmons commented 8 months ago

Change log - HUBMAP

EFO Alignment (tangential to soft assay)

The following change was required to allow for multiple dataset hierarchies.

Nodes in range HUBMAP:C003000 to HUBMAP:C003032 had been created to organize datasets hierarchically at the request of a group led by Katy Börner, described as an attempt to align assays with EFO. The stated use case was a special Sankey chart. This hierarchy has yet to be used.

I created a new parent node for this hierarchy (HUBMAP:C002099), named EFO Alignment Dataset Hierarchy. I also related the first-level child nodes (e.g., HUBMAP:C003000) to HUBMAP:C002099. All existing "Dataset" nodes (children of HUBMAP:C000004) are also in the EFO Alignment Dataset Hierarchy.

Soft Assay Dataset Category, Soft Assay Dataset Type

Existing (v1) Dataset data_type to Soft Assay Dataset Type mappings

In v1, each dataset is associated with a data_type (also called assay_type). The following are the associations between existing datasets and the new Soft Assay Dataset Type, based on the v1 values of data_type.

The existing set of v1 values of data_type can be obtained with the API endpoint: https://ontology.api.hubmapconsortium.org/valueset?parent_sab=HUBMAP&parent_code=C004001&child_sabs=HUBMAP
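As a rough sketch, the URL above can be assembled and queried from Python's standard library. The helper names `valueset_url` and `fetch_valueset` are hypothetical, and the assumption that the endpoint returns JSON is mine, not something stated in this issue.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://ontology.api.hubmapconsortium.org/valueset"


def valueset_url(sab, parent_code="C004001"):
    """Build the valueset URL for one source (e.g. HUBMAP or SENNET)."""
    params = {"parent_sab": sab, "parent_code": parent_code, "child_sabs": sab}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_valueset(sab):
    """Fetch the v1 data_type values; performs a network request.

    Assumption: the response body is JSON. Its exact shape is not
    described in this issue, so it is returned unprocessed.
    """
    with urlopen(valueset_url(sab), timeout=30) as response:
        return json.load(response)
```

Calling `valueset_url("HUBMAP")` reproduces the URL quoted above; `fetch_valueset("HUBMAP")` would then issue the actual request.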

Datasets in bold were not mentioned in the Pipeline Decision Rules spreadsheet.

| v1 Display Name in Portal | data_type | alt-names | Primary/Processed | Soft Assay Dataset Type |
| --- | --- | --- | --- | --- |
| Visium | Visium | none | Primary | Visium |
| sciRNA-seq [Salmon] | salmon_rnaseq_sciseq | none | Processed | RNASeq |
| Bulk RNA-seq [Salmon] | salmon_rnaseq_bulk | none | Processed | RNASeq |
| scRNA-seq (10x Genomics) [Salmon] | salmon_rnaseq_10x | none | Processed | RNASeq |
| snRNA-seq [Salmon] Dataset | salmon_sn_rnaseq_10x | salmon_rnaseq_10x_sn | Processed | RNASeq |
| sciATAC-seq [SnapATAC] | sc_atac_seq_sci | none | Primary | ATACSeq |
| Bulk ATAC-seq [BWA + MACS2] | bulk_atacseq | none | Processed | ATACSeq |
| CODEX [Cytokit + SPRM] | codex_cytokit | none | Processed | CODEX |
| CODEX [Cytokit + SPRM] | codex_cytokit_v1 | none | Processed | CODEX |
| 10x Multiome | 10X Multiome | none | Primary | 10x Multiome |
| Multiplexed IF Microscopy | MxIF | none | Primary | CyCIF |
| Cell DIVE | cell-dive | cell DIVE, Cell DIVE | Primary | Cell DIVE |
| CellDIVE [DeepCell + SPRM] | celldive_deepcell | none | Processed | Cell DIVE |
| PAS Stained Microscopy | PAS | PAS microscopy | Primary | Histology |
| NanoDESI IMS | NanoDESI | none | Primary | DESI |
| Multiplex Ion Beam Imaging | MIBI | Multiplex Ion Beam Imaging, mibi | Primary | MIBI |
| Multiplex Ion Beam Imaging [DeepCell + SPRM] | mibi_deepcell | none | Processed | MIBI |
| Imaging Mass Cytometry (2D) | IMC | 2D-IMC, Imaging Mass Cytometry | Primary | 2D Imaging Mass Cytometry |
| LC-MS | LC-MS | none | Primary | LC-MS |
| Label-free LC-MS | lc-ms_label-free | none | Primary | LC-MS |
| Labeled LC-MS | lc-ms_labeled | none | Primary | LC-MS |
| Label-free LC-MS/MS | lc-ms-ms_label-free | none | Primary | LC-MS |
| Untargeted LC-MS | LC-MS-untargeted | none | Primary | LC-MS |
| Bulk RNA-seq | bulk-RNA | none | Primary | RNASeq |
| snATACseq (SNARE-seq2) | SNARE-ATACseq2 | SNAREseq, SNARE-seq2, SNARE2-ATACseq | Primary | ATACSeq |
| snRNAseq (SNARE-seq2) | SNARE-RNAseq2 | SNARE2-RNAseq | Primary | RNASeq |
| snATAC-seq (SNARE-seq2) [Lab Processed] | sc_atac_seq_snare_lab | none | Primary | ATACSeq |
| snRNA-seq (SNARE-seq2) [Lab Processed] | sc_rna_seq_snare_lab | none | Primary | RNASeq |
| snRNA-seq (SNARE-seq2) [Salmon] | salmon_rnaseq_snareseq | none | Processed | RNASeq |
| snATAC-seq (SNARE-seq2) [SnapATAC] | sc_atac_seq_snare | none | Processed | ATACSeq |
| scRNA-seq (10x Genomics v2) | scRNAseq-10xGenomics-v2 | none | Primary | RNASeq |
| scRNA-seq (10x Genomics v3) | scRNAseq-10xGenomics-v3 | none | Primary | RNASeq |
| sciATAC-seq | sciATACseq | none | Primary | ATACSeq |
| sciRNA-seq | sciRNAseq | none | Primary | RNASeq |
| snATAC-seq | snATACseq | none | Primary | ATACSeq |
| snATAC-seq [SnapATAC] | sn_atac_seq | sn_atac_seq_multiome_10x | Primary | ATACSeq |
| snRNA-seq (10x Genomics v2) | snRNAseq-10xGenomics-v2 | snRNAseq-v2 | Primary | RNASeq |
| snRNA-seq (10x Genomics v3) | snRNAseq-10xGenomics-v3 | snRNAseq, snRNAseq-v3 | Primary | RNASeq |
| Slide-seq | Slide-seq | none | Primary | RNASeq |
| Slide-seq [Salmon] | salmon_rnaseq_slideseq | none | Processed | RNASeq |
| Targeted Shotgun / Flow-injection LC-MS | Targeted-Shotgun-LC-MS | none | Primary | LC-MS |
| TMT LC-MS | TMT-LC-MS | none | Primary | LC-MS |
| LC-MS Bottom Up | LC-MS_bottom_up | LC-MS Bottom-Up | Primary | LC-MS |
| LC-MS Top Down | LC-MS_top_down | LC-MS Top-Down | Primary | LC-MS |
| Autofluorescence Microscopy | AF | none | Primary | Auto-fluorescence |
| Lightsheet Microscopy | Lightsheet | none | Primary | Lightsheet |
| Bulk ATAC-seq | ATACseq-bulk | bulkATACseq | Primary | ATACSeq |
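One way to use the mapping above is as a lookup keyed on both the data_type and its alt-names. The sketch below is illustrative only: it hard-codes five rows from the HUBMAP table, the helper names are hypothetical, and it assumes exact, case-sensitive matching, which may not be how the Rules Engine or UBKG resolves names.

```python
# (data_type, alt-names, Primary/Processed, Soft Assay Dataset Type)
# A small excerpt of the HUBMAP mapping table, copied by hand.
ROWS = [
    ("salmon_rnaseq_bulk", [], "Processed", "RNASeq"),
    ("cell-dive", ["cell DIVE", "Cell DIVE"], "Primary", "Cell DIVE"),
    ("SNARE-RNAseq2", ["SNARE2-RNAseq"], "Primary", "RNASeq"),
    ("IMC", ["2D-IMC", "Imaging Mass Cytometry"], "Primary",
     "2D Imaging Mass Cytometry"),
    ("ATACseq-bulk", ["bulkATACseq"], "Primary", "ATACSeq"),
]


def build_index(rows):
    """Map every data_type and alt-name to its Soft Assay Dataset Type."""
    index = {}
    for data_type, alt_names, _kind, dataset_type in rows:
        for name in [data_type, *alt_names]:
            index[name] = dataset_type
    return index


INDEX = build_index(ROWS)


def soft_assay_dataset_type(name):
    """Return the Dataset Type for a v1 name, or None if it is unmapped."""
    return INDEX.get(name)
```

A name that is not in the excerpt, including any genuinely unmapped data_type, simply returns `None`.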

Unmapped existing v1 data_type

| v1 Display Name in Portal | data_type | alt-names | Primary/Processed |
| --- | --- | --- | --- |
| image_pyramid | image_pyramid | none | n/a |
| Imaging Mass Cytometry (3D) | IMC3D | 3D-IMC, 3D Imaging Mass Cytometry | Primary |
| NanoPOTS | NanoPOTS | none | Primary |
| seqFISH | seqFISH | none | Primary |
| seqFISH [Lab Processed] | seqFish_lab_processed | none | Primary |
| Whole Genome Sequencing | WGS | none | Primary |
| MS | MS | none | Primary |
| MS Bottom Up | MS_bottom_up | MS Bottom-Up | Primary |
| MS_top_down | MS_top_down | MS Top-Down | Primary |
| Publication | publication | none | n/a |
| Publication ancillary | publication_ancillary | none | n/a |
| GeoMX | GeoMX | none | Primary |
| Kaggle-1 Glomerulus Segmentation Dataset | pas_ftu_segmentation | none | Processed |

New pipelines from Column B

In v1, these would have been treated as values of data_type for derived datasets: a new Dataset node would have been created in UBKG and mapped to a Dataset Data Type node whose value is the new pipeline.

The plan is not to add information on these pipelines to UBKG.

| Pipeline | Dataset Type |
| --- | --- |
| salmon_rnaseq_10x_v2 | RNASeq |
| salmon_rnaseq_10x_v2_sn | RNASeq |
| sc_atac_seq_sn | ATACSeq |
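Since these pipelines are not going into UBKG, a consumer that still needs the association could keep it as a small table of its own. A minimal sketch, assuming a plain in-memory dictionary; the variable and function names are hypothetical.

```python
# Pipeline -> Dataset Type, copied from the table above.
PIPELINE_DATASET_TYPE = {
    "salmon_rnaseq_10x_v2": "RNASeq",
    "salmon_rnaseq_10x_v2_sn": "RNASeq",
    "sc_atac_seq_sn": "ATACSeq",
}


def pipelines_for(dataset_type):
    """Return the pipelines listed for a Dataset Type, sorted by name."""
    return sorted(
        pipeline
        for pipeline, mapped_type in PIPELINE_DATASET_TYPE.items()
        if mapped_type == dataset_type
    )
```

For example, `pipelines_for("RNASeq")` lists the two Salmon pipelines, and an unknown Dataset Type yields an empty list.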

New Datasets from Column C

These have not yet been defined in UBKG.

Overloaded concepts

Node terms in the SimpleKnowledge spreadsheet used to build HUBMAP have to be unique. A consequence of this is that some nodes have multiple types. (Note: "Dataset Data Type" is a kind of node that corresponds to the v1 data_type property.)
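To illustrate what "overloaded" means here, the sketch below groups node types by term and reports terms that carry more than one type. The input rows are made up for the example (the issue does not list the overloaded nodes), and `overloaded_terms` is a hypothetical helper, not part of the SimpleKnowledge tooling.

```python
from collections import defaultdict


def overloaded_terms(rows):
    """Return {term: sorted node types} for terms with more than one type.

    rows: iterable of (term, node_type) pairs.
    """
    types_by_term = defaultdict(set)
    for term, node_type in rows:
        types_by_term[term].add(node_type)
    return {
        term: sorted(types)
        for term, types in types_by_term.items()
        if len(types) > 1
    }


# Hypothetical rows: "Visium" is used only as an example of a term that
# could serve as both a Dataset and a Dataset Data Type.
example_rows = [
    ("Visium", "Dataset"),
    ("Visium", "Dataset Data Type"),
    ("salmon_rnaseq_bulk", "Dataset Data Type"),
]
result = overloaded_terms(example_rows)
```

With these rows, `result` contains only the term that appears under two node types.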

Other changes

AlanSimmons commented 8 months ago

Change Log - SENNET

EFO Alignment (tangential to soft assay)

See Change Log for HUBMAP

Soft Assay Dataset Category, Soft Assay Dataset Type

Existing (v1) Dataset data_type to Soft Assay Dataset Type mappings

Datasets in bold were not mentioned in the Pipeline Decision Rules spreadsheet.

Note: to obtain the current set of v1 data_type values, use this URL: https://ontology.api.hubmapconsortium.org/valueset?parent_sab=SENNET&parent_code=C004001&child_sabs=SENNET

| v1 Display Name in Portal | data_type | alt-names | Primary/Processed | Soft Assay Dataset Type |
| --- | --- | --- | --- | --- |
| Visium | Visium | none | Primary | Visium |
| Bulk RNA-seq [Salmon] | salmon_rnaseq_bulk | none | Primary | RNASeq |
| scRNA-seq (10x Genomics) [Salmon] | salmon_rnaseq_10x | | Primary | RNASeq |
| CODEX [Cytokit + SPRM] | codex_cytokit | none | Processed | CODEX |
| CODEX [Cytokit + SPRM] | codex_cytokit_v1 | none | Processed | CODEX |
| Multiplex Ion Beam Imaging | MIBI | Multiplex Ion Beam Imaging, mibi | Primary | MIBI |
| LC-MS | LC-MS | none | Primary | LC-MS |
| Lightsheet Microscopy | Lightsheet | none | Primary | Lightsheet |
| Bulk RNA-seq | bulk-RNA | none | Primary | RNASeq |
| snATAC-seq | snATAC-seq | none | Primary | ATACSeq |
| snRNA-seq | snRNA-seq | none | Primary | RNASeq |
| H&E Slide Staining | Stained Slides | none | Primary | Histology |
| Multiplex Ion Beam Imaging [DeepCell + SPRM] | mibi_deepcell | none | Processed | MIBI |
| scRNA-seq (10x Genomics v2) | scRNAseq-10xGenomics-v2 | none | Primary | RNASeq |
| scRNA-seq (10x Genomics v3) | scRNAseq-10xGenomics-v3 | scRNA-Seq(10xGenomics), scRNA-Seq-10x, scRNAseq-10xGenomics | Primary | RNASeq |
| snATAC-seq [SnapATAC] | sn_atac_seq | sn_atac_seq_multiome_10x | Processed | ATACSeq |
| snRNA-seq (10x Genomics v3) | snRNAseq-10xGenomics-v3 | snRNAseq, snRNAseq-v3 | Primary | RNASeq |
| snRNA-seq [Salmon] | salmon_sn_rnaseq_10x | none | Primary | RNASeq |

Unmapped existing v1 data_type

| v1 Display Name in Portal | data_type | alt-names | Primary/Processed |
| --- | --- | --- | --- |
| CITE-Seq | CITE-Seq | none | Primary |
| CosMX (RNA) | CosMX (RNA) | none | Primary |
| DBiT-seq | DBiT-seq | none | Primary |
| FACS - Fluorescence-activated Cell Sorting | FACS - Fluorescence-activated Cell Sorting | none | Primary |
| GeoMX (RNA) | GeoMX (RNA) | none | Primary |
| Mint-ChIP | Mint-ChIP | none | Primary |
| SASP | SASP | none | Primary |
| image_pyramid | image_pyramid | none | n/a |
| Publication | publication | none | n/a |
| Publication Ancillary | publication_ancillary | none | n/a |

New pipelines from Column B

In v1, these would have been treated as values of data_type for derived datasets: a new Dataset node would have been created in UBKG and mapped to a Dataset Data Type node whose value is the new pipeline.

The plan is not to add information on these pipelines to UBKG.

| Pipeline | Dataset Type |
| --- | --- |
| salmon_rnaseq_10x_v2 | RNASeq |
| salmon_rnaseq_10x_v2_sn | RNASeq |
| sc_atac_seq_sn | ATACSeq |

New Datasets from Column C

These have not yet been defined in UBKG for SENNET.

Overloaded concepts

AlanSimmons commented 8 months ago

HUBMAP and SENNET soft assay data type mapping in UBKG

Results of query against local instance

AlanSimmons commented 8 months ago

Updated CSVs

The UBKG CSVs containing the new soft-assay hierarchy are available in Globus.