x-atlas-consortia / ubkg-neo4j

A container implementation to serve the Unified Biomedical Knowledge Graph in Neo4j
MIT License

Enhancements to support "soft assay type": "Dataset Type" hierarchy #40

Closed AlanSimmons closed 1 month ago

AlanSimmons commented 8 months ago

Statement of problem

Datasets are being reorganized to support dynamic pipeline assignment--aka "soft assay type".

Reference

In the old infrastructure ("hard assay types", I guess), a static data_type/assay_type was created for every new combination of primary dataset and pipeline. For example, salmon_rnaseq_10x_v2 is the data_type for the output of a pipeline that processes primary datasets with data_type scRNAseq-genomics-v2.

In the new infrastructure,

  1. Datasets will be grouped into "Dataset Types", as defined in Column D of the Reference document.
  2. Instead of static values of data_type, a Rules Engine will dynamically characterize a dataset based on the Dataset Type and other ingest metadata.
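
The idea of replacing static data_type values with rule-based characterization can be sketched roughly as follows. This is only an illustration: the `Rule` class, the `characterize` function, and the `is_single_nucleus` metadata key are hypothetical names, and the rule conditions are made up. Only the pipeline and Dataset Type names come from this issue. The actual Rules Engine may work quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    """One hypothetical rule: applies to a Dataset Type when `matches` is true."""
    dataset_type: str
    matches: Callable[[dict], bool]
    result: str


# Illustrative rules only. The conditions are assumptions, not the real rule set.
RULES = [
    Rule("RNASeq", lambda md: md.get("is_single_nucleus") is True, "salmon_rnaseq_10x_v2_sn"),
    Rule("RNASeq", lambda md: True, "salmon_rnaseq_10x_v2"),
    Rule("ATACSeq", lambda md: True, "sc_atac_seq_sn"),
]


def characterize(dataset_type: str, metadata: dict) -> Optional[str]:
    """Return the result of the first rule that matches, or None if no rule applies."""
    for rule in RULES:
        if rule.dataset_type == dataset_type and rule.matches(metadata):
            return rule.result
    return None
```

The point of the sketch is that the outcome is computed from the Dataset Type plus metadata at lookup time, so no new static data_type has to be minted for each primary dataset and pipeline combination.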

Proposed solution

  1. Create a new Dataset Type hierarchy, with a parent node named Dataset Type and child nodes for each of the types defined in Column D of the Reference.
  2. Add all existing datasets to the corresponding Dataset Type node.
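
As a rough picture of the proposed hierarchy, the sketch below groups a handful of v1 data_type values under their Dataset Type, with a single root named Dataset Type. The pairs are taken from the mapping tables later in this thread. The dictionary layout and the `build_hierarchy` helper are assumptions made for illustration and say nothing about how the nodes are actually stored in UBKG.

```python
from collections import defaultdict

# A few v1 data_type -> Dataset Type pairs from the mapping tables in this issue.
V1_TO_DATASET_TYPE = {
    "salmon_rnaseq_bulk": "RNASeq",
    "bulk-RNA": "RNASeq",
    "bulk_atacseq": "ATACSeq",
    "codex_cytokit": "CODEX",
}

ROOT = "Dataset Type"  # name of the proposed parent node


def build_hierarchy(mapping):
    """Group data_type values under their Dataset Type, below a single root."""
    children = defaultdict(list)
    for data_type, dataset_type in mapping.items():
        children[dataset_type].append(data_type)
    return {ROOT: {name: sorted(members) for name, members in sorted(children.items())}}


hierarchy = build_hierarchy(V1_TO_DATASET_TYPE)
```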

Optional:

  1. Define a new data_type node that will indicate that the corresponding dataset has a soft assay type. Assign this data_type to all new dataset types.

AlanSimmons commented 8 months ago

Change log - HUBMAP

EFO Alignment (tangential to soft assay)

The following was required to allow for multiple dataset hierarchies.

Nodes in range HUBMAP:C003000 to HUBMAP:C003032 had been created to organize datasets hierarchically at the request of a group led by Katy Börner, described as an attempt to align assays with EFO. The stated use case was a special Sankey chart. This hierarchy has yet to be used.

I created a new parent node for this hierarchy (HUBMAP:C002099), named EFO Alignment Dataset Hierarchy. I also related the first-level child nodes (e.g., HUBMAP:C003000) to HUBMAP:C002099. All existing "Dataset" nodes (children of HUBMAP:C000004) are also in the EFO Alignment Dataset Hierarchy.

Soft Assay Dataset Category, Soft Assay Dataset Type

Existing (v1) Dataset data_type:Soft Assay Data Type mappings

In v1, each dataset is associated with a data_type (also called assay_type). Following are associations between existing datasets and the new Soft Assay Dataset Type, based on the v1 values of data_type.

The existing set of v1 values of data_type can be obtained with the API endpoint: https://ontology.api.hubmapconsortium.org/valueset?parent_sab=HUBMAP&parent_code=C004001&child_sabs=HUBMAP
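
For convenience, the endpoint URL can be assembled and called from a short script. The sketch below uses only the Python standard library. The helper names `valueset_url` and `fetch_valueset` are hypothetical, and the shape of the JSON response is not described in this issue, so the result is returned unparsed beyond JSON decoding.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://ontology.api.hubmapconsortium.org/valueset"


def valueset_url(sab: str, parent_code: str = "C004001") -> str:
    """Build the valueset URL for one SAB (for example HUBMAP or SENNET)."""
    params = {"parent_sab": sab, "parent_code": parent_code, "child_sabs": sab}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_valueset(sab: str, parent_code: str = "C004001", timeout: float = 30.0):
    """Call the endpoint and decode the JSON body (response structure not assumed)."""
    with urlopen(valueset_url(sab, parent_code), timeout=timeout) as response:
        return json.load(response)
```

Calling `valueset_url("HUBMAP")` reproduces the URL above, and passing `"SENNET"` gives the corresponding SENNET query.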

Datasets in bold were not mentioned in the Pipeline Decision Rules spreadsheet.

| v1 Display Name in Portal | data_type | alt-names | Primary/Processed | Soft Assay Dataset Type |
| --- | --- | --- | --- | --- |
| Visium | Visium | none | Primary | Visium |
| sciRNA-seq [Salmon] | salmon_rnaseq_sciseq | none | Processed | RNASeq |
| Bulk RNA-seq [Salmon] | salmon_rnaseq_bulk | none | Processed | RNASeq |
| scRNA-seq (10x Genomics) [Salmon] | salmon_rnaseq_10x | none | Processed | RNASeq |
| snRNA-seq [Salmon] Dataset | salmon_sn_rnaseq_10x | salmon_rnaseq_10x_sn | Processed | RNASeq |
| sciATAC-seq [SnapATAC] | sc_atac_seq_sci | none | Primary | ATACSeq |
| Bulk ATAC-seq [BWA + MACS2] | bulk_atacseq | none | Processed | ATACSeq |
| CODEX [Cytokit + SPRM] | codex_cytokit | none | Processed | CODEX |
| CODEX [Cytokit + SPRM] | codex_cytokit_v1 | none | Processed | CODEX |
| 10x Multiome | 10X Multiome | none | Primary | 10x Multiome |
| Multiplexed IF Microscopy | MxIF | none | Primary | CyCIF |
| Cell DIVE | cell-dive | cell DIVE, Cell DIVE | Primary | Cell DIVE |
| CellDIVE [DeepCell + SPRM] | celldive_deepcell | none | Processed | Cell DIVE |
| PAS Stained Microscopy | PAS | PAS microscopy | Primary | Histology |
| MALDI IMS | MALDI-IMS | MALDI-IMS-neg, MALDI-IMS-pos | Primary | MALDI |
| SIMS-IMS | SIMS-IMS | SIMS | Primary | SIMS |
| DESI | DESI | DESI-IMS, DESI IMS | Primary | DESI |
| NanoDESI IMS | NanoDESI | none | Primary | DESI |
| Multiplex Ion Beam Imaging | MIBI | Multiplex Ion Beam Imaging, mibi | Primary | MIBI |
| Multiplex Ion Beam Imaging [DeepCell + SPRM] | mibi_deepcell | none | Processed | MIBI |
| Imaging Mass Cytometry (2D) | IMC | 2D-IMC, Imaging Mass Cytometry | Primary | 2D Imaging Mass Cytometry |
| LC-MS | LC-MS | none | Primary | LC-MS |
| Label-free LC-MS | lc-ms_label-free | none | Primary | LC-MS |
| Labeled LC-MS | lc-ms_labeled | none | Primary | LC-MS |
| Label-free LC-MS/MS | lc-ms-ms_label-free | none | Primary | LC-MS |
| Untargeted LC-MS | LC-MS-untargeted | none | Primary | LC-MS |
| Bulk RNA-seq | bulk-RNA | none | Primary | RNASeq |
| snATACseq (SNARE-seq2) | SNARE-ATACseq2 | SNAREseq, SNARE-seq2, SNARE2-ATACseq | Primary | ATACSeq |
| snRNAseq (SNARE-seq2) | SNARE-RNAseq2 | SNARE2-RNAseq | Primary | RNASeq |
| snATAC-seq (SNARE-seq2) [Lab Processed] | sc_atac_seq_snare_lab | none | Primary | ATACSeq |
| snRNA-seq (SNARE-seq2) [Lab Processed] | sc_rna_seq_snare_lab | none | Primary | RNASeq |
| snRNA-seq (SNARE-seq2) [Salmon] | salmon_rnaseq_snareseq | none | Processed | RNASeq |
| snATAC-seq (SNARE-seq2) [SnapATAC] | sc_atac_seq_snare | none | Processed | ATACSeq |
| scRNA-seq (10x Genomics v2) | scRNAseq-10xGenomics-v2 | none | Primary | RNASeq |
| scRNA-seq (10x Genomics v3) | scRNAseq-10xGenomics-v3 | none | Primary | RNASeq |
| sciATAC-seq | sciATACseq | none | Primary | ATACSeq |
| sciRNA-seq | sciRNAseq | none | Primary | RNASeq |
| snATAC-seq | snATACseq | none | Primary | ATACSeq |
| snATAC-seq [SnapATAC] | sn_atac_seq | sn_atac_seq_multiome_10x | Primary | ATACSeq |
| snRNA-seq (10x Genomics v2) | snRNAseq-10xGenomics-v2 | snRNAseq-v2 | Primary | RNASeq |
| snRNA-seq (10x Genomics v3) | snRNAseq-10xGenomics-v3 | snRNAseq, snRNAseq-v3 | Primary | RNASeq |
| Slide-seq | Slide-seq | none | Primary | RNASeq |
| Slide-seq [Salmon] | salmon_rnaseq_slideseq | none | Processed | RNASeq |
| Targeted Shotgun / Flow-injection LC-MS | Targeted-Shotgun-LC-MS | none | Primary | LC-MS |
| TMT LC-MS | TMT-LC-MS | none | Primary | LC-MS |
| LC-MS Bottom Up | LC-MS_bottom_up | LC-MS Bottom-Up | Primary | LC-MS |
| LC-MS Top Down | LC-MS_top_down | LC-MS Top-Down | Primary | LC-MS |
| Autofluorescence Microscopy | AF | none | Primary | Auto-fluorescence |
| Lightsheet Microscopy | Lightsheet | none | Primary | Lightsheet |
| Bulk ATAC-seq | ATACseq-bulk | bulkATACseq | Primary | ATACSeq |
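
One way to use a mapping like this is to resolve either a data_type or one of its alt-names to the Soft Assay Dataset Type. The sketch below does that for four rows copied from the table. It assumes that alt-names are exact, case-sensitive aliases of the data_type in the same row, which this issue does not state explicitly. `ROWS` and `build_lookup` are hypothetical names.

```python
# (data_type, alt-names, Soft Assay Dataset Type) for a few rows of the table above.
ROWS = [
    ("cell-dive", ["cell DIVE", "Cell DIVE"], "Cell DIVE"),
    ("MALDI-IMS", ["MALDI-IMS-neg", "MALDI-IMS-pos"], "MALDI"),
    ("DESI", ["DESI-IMS", "DESI IMS"], "DESI"),
    ("SNARE-RNAseq2", ["SNARE2-RNAseq"], "RNASeq"),
]


def build_lookup(rows):
    """Map every data_type and alt-name to its Dataset Type, rejecting conflicts."""
    lookup = {}
    for data_type, alt_names, dataset_type in rows:
        for name in [data_type, *alt_names]:
            if name in lookup and lookup[name] != dataset_type:
                raise ValueError(f"{name!r} maps to more than one Dataset Type")
            lookup[name] = dataset_type
    return lookup


LOOKUP = build_lookup(ROWS)
```

The conflict check matters because a name that appeared in two rows with different Dataset Types would otherwise be silently overwritten by the later row.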

Unmapped existing v1 data_type

| v1 Display Name in Portal | data_type | alt-names | Primary/Processed |
| --- | --- | --- | --- |
| DART-FISH | DART-FISH | none | Primary |
| image_pyramid | image_pyramid | none | n/a |
| Imaging Mass Cytometry (3D) | IMC3D | 3D-IMC, 3D Imaging Mass Cytometry | Primary |
| NanoPOTS | NanoPOTS | none | Primary |
| seqFISH | seqFISH | none | Primary |
| seqFISH [Lab Processed] | seqFish_lab_processed | none | Primary |
| Whole Genome Sequencing | WGS | none | Primary |
| MS | MS | none | Primary |
| MS Bottom Up | MS_bottom_up | MS Bottom-Up | Primary |
| MS_top_down | MS_top_down | MS Top-Down | Primary |
| Publication | publication | none | n/a |
| Publication ancillary | publication_ancillary | none | n/a |
| GeoMX | GeoMX | none | Primary |
| Kaggle-1 Glomerulus Segmentation Dataset | pas_ftu_segmentation | none | Processed |

New pipelines from Column B

In v1, these would have been considered values of data_type for derived datasets: for each one, a new Dataset node would have been created in UBKG and mapped to a Dataset Data Type node whose value is the new pipeline.

The plan is not to add information on these pipelines to UBKG.

| Pipeline | Dataset Type |
| --- | --- |
| salmon_rnaseq_10x_v2 | RNASeq |
| salmon_rnaseq_10x_v2_sn | RNASeq |
| sc_atac_seq_sn | ATACSeq |

New Datasets from Column C

These have not yet been defined in UBKG.

Overloaded concepts

Node terms in the SimpleKnowledge spreadsheet used to build HUBMAP have to be unique. A consequence of this is that some nodes have multiple types. (Note: "Dataset Data Type" is a kind of node that corresponds to the v1 data_type property.)

Other changes

AlanSimmons commented 8 months ago

Change Log - SENNET

EFO Alignment (tangential to soft assay)

See Change Log for HUBMAP

Soft Assay Dataset Category, Soft Assay Dataset Type

Existing (v1) Dataset data_type:Soft Assay Data Type mappings

Datasets in bold were not mentioned in the Pipeline Decision Rules spreadsheet.

Note: to obtain the current set of v1 data_type values, use this URL: https://ontology.api.hubmapconsortium.org/valueset?parent_sab=SENNET&parent_code=C004001&child_sabs=SENNET

| v1 Display Name in Portal | data_type | alt-names | Primary/Processed | Soft Assay Dataset Type |
| --- | --- | --- | --- | --- |
| Visium | Visium | none | Primary | Visium |
| Bulk RNA-seq [Salmon] | salmon_rnaseq_bulk | none | Primary | RNASeq |
| scRNA-seq (10x Genomics) [Salmon] | salmon_rnaseq_10x | | Primary | RNASeq |
| CODEX [Cytokit + SPRM] | codex_cytokit | none | Processed | CODEX |
| CODEX [Cytokit + SPRM] | codex_cytokit_v1 | none | Processed | CODEX |
| Multiplex Ion Beam Imaging | MIBI | Multiplex Ion Beam Imaging, mibi | Primary | MIBI |
| LC-MS | LC-MS | none | Primary | LC-MS |
| Lightsheet Microscopy | Lightsheet | none | Primary | Lightsheet |
| Bulk RNA-seq | bulk-RNA | none | Primary | RNASeq |
| snATAC-seq | snATAC-seq | none | Primary | ATACSeq |
| snRNA-seq | snRNA-seq | none | Primary | RNASeq |
| H&E Slide Staining | Stained Slides | none | Primary | Histology |
| Multiplex Ion Beam Imaging [DeepCell + SPRM] | mibi_deepcell | none | Processed | MIBI |
| scRNA-seq (10x Genomics v2) | scRNAseq-10xGenomics-v2 | none | Primary | RNASeq |
| scRNA-seq (10x Genomics v3) | scRNAseq-10xGenomics-v3 | scRNA-Seq(10xGenomics), scRNA-Seq-10x, scRNAseq-10xGenomics | Primary | RNASeq |
| snATAC-seq [SnapATAC] | sn_atac_seq | sn_atac_seq_multiome_10x | Processed | ATACSeq |
| snRNA-seq (10x Genomics v3) | snRNAseq-10xGenomics-v3 | snRNAseq, snRNAseq-v3 | Primary | RNASeq |
| snRNA-seq [Salmon] | salmon_sn_rnaseq_10x | none | Primary | RNASeq |

Unmapped existing v1 data_type

| v1 Display Name in Portal | data_type | alt-names | Primary/Processed |
| --- | --- | --- | --- |
| CITE-Seq | CITE-Seq | none | Primary |
| CosMX (RNA) | CosMX (RNA) | none | Primary |
| DBiT-seq | DBiT-seq | none | Primary |
| FACS - Fluorescence-activated Cell Sorting | FACS - Fluorescence-activated Cell Sorting | none | Primary |
| GeoMX (RNA) | GeoMX (RNA) | none | Primary |
| Mint-ChIP | Mint-ChIP | none | Primary |
| SASP | SASP | none | Primary |
| image_pyramid | image_pyramid | none | n/a |
| Publication | publication | none | n/a |
| Publication Ancillary | publication_ancillary | none | n/a |

New pipelines from Column B

In v1, these would have been considered values of data_type for derived datasets: for each one, a new Dataset node would have been created in UBKG and mapped to a Dataset Data Type node whose value is the new pipeline.

The plan is not to add information on these pipelines to UBKG.

| Pipeline | Dataset Type |
| --- | --- |
| salmon_rnaseq_10x_v2 | RNASeq |
| salmon_rnaseq_10x_v2_sn | RNASeq |
| sc_atac_seq_sn | ATACSeq |

New Datasets from Column C

These have not yet been defined in UBKG for SENNET.

Overloaded concepts

AlanSimmons commented 8 months ago

HUBMAP and SENNET soft assay data type mapping in UBKG

Results of query against local instance

AlanSimmons commented 8 months ago

Updated CSVs

The UBKG CSVs containing the new soft-assay hierarchy are in Globus.