AlanSimmons commented 8 months ago

Statement of problem

Datasets are being reorganized to support dynamic pipeline assignment, also known as "soft assay type".

In the old infrastructure ("hard assay types", I guess), a static data_type/assay_type was created for every new combination of primary dataset and pipeline. For example, salmon_rnaseq_10x_v2 is a data_type for the result of a pipeline processing of primary datasets with data_type scRNAseq-genomics-v2.

In the new infrastructure:

- Datasets will be grouped into "Dataset Types", as defined in Column D of the Reference document.
- Instead of static values of data_type, a Rules Engine will dynamically characterize a dataset based on the Dataset Type and other ingest metadata.

Proposed solution

Create a new Dataset Type hierarchy, with a parent node named Dataset Type and child nodes for each of the types defined in Column D of the Reference.
Add all existing datasets to the corresponding Dataset Type node.

Optional:

Define a new data_type node that will indicate that the corresponding dataset has a soft assay type. Assign this data_type to all new dataset types.

AlanSimmons commented 8 months ago

Change log - HUBMAP

EFO Alignment (tangential to soft assay)

The following was required to allow for multiple hierarchies of dataset.

Nodes in range HUBMAP:C003000 to HUBMAP:C003032 had been created to organize datasets hierarchically at the request of a group led by Katy Börner, described as an attempt to align assays with EFO. The stated use case was a special Sankey chart. This hierarchy has yet to be used.

I created a new parent node for this hierarchy (HUBMAP:C002099), named EFO Alignment Dataset Hierarchy. I also related the first-level child nodes (e.g., HUBMAP:C003000) to HUBMAP:C002099. All existing "Dataset" nodes (children of HUBMAP:C000004) are also in the EFO Alignment Dataset Hierarchy.

Soft Assay Dataset Category, Soft Assay Dataset Type

HUBMAP:C003040 = Soft Assay Dataset Category, isa HUBMAP:C000004 (HuBMAP Dataset). Corresponds to Column A of the Pipeline Decision Rules document. I was informed that this is an arbitrary grouping; however, if I am asked to model something in UBKG that has a grouping column, I'm going to create a node hierarchy for that column in UBKG.
HUBMAP:C003041 = Soft Assay Dataset Type, isa HUBMAP:C000004 (HuBMAP Dataset). Corresponds to Column D.
HUBMAP:C003042 through HUBMAP:C003051 - isa HUBMAP:C003040 (Soft Assay Dataset Category)
HUBMAP:C003052 through X
- isa HUBMAP:C003041 (Soft Assay Dataset Type)
- mapped via isa to appropriate category. For example, HUBMAP:C003052 (RNASeq) isa HUBMAP:C003043 (RNA Sequencing)
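As a rough illustration of the hierarchy described above, the isa relationships can be sketched as a small in-memory graph. This is only a sketch: the `ISA` dictionary and the `add_isa` and `ancestors` helpers are hypothetical names for illustration and are not part of UBKG. Only codes quoted in this comment are used, and HUBMAP:C003043 is assumed to be a category node because it falls in the stated C003042 to C003051 range.

```python
from collections import defaultdict

# child code -> set of parent codes (isa edges)
ISA = defaultdict(set)


def add_isa(child, parent):
    """Record that `child` isa `parent`."""
    ISA[child].add(parent)


# Top-level soft assay nodes, both children of HuBMAP Dataset.
add_isa("HUBMAP:C003040", "HUBMAP:C000004")  # Soft Assay Dataset Category
add_isa("HUBMAP:C003041", "HUBMAP:C000004")  # Soft Assay Dataset Type
# A category node (RNA Sequencing).
add_isa("HUBMAP:C003043", "HUBMAP:C003040")
# A dataset type node (RNASeq): isa the type parent and isa its category.
add_isa("HUBMAP:C003052", "HUBMAP:C003041")
add_isa("HUBMAP:C003052", "HUBMAP:C003043")


def ancestors(code):
    """Return every node reachable from `code` by following isa edges."""
    seen = set()
    stack = [code]
    while stack:
        node = stack.pop()
        for parent in ISA.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

With these edges, `ancestors("HUBMAP:C003052")` contains both the Soft Assay Dataset Type parent and the RNA Sequencing category, which is the dual membership described in the list above.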

Existing (v1) Dataset data_type:Soft Assay Data Type mappings

In v1, each dataset is associated with a data_type (also called assay_type). Following are associations between existing datasets and the new Soft Assay Dataset Type, based on the v1 values of data_type.

The existing set of v1 values of data_type can be obtained with the API endpoint: https://ontology.api.hubmapconsortium.org/valueset?parent_sab=HUBMAP&parent_code=C004001&child_sabs=HUBMAP
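A minimal sketch of calling that endpoint from Python, using only the standard library. The function names `valueset_url` and `fetch_valueset` are hypothetical, and the shape of the JSON response is not described in this issue, so the sketch returns the parsed body without interpreting it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://ontology.api.hubmapconsortium.org/valueset"


def valueset_url(sab, parent_code="C004001"):
    """Build the valueset URL for one SAB (for example HUBMAP or SENNET)."""
    params = {"parent_sab": sab, "parent_code": parent_code, "child_sabs": sab}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_valueset(sab, timeout=30):
    """Fetch and parse the response; the response structure is not assumed."""
    with urlopen(valueset_url(sab), timeout=timeout) as response:
        return json.load(response)
```

Calling `valueset_url("HUBMAP")` reproduces the URL quoted above; `fetch_valueset` needs network access to the ontology API.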

Datasets in bold were not mentioned in the Pipeline Decision Rules spreadsheet.

v1 Display Name in Portal	data_type	alt-names	Primary/Processed	Soft Assay Dataset Type
Visium	Visium	none	Primary	Visium
sciRNA-seq [Salmon]	salmon_rnaseq_sciseq	none	Processed	RNASeq
Bulk RNA-seq [Salmon]	salmon_rnaseq_bulk	none	Processed	RNASeq
scRNA-seq (10x Genomics) [Salmon]	salmon_rnaseq_10x	none	Processed	RNASeq
snRNA-seq [Salmon] Dataset	salmon_sn_rnaseq_10x	salmon_rnaseq_10x_sn	Processed	RNASeq
sciATAC-seq [SnapATAC]	sc_atac_seq_sci	none	Primary	ATACSeq
Bulk ATAC-seq [BWA + MACS2]	bulk_atacseq	none	Processed	ATACSeq
CODEX [Cytokit + SPRM]	codex_cytokit	none	Processed	CODEX
CODEX [Cytokit + SPRM]	codex_cytokit_v1	none	Processed	CODEX
10x Multiome	10X Multiome	none	Primary	10x Multiome
Multiplexed IF Microscopy	MxIF	none	Primary	CyCIF
Cell DIVE	cell-dive	cell DIVE, Cell DIVE	Primary	Cell DIVE
CellDIVE [DeepCell + SPRM]	celldive_deepcell	none	Processed	Cell DIVE
PAS Stained Microscopy	PAS	PAS microscopy	Primary	Histology
MALDI IMS	MALDI-IMS	MALDI-IMS-neg, MALDI-IMS-pos	Primary	MALDI
SIMS-IMS	SIMS-IMS	SIMS	Primary	SIMS
DESI	DESI	DESI-IMS, DESI IMS	Primary	DESI
NanoDESI IMS	NanoDESI	none	Primary	DESI
Multiplex Ion Beam Imaging	MIBI	Multiplex Ion Beam Imaging, mibi	Primary	MIBI
Multiplex Ion Beam Imaging [DeepCell + SPRM]	mibi_deepcell	none	Processed	MIBI
Imaging Mass Cytometry (2D)	IMC	2D-IMC, Imaging Mass Cytometry	Primary	2D Imaging Mass Cytometry
LC-MS	LC-MS	none	Primary	LC-MS
Label-free LC-MS	lc-ms_label-free	none	Primary	LC-MS
Labeled LC-MS	lc-ms_labeled	none	Primary	LC-MS
Label-free LC-MS/MS	lc-ms-ms_label-free	none	Primary	LC-MS
Untargeted LC-MS	LC-MS-untargeted	none	Primary	LC-MS
Bulk RNA-seq	bulk-RNA	none	Primary	RNASeq
snATACseq (SNARE-seq2)	SNARE-ATACseq2	SNAREseq, SNARE-seq2, SNARE2-ATACseq	Primary	ATACSeq
snRNAseq (SNARE-seq2)	SNARE-RNAseq2	SNARE2-RNAseq	Primary	RNASeq
snATAC-seq (SNARE-seq2) [Lab Processed]	sc_atac_seq_snare_lab	none	Primary	ATACSeq
snRNA-seq (SNARE-seq2) [Lab Processed]	sc_rna_seq_snare_lab	none	Primary	RNASeq
snRNA-seq (SNARE-seq2) [Salmon]	salmon_rnaseq_snareseq	none	Processed	RNASeq
snATAC-seq (SNARE-seq2) [SnapATAC]	sc_atac_seq_snare	none	Processed	ATACSeq
scRNA-seq (10x Genomics v2)	scRNAseq-10xGenomics-v2	none	Primary	RNASeq
scRNA-seq (10x Genomics v3)	scRNAseq-10xGenomics-v3	none	Primary	RNASeq
sciATAC-seq	sciATACseq	none	Primary	ATACSeq
sciRNA-seq	sciRNAseq	none	Primary	RNASeq
snATAC-seq	snATACseq	none	Primary	ATACSeq
snATAC-seq [SnapATAC]	sn_atac_seq	sn_atac_seq_multiome_10x	Primary	ATACSeq
snRNA-seq (10x Genomics v2)	snRNAseq-10xGenomics-v2	snRNAseq-v2	Primary	RNASeq
snRNA-seq (10x Genomics v3)	snRNAseq-10xGenomics-v3	snRNAseq, snRNAseq-v3	Primary	RNASeq
Slide-seq	Slide-seq	none	Primary	RNASeq
Slide-seq [Salmon]	salmon_rnaseq_slideseq	none	Processed	RNASeq
Targeted Shotgun / Flow-injection LC-MS	Targeted-Shotgun-LC-MS	none	Primary	LC-MS
TMT LC-MS	TMT-LC-MS	none	Primary	LC-MS
LC-MS Bottom Up	LC-MS_bottom_up	LC-MS Bottom-Up	Primary	LC-MS
LC-MS Top Down	LC-MS_top_down	LC-MS Top-Down	Primary	LC-MS
Autofluorescence Microscopy	AF	none	Primary	Auto-fluorescence
Lightsheet Microscopy	Lightsheet	none	Primary	Lightsheet
Bulk ATAC-seq	ATACseq-bulk	bulkATACseq	Primary	ATACSeq
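To show how the table above could be consumed, here is a small sketch of a lookup from a v1 data_type (or one of its alt-names) to the Soft Assay Dataset Type. It copies only a few rows from the table; the names `ROWS`, `build_lookup`, and `soft_assay_dataset_type` are hypothetical, and exact, case-sensitive matching is an assumption rather than a documented rule.

```python
# (data_type, alt-names, Soft Assay Dataset Type), copied from the table above.
ROWS = [
    ("salmon_sn_rnaseq_10x", ["salmon_rnaseq_10x_sn"], "RNASeq"),
    ("MALDI-IMS", ["MALDI-IMS-neg", "MALDI-IMS-pos"], "MALDI"),
    ("cell-dive", ["cell DIVE", "Cell DIVE"], "Cell DIVE"),
    ("AF", [], "Auto-fluorescence"),
]


def build_lookup(rows):
    """Index every data_type and alt-name by its dataset type."""
    lookup = {}
    for data_type, alt_names, dataset_type in rows:
        for name in [data_type, *alt_names]:
            lookup[name] = dataset_type
    return lookup


LOOKUP = build_lookup(ROWS)


def soft_assay_dataset_type(v1_name):
    """Return the mapped dataset type, or None for unmapped v1 values."""
    return LOOKUP.get(v1_name)
```

Values that are not in the mapping, such as DART-FISH, return `None`, so the caller can decide how to treat unmapped data types.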

Unmapped existing v1 data_type

v1 Display Name in Portal	data_type	alt-names	Primary/Processed
DART-FISH	DART-FISH	none	Primary
image_pyramid	image_pyramid	none	n/a
Imaging Mass Cytometry (3D)	IMC3D	3D-IMC, 3D Imaging Mass Cytometry	Primary
NanoPOTS	NanoPOTS	none	Primary
seqFISH	seqFISH	none	Primary
seqFISH [Lab Processed]	seqFish_lab_processed	none	Primary
Whole Genome Sequencing	WGS	none	Primary
MS	MS	none	Primary
MS Bottom Up	MS_bottom_up	MS Bottom-Up	Primary
MS_top_down	MS_top_down	MS Top-Down	Primary
Publication	publication	none	n/a
Publication ancillary	publication_ancillary	none	n/a
GeoMX	GeoMX	none	Primary
Kaggle-1 Glomerulus Segmentation Dataset	pas_ftu_segmentation	none	Processed

New pipelines from Column B

In v1, these would have been considered values of data_type for derived datasets. A new Dataset node in UBKG would have been created, mapped to a Dataset Data Type node with value=the new pipeline.

The plan is not to add information on these pipelines to UBKG.

Pipeline	Dataset Type
salmon_rnaseq_10x_v2	RNASeq
salmon_rnaseq_10x_v2_sn	RNASeq
sc_atac_seq_sn	ATACSeq

New Datasets from Column C

These have not yet been defined in UBKG.

Visium no probes
Visium with probes
Visium (with probes)
Visium 🦄with probes🦄
10x v2 scRNAseq
10x v2 snRNA
PhenoCycler
MERFISH
H&E
nanoSPLITS
Confocal
Thick section Multiphoton MxIF
Second Harmonic Generation (SHG)
Enhanced Stimulated Raman Spectroscopy (SRS)
Molecular Cartography

Overloaded concepts

Node terms in the SimpleKnowledge spreadsheet used to build HUBMAP have to be unique. A consequence of this is that some nodes are of multiple types. (Note: "Dataset Data Type" is a kind of node that corresponds to the v1 data_type property.)

HUBMAP:C014103 (Visium)
- v1: isa Dataset Display Name, Dataset Data Type
- now: isa Soft Assay Dataset Type, Spatial Transcriptomics
HUBMAP:C014004 (10x Multiome)
- v1: isa Display Name
- now: isa Soft Assay Dataset Type, Soft Assay Dataset Category, 10x Multiome
- Note: the concept has an isa relationship with itself, because the string "10x Multiome" is assigned to both a category and a data type.
HUBMAP:C006503 (CODEX)
- v1: isa Dataset Data Type, Dataset Display Name
- now: isa Soft Assay Dataset Type, MxFBE
HUBMAP:C006303 (Cell DIVE)
- v1: isa Dataset Display Name
- now: isa Soft Assay Dataset Type, MxFBE
HUBMAP:C003047 (Histology)
- now: isa Soft Assay Dataset Category, Soft Assay Dataset Type, Histology
- Note: the concept has an isa relationship with itself, because the string "Histology" is assigned to both a category and a data type.
HUBMAP:C007804 (MIBI)
- v1: isa Dataset Data Type
- now: isa Soft Assay Dataset Type, MxNF
HUBMAP:C011902 (LC-MS)
- v1: isa Dataset Display Name, Dataset Data Type
- now: isa Soft Assay Dataset Type
HUBMAP:C003051 (Molecular Cartography)
- now: isa Soft Assay Dataset Category, Soft Assay Dataset Type, Molecular Cartography
HUBMAP:C007604 (Lightsheet)
- v1: isa Dataset Data Type
- now: isa Soft Assay Dataset Type, Single-cycle Flourescence Microscopy
HUBMAP:C012502 (DESI)
- v1: isa Dataset Data Type, Dataset Display Name
- now: isa Soft Assay Dataset Type
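Because some of these concepts now have an isa relationship with themselves, anything that walks the hierarchy should tolerate self-loops. A small sketch of how such nodes could be detected and filtered follows. The `ISA` dictionary and both helper names are hypothetical; the edges cover only relationships whose codes are quoted in this thread (for Visium, the code of its Spatial Transcriptomics category is not given here, so that edge is left out).

```python
# child code -> set of parent codes, limited to codes quoted in this thread.
ISA = {
    # 10x Multiome: isa Dataset Type, isa Dataset Category, isa itself
    "HUBMAP:C014004": {"HUBMAP:C003041", "HUBMAP:C003040", "HUBMAP:C014004"},
    # Histology: isa Dataset Category, isa Dataset Type, isa itself
    "HUBMAP:C003047": {"HUBMAP:C003040", "HUBMAP:C003041", "HUBMAP:C003047"},
    # Visium: isa Dataset Type (category edge omitted, code not listed here)
    "HUBMAP:C014103": {"HUBMAP:C003041"},
}


def self_referential(isa):
    """Return the codes that list themselves among their own parents."""
    return sorted(code for code, parents in isa.items() if code in parents)


def proper_parents(isa, code):
    """Return the parents of `code`, excluding any self-loop."""
    return isa.get(code, set()) - {code}
```

Here `self_referential(ISA)` reports the 10x Multiome and Histology concepts, and `proper_parents` gives a loop-free view for traversal.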

Other changes

HUBMAP:C004020 term "maldi" changed to "maldi_vitessce_hint" so that the string "MALDI" could be used as a dataset type.
HUBMAP:C002037 (cells) - synonym "histology" removed so that the string "Histology" can be used as a dataset type.
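The two changes above suggest that term uniqueness is checked without regard to case ("histology" blocked "Histology"). That is an inference from this change log, not a documented rule. Under that assumption, a collision check over (code, term) pairs could look like this sketch; `find_collisions` is a hypothetical helper, not part of the UBKG tooling.

```python
from collections import defaultdict


def find_collisions(terms):
    """Group codes by case-folded term and report terms used by more than one code.

    `terms` is an iterable of (code, term) pairs, where a term is either a
    preferred term or a synonym.
    """
    by_key = defaultdict(set)
    for code, term in terms:
        by_key[term.casefold()].add(code)
    return {key: sorted(codes) for key, codes in by_key.items() if len(codes) > 1}


# Before the change: "histology" was a synonym of HUBMAP:C002037 (cells).
before = [
    ("HUBMAP:C002037", "cells"),
    ("HUBMAP:C002037", "histology"),
    ("HUBMAP:C003047", "Histology"),
]
# After the change: the synonym has been removed.
after = [
    ("HUBMAP:C002037", "cells"),
    ("HUBMAP:C003047", "Histology"),
]
```

`find_collisions(before)` flags the Histology clash between the two codes, and `find_collisions(after)` is empty once the synonym is removed.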

AlanSimmons commented 8 months ago

Change Log - SENNET

EFO Alignment (tangential to soft assay)

See Change Log for HUBMAP

Soft Assay Dataset Category, Soft Assay Dataset Type

SENNET:C003040 = Soft Assay Dataset Category, isa SENNET:C000004 (HuBMAP Dataset). Corresponds to Column A of the Pipeline Decision Rules document.
SENNET:C003041 = Soft Assay Dataset Type, isa SENNET:C000004 (HuBMAP Dataset). Corresponds to Column D.
SENNET:C003042 through SENNET:C003051 - isa SENNET:C003040 (Soft Assay Dataset Category)
SENNET:C003052 through X
- isa SENNET:C003041 (Soft Assay Dataset Type)
- mapped via isa to appropriate category. For example, SENNET:C003052 (RNASeq) isa SENNET:C003043 (RNA Sequencing)

Existing (v1) Dataset data_type:Soft Assay Data Type mappings

Datasets in bold were not mentioned in the Pipeline Decision Rules spreadsheet.

Note: to obtain the current set of v1 data_type values, use this URL: https://ontology.api.hubmapconsortium.org/valueset?parent_sab=SENNET&parent_code=C004001&child_sabs=SENNET

v1 Display Name in Portal	data_type	alt-names	Primary/Processed	Soft Assay Dataset Type
Visium	Visium	none	Primary	Visium
Bulk RNA-seq [Salmon]	salmon_rnaseq_bulk	none	Primary	RNASeq
scRNA-seq (10x Genomics) [Salmon]	salmon_rnaseq_10x	none	Primary	RNASeq
CODEX [Cytokit + SPRM]	codex_cytokit	none	Processed	CODEX
CODEX [Cytokit + SPRM]	codex_cytokit_v1	none	Processed	CODEX
Multiplex Ion Beam Imaging	MIBI	Multiplex Ion Beam Imaging, mibi	Primary	MIBI
LC-MS	LC-MS	none	Primary	LC-MS
Lightsheet Microscopy	Lightsheet	none	Primary	Lightsheet
Bulk RNA-seq	bulk-RNA	none	Primary	RNASeq
snATAC-seq	snATAC-seq	none	Primary	ATACSeq
snRNA-seq	snRNA-seq	none	Primary	RNASeq
H&E Slide Staining	Stained Slides	none	Primary	Histology
Multiplex Ion Beam Imaging [DeepCell + SPRM]	mibi_deepcell	none	Processed	MIBI
scRNA-seq (10x Genomics v2)	scRNAseq-10xGenomics-v2	none	Primary	RNASeq
scRNA-seq (10x Genomics v3)	scRNAseq-10xGenomics-v3	scRNA-Seq(10xGenomics), scRNA-Seq-10x, scRNAseq-10xGenomics	Primary	RNASeq
snATAC-seq [SnapATAC]	sn_atac_seq	sn_atac_seq_multiome_10x	Processed	ATACSeq
snRNA-seq (10x Genomics v3)	snRNAseq-10xGenomics-v3	snRNAseq, snRNAseq-v3	Primary	RNASeq
snRNA-seq [Salmon]	salmon_sn_rnaseq_10x	none	Primary	RNASeq

Unmapped existing v1 data_type

v1 Display Name in Portal	data_type	alt-names	Primary/Processed
CITE-Seq	CITE-Seq	none	Primary
CosMX (RNA)	CosMX (RNA)	none	Primary
DBiT-seq	DBiT-seq	none	Primary
FACS - Fluorescence-activated Cell Sorting	FACS - Fluorescence-activated Cell Sorting	none	Primary
GeoMX (RNA)	GeoMX (RNA)	none	Primary
Mint-ChIP	Mint-ChIP	none	Primary
SASP	SASP	none	Primary
image_pyramid	image_pyramid	none	n/a
Publication	publication	none	n/a
Publication Ancillary	publication_ancillary	none	n/a

New pipelines from Column B

In v1, these would have been considered values of data_type for derived datasets. A new Dataset node in UBKG would have been created, mapped to a Dataset Data Type node with value=the new pipeline.

The plan is not to add information on these pipelines to UBKG.

Pipeline	Dataset Type
salmon_rnaseq_10x_v2	RNASeq
salmon_rnaseq_10x_v2_sn	RNASeq
sc_atac_seq_sn	ATACSeq

New Datasets from Column C

These have not yet been defined in UBKG for SENNET.

Visium no probes
Visium with probes
Visium (with probes)
Visium 🦄with probes🦄
10x v2 scRNAseq
10x v2 snRNA
PhenoCycler
MERFISH
H&E
nanoSPLITS
Confocal
Thick section Multiphoton MxIF
Second Harmonic Generation (SHG)
Enhanced Stimulated Raman Spectroscopy (SRS)
Molecular Cartography
Auto-fluorescence

Overloaded concepts

SENNET:C014004
- v1: isa Display Name
- now: isa 10x Multiome
SENNET:C006504
- v1: isa Dataset Data Type,Dataset Display Name
- now: isa Soft Assay Dataset Type

AlanSimmons commented 8 months ago

HUBMAP and SENNET soft assay data type mapping in UBKG

Results of query against local instance

AlanSimmons commented 8 months ago

Updated CSVs

The UBKG CSVs containing the new soft-assay hierarchy are in Globus.

x-atlas-consortia / ubkg-neo4j

Enhancements to support "soft assay type": "Dataset Type" hierarchy #40
