Closed AlanSimmons closed 1 month ago
The following changes were required to allow for multiple dataset hierarchies.
Nodes in the range HUBMAP:C003000 to HUBMAP:C003032 had been created to organize datasets hierarchically at the request of a group led by Katy Börner, described as an attempt to align assays with EFO. The stated use case was a special Sankey chart. This hierarchy has not yet been used.
I created a new parent node for this hierarchy (HUBMAP:C002099), named EFO Alignment Dataset Hierarchy. I also related the first-level child nodes (e.g., HUBMAP:C003000) to HUBMAP:C002099. All existing "Dataset" nodes (children of HUBMAP:C000004) are also in the EFO Alignment Dataset Hierarchy.
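
To illustrate what membership in more than one hierarchy means, here is a minimal sketch that stores child-to-parent relationships and collects all ancestors of a node. The codes HUBMAP:C002099, HUBMAP:C003000, and HUBMAP:C000004 come from the text above; `HUBMAP:DATASET_X` is a hypothetical placeholder, and its attachment under HUBMAP:C003000 is an assumption made only for the example, not the actual UBKG relationship.

```python
# child code -> set of parent codes; a node may have parents in several hierarchies
PARENTS = {
    "HUBMAP:C003000": {"HUBMAP:C002099"},  # first-level child of the new parent node
    # Hypothetical dataset node: a child of "Dataset" (C000004) that is also
    # placed in the EFO Alignment Dataset Hierarchy (attachment point assumed).
    "HUBMAP:DATASET_X": {"HUBMAP:C000004", "HUBMAP:C003000"},
}


def ancestors(code, parents=PARENTS):
    """Return every node reachable from `code` by following parent links."""
    seen = set()
    stack = [code]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


dataset_ancestors = ancestors("HUBMAP:DATASET_X")
in_efo_hierarchy = "HUBMAP:C002099" in dataset_ancestors
in_dataset_hierarchy = "HUBMAP:C000004" in dataset_ancestors
```

With these assumed edges, the placeholder node is found in both hierarchies, which is the property the new parent node is meant to allow.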
In v1, each dataset is associated with a data_type (also called assay_type). The following are the associations between existing datasets and the new Soft Assay Dataset Type, based on the v1 values of data_type.
The existing set of v1 values of data_type can be obtained with the API endpoint: https://ontology.api.hubmapconsortium.org/valueset?parent_sab=HUBMAP&parent_code=C004001&child_sabs=HUBMAP
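
As a small sketch of how that URL is put together, the helper below rebuilds it from its three query parameters so the same call can be reused for another source abbreviation (SAB). The function name `valueset_url` is hypothetical; only the base URL and the parameter names are taken from the endpoint above, and nothing is assumed about the shape of the response.

```python
from urllib.parse import urlencode

BASE_URL = "https://ontology.api.hubmapconsortium.org/valueset"


def valueset_url(sab, parent_code="C004001"):
    """Build the valueset URL for one SAB (hypothetical helper)."""
    query = urlencode(
        {"parent_sab": sab, "parent_code": parent_code, "child_sabs": sab}
    )
    return f"{BASE_URL}?{query}"


hubmap_url = valueset_url("HUBMAP")
print(hubmap_url)
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen`.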
Datasets in bold were not mentioned in the Pipeline Decision Rules spreadsheet.
v1 Display Name in Portal | data_type | alt-names | Primary/Processed | Soft Assay Dataset Type | |
---|---|---|---|---|---|
Visium | Visium | none | Primary | Visium | |
sciRNA-seq [Salmon] | salmon_rnaseq_sciseq | none | Processed | RNASeq | |
Bulk RNA-seq [Salmon] | salmon_rnaseq_bulk | none | Processed | RNASeq | |
scRNA-seq (10x Genomics) [Salmon] | salmon_rnaseq_10x | none | Processed | RNASeq | |
snRNA-seq [Salmon] Dataset | salmon_sn_rnaseq_10x | salmon_rnaseq_10x_sn | Processed | RNASeq | |
sciATAC-seq [SnapATAC] | sc_atac_seq_sci | none | Primary | ATACSeq | |
Bulk ATAC-seq [BWA + MACS2] | bulk_atacseq | none | Processed | ATACSeq | |
CODEX [Cytokit + SPRM] | codex_cytokit | none | Processed | CODEX | |
CODEX [Cytokit + SPRM] | codex_cytokit_v1 | none | Processed | CODEX | |
10x Multiome | 10X Multiome | none | Primary | 10x Multiome | |
Multiplexed IF Microscopy | MxIF | none | Primary | CyCIF | |
Cell DIVE | cell-dive | cell DIVE, Cell DIVE | Primary | Cell DIVE | |
CellDIVE [DeepCell + SPRM] | celldive_deepcell | none | Processed | Cell DIVE | |
PAS Stained Microscopy | PAS | PAS microscopy | Primary | Histology | |
MALDI IMS | MALDI-IMS | MALDI-IMS-neg, MALDI-IMS-pos | Primary | MALDI | |
SIMS-IMS | SIMS-IMS | SIMS | Primary | SIMS | |
DESI | DESI | DESI-IMS, DESI IMS | Primary | DESI | |
NanoDESI IMS | NanoDESI | none | Primary | DESI | |
Multiplex Ion Beam Imaging | MIBI | Multiplex Ion Beam Imaging, mibi | Primary | MIBI | |
Multiplex Ion Beam Imaging [DeepCell + SPRM] | mibi_deepcell | none | Processed | MIBI | |
Imaging Mass Cytometry (2D) | IMC | 2D-IMC, Imaging Mass Cytometry | Primary | 2D Imaging Mass Cytometry | |
LC-MS | LC-MS | none | Primary | LC-MS | |
Label-free LC-MS | lc-ms_label-free | none | Primary | LC-MS | |
Labeled LC-MS | lc-ms_labeled | none | Primary | LC-MS | |
Label-free LC-MS/MS | lc-ms-ms_label-free | none | Primary | LC-MS | |
Untargeted LC-MS | LC-MS-untargeted | none | Primary | LC-MS | |
Bulk RNA-seq | bulk-RNA | none | Primary | RNASeq | |
snATACseq (SNARE-seq2) | SNARE-ATACseq2 | SNAREseq, SNARE-seq2, SNARE2-ATACseq | Primary | ATACSeq | |
snRNAseq (SNARE-seq2) | SNARE-RNAseq2 | SNARE2-RNAseq | Primary | RNASeq | |
snATAC-seq (SNARE-seq2) [Lab Processed] | sc_atac_seq_snare_lab | none | Primary | ATACSeq | |
snRNA-seq (SNARE-seq2) [Lab Processed] | sc_rna_seq_snare_lab | none | Primary | RNASeq | |
snRNA-seq (SNARE-seq2) [Salmon] | salmon_rnaseq_snareseq | none | Processed | RNASeq | |
snATAC-seq (SNARE-seq2) [SnapATAC] | sc_atac_seq_snare | none | Processed | ATACSeq | |
scRNA-seq (10x Genomics v2) | scRNAseq-10xGenomics-v2 | none | Primary | RNASeq | |
scRNA-seq (10x Genomics v3) | scRNAseq-10xGenomics-v3 | none | Primary | RNASeq | |
sciATAC-seq | sciATACseq | none | Primary | ATACSeq | |
sciRNA-seq | sciRNAseq | none | Primary | RNASeq | |
snATAC-seq | snATACseq | none | Primary | ATACSeq | |
snATAC-seq [SnapATAC] | sn_atac_seq | sn_atac_seq_multiome_10x | Primary | ATACSeq | |
snRNA-seq (10x Genomics v2) | snRNAseq-10xGenomics-v2 | snRNAseq-v2 | Primary | RNASeq | |
snRNA-seq (10x Genomics v3) | snRNAseq-10xGenomics-v3 | snRNAseq, snRNAseq-v3 | Primary | RNASeq | |
Slide-seq | Slide-seq | none | Primary | RNASeq | |
Slide-seq [Salmon] | salmon_rnaseq_slideseq | none | Processed | RNASeq | |
Targeted Shotgun / Flow-injection LC-MS | Targeted-Shotgun-LC-MS | none | Primary | LC-MS | |
TMT LC-MS | TMT-LC-MS | none | Primary | LC-MS | |
LC-MS Bottom Up | LC-MS_bottom_up | LC-MS Bottom-Up | Primary | LC-MS | |
LC-MS Top Down | LC-MS_top_down | LC-MS Top-Down | Primary | LC-MS | |
Autofluorescence Microscopy | AF | none | Primary | Auto-fluorescence | |
Lightsheet Microscopy | Lightsheet | none | Primary | Lightsheet | |
Bulk ATAC-seq | ATACseq-bulk | bulkATACseq | Primary | ATACSeq |
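
The table above amounts to a lookup from a v1 data_type, or any of its alt-names, to a Soft Assay Dataset Type. A minimal sketch of that lookup follows, using a handful of rows copied from the table. The function names are hypothetical, and exact case-sensitive matching is an assumption (the alt-names "cell DIVE" and "Cell DIVE" are listed separately, which suggests case matters).

```python
# (data_type, alt-names, Soft Assay Dataset Type), copied from a few table rows
ROWS = [
    ("salmon_sn_rnaseq_10x", ["salmon_rnaseq_10x_sn"], "RNASeq"),
    ("MALDI-IMS", ["MALDI-IMS-neg", "MALDI-IMS-pos"], "MALDI"),
    ("SNARE-ATACseq2", ["SNAREseq", "SNARE-seq2", "SNARE2-ATACseq"], "ATACSeq"),
    ("cell-dive", ["cell DIVE", "Cell DIVE"], "Cell DIVE"),
    ("codex_cytokit", [], "CODEX"),
]


def build_lookup(rows):
    """Map every data_type and alt-name to its Soft Assay Dataset Type."""
    lookup = {}
    for data_type, alt_names, dataset_type in rows:
        for name in [data_type, *alt_names]:
            # Refuse to silently overwrite a name that maps somewhere else.
            if lookup.get(name, dataset_type) != dataset_type:
                raise ValueError(f"conflicting mapping for {name!r}")
            lookup[name] = dataset_type
    return lookup


LOOKUP = build_lookup(ROWS)


def soft_assay_dataset_type(name):
    """Return the Soft Assay Dataset Type for a name, or None if unmapped."""
    return LOOKUP.get(name)
```

Returning `None` for an unmapped name mirrors the datasets that have no Soft Assay Dataset Type column.
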
v1 Display Name in Portal | data_type | alt-names | Primary/Processed |
---|---|---|---|
DART-FISH | DART-FISH | none | Primary |
image_pyramid | image_pyramid | none | n/a |
Imaging Mass Cytometry (3D) | IMC3D | 3D-IMC, 3D Imaging Mass Cytometry | Primary |
NanoPOTS | NanoPOTS | none | Primary |
seqFISH | seqFISH | none | Primary |
seqFISH [Lab Processed] | seqFish_lab_processed | none | Primary |
Whole Genome Sequencing | WGS | none | Primary |
MS | MS | none | Primary |
MS Bottom Up | MS_bottom_up | MS Bottom-Up | Primary |
MS_top_down | MS_top_down | MS Top-Down | Primary |
Publication | publication | none | n/a |
Publication ancillary | publication_ancillary | none | n/a |
GeoMX | GeoMX | none | Primary |
Kaggle-1 Glomerulus Segmentation Dataset | pas_ftu_segmentation | none | Processed |
In v1, these would have been considered values of data_type for derived datasets. A new Dataset node in UBKG would have been created, mapped to a Dataset Data Type node with value=the new pipeline.
The plan is not to add information on these pipelines to UBKG.
Pipeline | Dataset Type |
---|---|
salmon_rnaseq_10x_v2 | RNASeq |
salmon_rnaseq_10x_v2_sn | RNASeq |
sc_atac_seq_sn | ATACSeq |
These have not yet been defined in UBKG.
Node terms in the SimpleKnowledge spreadsheet used to build HUBMAP have to be unique. As a consequence, some nodes have more than one type. (Note: "Dataset Data Type" is a kind of node that corresponds to the v1 data_type property.)
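
To show why unique terms lead to multiply typed nodes, here is a minimal sketch that merges (term, node type) pairs by term. The pairs are hypothetical and do not reflect the real spreadsheet layout; "Visium" is used because it appears as both a display name and a data_type in the tables, and the type names "Dataset" and "Dataset Data Type" are the ones used in this issue.

```python
from collections import defaultdict

# Hypothetical (term, node type) pairs for illustration only
ASSERTIONS = [
    ("Visium", "Dataset"),
    ("Visium", "Dataset Data Type"),
    ("salmon_rnaseq_bulk", "Dataset Data Type"),
    ("Bulk RNA-seq [Salmon]", "Dataset"),
]


def merge_by_term(assertions):
    """Collapse rows that share a term into one node carrying all its types."""
    types_by_term = defaultdict(set)
    for term, node_type in assertions:
        types_by_term[term].add(node_type)
    return dict(types_by_term)


nodes = merge_by_term(ASSERTIONS)
# Terms that ended up with more than one node type
multi_typed = sorted(term for term, types in nodes.items() if len(types) > 1)
```

Because a term can appear only once, the two roles of "Visium" collapse into a single node with two types instead of two separate nodes.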
See Change Log for HUBMAP
Datasets in bold were not mentioned in the Pipeline Decision Rules spreadsheet.
Note: to obtain the current set of v1 data_type values, use this URL: https://ontology.api.hubmapconsortium.org/valueset?parent_sab=SENNET&parent_code=C004001&child_sabs=SENNET
v1 Display Name in Portal | data_type | alt-names | Primary/Processed | Soft Assay Dataset Type |
---|---|---|---|---|
Visium | Visium | none | Primary | Visium |
Bulk RNA-seq [Salmon] | salmon_rnaseq_bulk | none | Primary | RNASeq |
scRNA-seq (10x Genomics) [Salmon] | salmon_rnaseq_10x | none | Primary | RNASeq |
CODEX [Cytokit + SPRM] | codex_cytokit | none | Processed | CODEX |
CODEX [Cytokit + SPRM] | codex_cytokit_v1 | none | Processed | CODEX |
Multiplex Ion Beam Imaging | MIBI | Multiplex Ion Beam Imaging, mibi | Primary | MIBI |
LC-MS | LC-MS | none | Primary | LC-MS |
Lightsheet Microscopy | Lightsheet | none | Primary | Lightsheet |
Bulk RNA-seq | bulk-RNA | none | Primary | RNASeq |
snATAC-seq | snATAC-seq | none | Primary | ATACSeq |
snRNA-seq | snRNA-seq | none | Primary | RNASeq |
H&E Slide Staining | Stained Slides | none | Primary | Histology |
Multiplex Ion Beam Imaging [DeepCell + SPRM] | mibi_deepcell | none | Processed | MIBI |
scRNA-seq (10x Genomics v2) | scRNAseq-10xGenomics-v2 | none | Primary | RNASeq |
scRNA-seq (10x Genomics v3) | scRNAseq-10xGenomics-v3 | scRNA-Seq(10xGenomics), scRNA-Seq-10x, scRNAseq-10xGenomics | Primary | RNASeq |
snATAC-seq [SnapATAC] | sn_atac_seq | sn_atac_seq_multiome_10x | Processed | ATACSeq |
snRNA-seq (10x Genomics v3) | snRNAseq-10xGenomics-v3 | snRNAseq, snRNAseq-v3 | Primary | RNASeq |
snRNA-seq [Salmon] | salmon_sn_rnaseq_10x | none | Primary | RNASeq |
v1 Display Name in Portal | data_type | alt-names | Primary/Processed |
---|---|---|---|
CITE-Seq | CITE-Seq | none | Primary |
CosMX (RNA) | CosMX (RNA) | none | Primary |
DBiT-seq | DBiT-seq | none | Primary |
FACS - Fluorescence-activated Cell Sorting | FACS - Fluorescence-activated Cell Sorting | none | Primary |
GeoMX (RNA) | GeoMX (RNA) | none | Primary |
Mint-ChIP | Mint-ChIP | none | Primary |
SASP | SASP | none | Primary |
image_pyramid | image_pyramid | none | n/a |
Publication | publication | none | n/a |
Publication Ancillary | publication_ancillary | none | n/a |
In v1, these would have been considered values of data_type for derived datasets. A new Dataset node in UBKG would have been created, mapped to a Dataset Data Type node with value=the new pipeline.
The plan is not to add information on these pipelines to UBKG.
Pipeline | Dataset Type |
---|---|
salmon_rnaseq_10x_v2 | RNASeq |
salmon_rnaseq_10x_v2_sn | RNASeq |
sc_atac_seq_sn | ATACSeq |
These have not yet been defined in UBKG for SENNET.
UBKG CSVs containing new soft-assay hierarchy in Globus.
Statement of problem
Datasets are being reorganized to support dynamic pipeline assignment, also known as "soft assay type".
Reference
In the old infrastructure ("hard assay types", I guess), a static data_type/assay_type was created for every new combination of primary dataset and pipeline. For example, salmon_rnaseq_10x_v2 is the data_type for the result of pipeline processing of primary datasets with data_type scRNAseq-10xGenomics-v2.
In the new infrastructure,
Proposed solution
Optional: