Requests

The UBKG should be the source of truth for the assay metadata currently managed in the value object of the soft assay testing rules JSON, e.g.,

{ "type": "match", "match": "not_dcwg and is_primary and assay_type in ['CODEX']", "value": "{'assaytype': 'CODEX', 'dir-schema': 'codex-v1', 'tbl-schema': 'codex-v'+version.to_str, 'vitessce-hints': [], 'contains-pii': false, 'primary': true, 'description': 'CODEX', 'dataset-type': 'CODEX' }", "rule_description": "non-DCWG primary CODEX" }
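As a rough illustration of why this metadata is awkward to consume directly from the rules file, the sketch below loads the rule above with Python's standard json module. The outer object parses as JSON, but value is a string holding an expression (note the 'codex-v'+version.to_str concatenation and the single-quoted keys), so it cannot be read as plain data outside the rule engine. The variable names are illustrative only.

```python
import json

# The CODEX rule shown above, copied verbatim into a raw string.
RULE_TEXT = r'''{ "type": "match", "match": "not_dcwg and is_primary and assay_type in ['CODEX']", "value": "{'assaytype': 'CODEX', 'dir-schema': 'codex-v1', 'tbl-schema': 'codex-v'+version.to_str, 'vitessce-hints': [], 'contains-pii': false, 'primary': true, 'description': 'CODEX', 'dataset-type': 'CODEX' }", "rule_description": "non-DCWG primary CODEX" }'''

# The outer rule object is ordinary JSON.
rule = json.loads(RULE_TEXT)

# The "value" member is a string, not a nested object. Trying to parse it
# as JSON fails because it is an expression, not a JSON document.
try:
    json.loads(rule["value"])
    value_is_plain_json = True
except json.JSONDecodeError:
    value_is_plain_json = False

print(sorted(rule.keys()))
print(type(rule["value"]).__name__, value_is_plain_json)
```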

The UBKG should organize dataset types per the schema described in Figure 2 of the Nature perspective paper. Details are here.
The UBKG should support annotation of datasets in terms of measurement assays that can be cross-referenced to OBI terms (Column F in the spreadsheet).

Solution

The UBKG SimpleKnowledge source for HuBMAP should be modified to support the model expressed in this diagram.

This diagram models only CODEX assay information. We think that the CODEX rules are the most complex and contain all of the appropriate metadata.

How to interpret the model diagram

All nodes are encoded. The diagram shows the term. For example, the node with term DCWG CODEX would actually have a code like HUBMAP:C000099. Relationships are isa unless indicated otherwise. For example, DCWG CODEX isa rule-based dataset and DCWG CODEX has_dataset_type CODEX.
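The reading of the diagram described above can be sketched as a small set of triples. This is only an illustration in Python, not the SimpleKnowledge format: the Triple type and helper functions are hypothetical, and every code except HUBMAP:C000099 (which the text itself offers only as an example of what a code looks like) is a made-up placeholder.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # term of the subject node
    predicate: str  # relationship; "isa" is the default in the diagram
    obj: str        # term of the object node


# Term -> code. Only HUBMAP:C000099 appears in the text (as an example);
# the other codes are placeholders invented for this sketch.
CODES = {
    "DCWG CODEX": "HUBMAP:C000099",
    "rule-based dataset": "HUBMAP:C000001",  # hypothetical code
    "CODEX": "HUBMAP:C000002",               # hypothetical code
}

# The two assertions given as examples in the text.
TRIPLES = [
    Triple("DCWG CODEX", "isa", "rule-based dataset"),
    Triple("DCWG CODEX", "has_dataset_type", "CODEX"),
]


def objects_of(term: str, predicate: str) -> list[str]:
    """Return the object terms related to `term` by `predicate`."""
    return [t.obj for t in TRIPLES if t.subject == term and t.predicate == predicate]


def encode(triple: Triple) -> tuple[str, str, str]:
    """Swap the displayed terms for their codes, as stored in the graph."""
    return (CODES[triple.subject], triple.predicate, CODES[triple.obj])


print(objects_of("DCWG CODEX", "has_dataset_type"))
print(encode(TRIPLES[0]))
```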

API

The hs-ontology-api endpoints for datasets, assaytype, and assayname will need to be refactored or replaced entirely.

CEDAR integration

The CEDAR/HMFIELD ingestions will likely need to be updated to match the new model.

AlanSimmons commented 1 week ago

Inconsistencies and missing data in testing_rule_chain.json

There are inconsistencies or gaps in the content of the testing_rule_chain.json file. The UBKG addresses these inconsistencies or gaps using business rules.

Derived classifications do not contain a dataset_type key. Other keys, such as vitessce_hints, are present even if there are no values. The UBKG assigns to a derived classification the dataset_type for the associated primary classification.
Assays not listed in the Fig 2 table and/or the Pipeline Decision Rules document:
- DESI
- IMC3D
- Labeled LC-MS
- Label-free LC-MS/MS
- Labeled LC-MS/MS
- Untargeted LC-MS
Primary assays are always assigned a measurement assay (mapped to OBI).
NanoPOTS has a dataset_type of UNKNOWN, which is not coded in HRAVS. Mapped to Unknown (HUBMAP:C015001, cross-referenced to UMLS: C0439673).
MxIF has dataset_type of UNKNOWN.
Publication has no dataset_type.
SIMS-IMS does not have an OBI code in the Fig 2 mapping. The measurement assay was mapped to UMLS:C0242851.
WGS mapped to OBI:0002117.
MS mapped to OBI:0000470.
GeoMx mapped to NCIT:C181933.
10x Multiome mapped to EFO:0030059.
PhenoCycler mapped to EFO:0700002.
CyCIF mapped to NCIT:C181929.
MERFISH mapped to EFO:0008992.
The non-DCWG primary DESI rule-based dataset has DESI as assaytype; DCWG DESI-IMS has DESI-IMS.
nanoSPLITS mapped to OBI:0003102, which is technically nanoPOTS.
Confocal Microscopy mapped to NCIT:C17753.
Enhanced Stimulated Raman Spectroscopy mapped to NCIT:C17157.
No measurement assay for Molecular Cartography.
DCWG 10x-multiome has assaytype=10x-multiome and derived multiome 10x has assaytype=multiome-10x. Using 10x-multiome.
No measurement assay for MUSIC.
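The first business rule above (a derived classification takes the dataset_type of its associated primary classification) might look roughly like the following. This is a minimal sketch, not the UBKG ETL code: the name and primary_name keys, and the way a derived classification points at its primary, are assumptions made for illustration, as is the "derived CODEX" record.

```python
def fill_dataset_type(classifications: list[dict]) -> list[dict]:
    """Return copies of the classifications in which any record lacking
    dataset_type inherits it from its primary classification, if known."""
    by_name = {c["name"]: c for c in classifications}
    filled = []
    for original in classifications:
        record = dict(original)  # do not mutate the caller's data
        # "primary_name" is a hypothetical link from derived to primary.
        primary = by_name.get(record.get("primary_name"))
        if "dataset_type" not in record and primary and "dataset_type" in primary:
            record["dataset_type"] = primary["dataset_type"]
        filled.append(record)
    return filled


# Illustrative records only; they are not taken from testing_rule_chain.json.
rules = [
    {"name": "non-DCWG primary CODEX", "dataset_type": "CODEX"},
    {"name": "derived CODEX", "primary_name": "non-DCWG primary CODEX",
     "vitessce_hints": []},
]

result = fill_dataset_type(rules)
print(result[1]["dataset_type"])
```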

AlanSimmons commented 1 week ago

Change to model: contains_pii is at assay level

In the current rule engine, the contains_pii metadata is linked to the rule-based dataset at the level of dataset--e.g., for the rule-based dataset non-DCWG primary SNARE-ATACseq2, contains_pii = true.

This only works for HuBMAP, in which the source of all genetic information is human and is therefore subject to Common Rule privacy restrictions.

SenNet will include samples from a variety of sources, including murine and organoid. The source will need to be considered when evaluating whether a rule-based dataset contains PII.

To account for this, we will make the following model changes:

The measurement assay will have an assertion of contains full_genetic_sequences. For the example of non-DCWG primary SNARE-ATACseq2, the measurement assay is SNARE-ATACseq.
Business logic in the rule will consider sample source.
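A minimal sketch of the proposed logic follows, assuming that the assay-level assertion is available as a boolean and that the sample source arrives as a plain string. Treating only a source of "human" as PII-bearing is an assumption of this sketch. In particular, how organoid sources should be classified is not settled in this thread. The function and parameter names are hypothetical.

```python
def contains_pii(assay_has_full_genetic_sequences: bool, source_type: str) -> bool:
    """Decide PII status from the assay-level assertion plus the sample source.

    Assumption: only human-sourced samples can yield PII. Other source
    types (e.g. murine) never do in this sketch.
    """
    is_human = source_type.strip().lower() == "human"
    return assay_has_full_genetic_sequences and is_human


# SNARE-ATACseq is the measurement assay named in the example above.
print(contains_pii(True, "Human"))   # human sample, sequencing assay
print(contains_pii(True, "murine"))  # same assay, non-human source
print(contains_pii(False, "human"))  # assay without full genetic sequences
```

Keeping the assertion on the measurement assay and the source check in the rule means the same assay node can serve both HuBMAP and SenNet without duplicating rule-based datasets per source.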

AlanSimmons commented 5 days ago

Fig 2 mappings for new dataset types

New assays have been integrated into HuBMAP since the publication of the Nature paper. Following are Fig 2 mappings for dataset types that were not in the original paper.

Dataset type	Fig 2 Aggregated assay type	Fig 2 Modality	Fig 2 Category
Histology	Histology	Brightfield microscopy	imaging
Molecular Cartography	Spatial Transcriptomics	imaging
10x Multiome	Single-cell multiomics	single-cell
confocal	Label free imaging	imaging
CosMx	Spatial Transcriptomics	imaging
CyCIF	Antibody-based imaging	imaging
DBiT	Single-cell multiomics	single-cell
DESI	MS-based imaging	imaging
Enhanced Stimulated Raman Spectroscopy (SRS)	Label-free imaging	imaging
GeoMx (nCounter)	Single-cell multiomics	single-cell
GeoMx (NGS)	Single-cell multiomics	single-cell
HiFi-Slide	Spatial Transcriptomics	imaging
nanoSPLITS	MS-based imaging	imaging
PhenoCycler	Antibody-based imaging	imaging
RNAseq (with probes)	RNASeq	Transcriptomics	bulk
Second Harmonic Generation	Label free imaging	imaging
Thick section Multiphoton MxIF	Antibody-based imaging	imaging
Visium (no probes)	Spatial Transcriptomics	imaging
Visium (with probes)	Spatial Transcriptomics	imaging
Xenium	Spatial Transcriptomics	imaging
Visium (used in SenNet)	Spatial Transcriptomics	imaging
MUSIC	Transcriptomics	bulk
DART-Fish	Spatial Transcriptomics	imaging
Slideseq	Single-cell omics	single-cell
MERFISH	Spatial Transcriptomics	imaging
3D Imaging Mass Cytometry	LC-MS	Proteomics	bulk

AlanSimmons commented 5 days ago

Case differences in assaytype, dataset_type, and description

Each rule-based dataset has three properties that are similar:

assaytype, which corresponds to the workflow key
dataset_type, which is a categorization
description, which is the display text for datasets of the type indicated by the rule-based dataset.

In the testing rules, the properties can use the same text string, but with different cases. For example, in non-DCWG primary seqFish, assaytype is "seqFish", while dataset_type and description are both "seqFISH".

The SimpleKnowledge spreadsheet that is the source of the UBKG ontology requires unique terms for codes. In addition, the SimpleKnowledge spreadsheet rules are case-insensitive. This means, for example, that there cannot be codes with terms "seqFISH" and "seqFish".
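The constraint can be illustrated with a short check that groups terms by their case-folded form and reports any group with more than one spelling. This is only a sketch of the idea in Python. It is not how SimpleKnowledge itself validates terms, and the function name is hypothetical.

```python
from collections import defaultdict


def case_collisions(terms: list[str]) -> dict[str, list[str]]:
    """Map each case-folded term to its spellings, keeping only those
    that occur with more than one distinct spelling."""
    groups: dict[str, set[str]] = defaultdict(set)
    for term in terms:
        groups[term.casefold()].add(term)
    return {key: sorted(variants) for key, variants in groups.items()
            if len(variants) > 1}


# "seqFish" (assaytype) and "seqFISH" (dataset type, description) collide.
print(case_collisions(["seqFish", "seqFISH", "CODEX"]))
# → {'seqfish': ['seqFISH', 'seqFish']}
```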

The case issue is a factor in a number of rule-based datasets. The usual manifestation of the issue involves one of the three terms differing from the other two.

Workaround

To allow for multiple terms with the same text but different case, an appropriate appendix is applied to the term that needs a different case. Any query that works with the term will need to strip the appendix.

For example, for the seqFISH rule, the assaytype term is set to "seqFish_assaytype". The query that returns the assaytype term uses REPLACE to strip the _assaytype appendix.

The other two possible appendices are for dataset_type and description (for example, _description).

Rules where the appendices are used:
Rule	Term with appendix
non-DCWG primary seqFish	seqFish_assaytype
DCWG phenocycler	phenocycler_assaytype
DCWG cycif	cycif_assaytype
DCWG merfish	merfish_assaytype
DCWG confocal	confocal_assaytype
DCWG 10x-multiome	10x Multiome_description
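The stripping step can be sketched as follows. The thread describes doing this with REPLACE inside the query. The Python below is a stand-in for illustration, and the function name is hypothetical. Only the two appendices that appear in the table are listed, and a dataset type appendix would be added to the tuple in the same way. Unlike a plain replace, this removes the appendix only when it is at the end of the term.

```python
# Appendices seen in the table above.
APPENDICES = ("_assaytype", "_description")


def strip_appendix(term: str) -> str:
    """Remove a trailing appendix, if present, restoring the original term."""
    for appendix in APPENDICES:
        if term.endswith(appendix):
            return term[: -len(appendix)]
    return term


for stored in ("seqFish_assaytype", "10x Multiome_description", "CODEX"):
    print(stored, "->", strip_appendix(stored))
```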

x-atlas-consortia / ubkg-etl

update to UBKG model for soft assay #148