Open AlanSimmons opened 1 week ago
There are inconsistencies or gaps in the content of the testing_rule_chain.json file. The UBKG addresses these inconsistencies or gaps using business rules.
In the current rule engine, the contains_pii metadata is set at the level of the rule-based dataset. For example, for the rule-based dataset non-DCWG primary SNARE-ATACseq2, contains_pii = true.
This works only for HuBMAP, in which the source of all genetic information is human and therefore subject to Common Rule privacy restrictions.
SenNet will include samples from a variety of sources, including murine and organoid. The source will need to be considered when evaluating whether a rule-based dataset contains PII.
To account for this, we will make the model change:
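As a rough illustration of a source-aware evaluation (not the model change itself), the sketch below combines the rule-level flag with the source type. The function name, the `source_type` parameter, and the source labels are assumptions made for this example; they do not come from the rule engine or the UBKG.

```python
# Hypothetical source labels; the real vocabulary may differ.
HUMAN_SOURCE_TYPES = {"human"}


def dataset_contains_pii(rule_contains_pii: bool, source_type: str) -> bool:
    """Return True only when the rule flags PII AND the source is human.

    rule_contains_pii: the contains_pii value attached to the rule-based dataset.
    source_type: hypothetical label for the sample source (e.g. "human", "murine").
    """
    return rule_contains_pii and source_type.strip().lower() in HUMAN_SOURCE_TYPES


if __name__ == "__main__":
    # A human sample under a rule with contains_pii = true keeps the flag.
    print(dataset_contains_pii(True, "Human"))
    # The same rule applied to a murine sample would not be treated as PII.
    print(dataset_contains_pii(True, "murine"))
```

The design choice here is that the rule-level flag stays as it is and the source acts only as an additional condition, so existing HuBMAP rules would evaluate the same way as before.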
New assays have been integrated into HuBMAP since the publication of the Nature paper. The following are the Fig 2 mappings for dataset types that were not in the original paper.
Dataset type | Fig 2 Aggregated assay type | Fig 2 Modality | Fig 2 Category |
---|---|---|---|
Histology | Histology | Brightfield microscopy | imaging |
Molecular Cartography | Spatial Transcriptomics | imaging | |
10x Multiome | Single-cell multiomics | single-cell | |
confocal | Label-free imaging | imaging | |
CosMx | Spatial Transcriptomics | imaging | |
CyCIF | Antibody-based imaging | imaging | |
DBiT | Single-cell multiomics | single-cell | |
DESI | MS-based imaging | imaging | |
Enhanced Stimulated Raman Spectroscopy (SRS) | Label-free imaging | imaging | |
GeoMx (nCounter) | Single-cell multiomics | single-cell | |
GeoMx (NGS) | Single-cell multiomics | single-cell | |
HiFi-Slide | Spatial Transcriptomics | imaging | |
nanoSPLITS | MS-based imaging | imaging | |
PhenoCycler | Antibody-based imaging | imaging | |
RNAseq (with probes) | RNASeq | Transcriptomics | bulk |
Second Harmonic Generation | Label-free imaging | imaging | |
Thick section Multiphoton MxIF | Antibody-based imaging | imaging | |
Visium (no probes) | Spatial Transcriptomics | imaging | |
Visium (with probes) | Spatial Transcriptomics | imaging | |
Xenium | Spatial Transcriptomics | imaging | |
Visium (used in SenNet) | Spatial Transcriptomics | imaging | |
MUSIC | Transcriptomics | bulk | |
DART-Fish | Spatial Transcriptomics | imaging | |
Slideseq | Single-cell omics | single-cell | |
MERFISH | Spatial Transcriptomics | imaging | |
3D Imaging Mass Cytometry | LC-MS | Proteomics | bulk |
Each rule-based dataset has three properties that are similar: assaytype, dataset-type, and description.
In the testing rules, the properties can use the same text string, but with different cases. For example, in non-DCWG primary seqFish, assaytype is "seqFish" and dataset-type and description are both "seqFISH".
The SimpleKnowledge spreadsheet that is the source of the UBKG ontology requires unique terms for codes. In addition, the SimpleKnowledge spreadsheet rules are case-insensitive. This means, for example, that there cannot be codes with terms "seqFISH" and "seqFish".
The case issue is a factor in a number of rule-based datasets. The usual manifestation of the issue involves one of the three terms differing from the other two.
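The case-insensitive uniqueness constraint can be checked mechanically. The following is a minimal sketch, assuming the candidate terms are available as a plain list of strings; the function name is hypothetical and this is not part of the SimpleKnowledge tooling.

```python
from collections import defaultdict


def find_case_collisions(terms):
    """Group terms that are identical when case is ignored.

    Returns a dict mapping the case-folded term to the sorted list of
    distinct spellings, for every group with more than one spelling.
    """
    groups = defaultdict(set)
    for term in terms:
        groups[term.casefold()].add(term)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


if __name__ == "__main__":
    # "seqFish" and "seqFISH" collide under a case-insensitive rule; "CODEX" does not.
    print(find_case_collisions(["seqFish", "seqFISH", "CODEX"]))
```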
To allow for multiple terms with the same text but different case, an appendix (suffix) naming the property is added to the term that needs a different case. Any query that works with the term must strip the appendix.
For example, for the seqFISH rule, the assaytype term is set to "seqFish_assaytype". The query that returns the assaytype term uses REPLACE to strip the _assaytype appendix.
The other two possible appendices are _datasettype and _description.
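The stripping step amounts to a simple string operation. The query itself uses REPLACE; the Python sketch below only shows the equivalent logic, with a hypothetical helper name and the three appendix strings as read from the text above.

```python
# Appendix strings as described in the text; treated here as an assumption.
APPENDICES = ("_assaytype", "_datasettype", "_description")


def strip_appendix(term: str) -> str:
    """Remove a trailing appendix from a term, if one is present."""
    for appendix in APPENDICES:
        if term.endswith(appendix):
            return term[: -len(appendix)]
    return term


if __name__ == "__main__":
    print(strip_appendix("seqFish_assaytype"))         # seqFish
    print(strip_appendix("10x Multiome_description"))  # 10x Multiome
    print(strip_appendix("CODEX"))                     # CODEX (unchanged)
```

Unlike a plain replace, this version removes the appendix only at the end of the term, so a term that happens to contain the same text elsewhere is left intact.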
Rules where the appendices are used:

Rule | Term with appendix |
---|---|
non-DCWG primary seqFish | seqFish_assaytype |
DCWG phenocycler | phenocycler_assaytype |
DCWG cycif | cycif_assaytype |
DCWG merfish | merfish_assaytype |
DCWG confocal | confocal_assaytype |
DCWG 10x-multiome | 10x Multiome_description |
I propose a new UBKG model to address three separate requests.
Requests
    {
      "type": "match",
      "match": "not_dcwg and is_primary and assay_type in ['CODEX']",
      "value": "{'assaytype': 'CODEX', 'dir-schema': 'codex-v1', 'tbl-schema': 'codex-v'+version.to_str, 'vitessce-hints': [], 'contains-pii': false, 'primary': true, 'description': 'CODEX', 'dataset-type': 'CODEX' }",
      "rule_description": "non-DCWG primary CODEX"
    }
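For reference, a rule of this shape loads as ordinary JSON. Note that the value field is a string holding an expression (it includes 'codex-v'+version.to_str), not nested JSON, so it cannot be read as a dictionary without evaluating it in the rule engine. The sketch below only loads the outer object; the helper name is hypothetical.

```python
import json

# The CODEX rule quoted above, kept verbatim.
RULE_JSON = """
{
  "type": "match",
  "match": "not_dcwg and is_primary and assay_type in ['CODEX']",
  "value": "{'assaytype': 'CODEX', 'dir-schema': 'codex-v1', 'tbl-schema': 'codex-v'+version.to_str, 'vitessce-hints': [], 'contains-pii': false, 'primary': true, 'description': 'CODEX', 'dataset-type': 'CODEX' }",
  "rule_description": "non-DCWG primary CODEX"
}
"""


def load_rule(text: str) -> dict:
    """Parse one rule from the rule chain into a dict of its top-level fields."""
    return json.loads(text)


if __name__ == "__main__":
    rule = load_rule(RULE_JSON)
    print(rule["rule_description"])     # non-DCWG primary CODEX
    print(type(rule["value"]).__name__)  # str
```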
Solution
The UBKG SimpleKnowledge source for HuBMAP should be modified to support the model expressed in this diagram.
This diagram models only CODEX assay information. We think that the CODEX rules are the most complex and contain all of the appropriate metadata.
How to interpret the model diagram
All nodes are encoded. The diagram shows the term. For example, the node with term DCWG CODEX would actually have a code like HUBMAP:C000099. Relationships are isa unless indicated otherwise. For example, DCWG CODEX isa rule-based dataset and DCWG CODEX has_dataset_type CODEX.
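The two example statements can be written as subject/predicate/object triples over codes, with terms looked up separately. This is a minimal sketch of that reading only: HUBMAP:C000099 is the illustrative code from the text, and the other two codes are invented for the example.

```python
# Code -> term lookup. Only HUBMAP:C000099 appears in the text (as an example code);
# the other two codes are hypothetical placeholders.
TERMS = {
    "HUBMAP:C000099": "DCWG CODEX",
    "HUBMAP:C000001": "rule-based dataset",  # hypothetical code
    "HUBMAP:C000002": "CODEX",               # hypothetical code
}

# Relationships between codes, mirroring the two examples in the text.
TRIPLES = [
    ("HUBMAP:C000099", "isa", "HUBMAP:C000001"),
    ("HUBMAP:C000099", "has_dataset_type", "HUBMAP:C000002"),
]


def describe(code: str) -> list:
    """Render every relationship whose subject is `code`, using terms instead of codes."""
    return [
        f"{TERMS[subject]} {predicate} {TERMS[obj]}"
        for subject, predicate, obj in TRIPLES
        if subject == code
    ]


if __name__ == "__main__":
    for line in describe("HUBMAP:C000099"):
        print(line)
```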
API
The hs-ontology-api endpoints for datasets, assaytype, and assayname will need to be refactored or replaced entirely.
CEDAR integration
The CEDAR/HMFIELD ingestions will likely need to be updated to match the new model.