x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

update to UBKG model for soft assay #148

Open AlanSimmons opened 1 week ago

AlanSimmons commented 1 week ago

I propose a new UBKG model to address three separate requests.

Requests

  1. The UBKG should be the source of truth for the assay metadata currently managed in the value object of the soft assay testing rules JSON. For example:

    {
      "type": "match",
      "match": "not_dcwg and is_primary and assay_type in ['CODEX']",
      "value": "{'assaytype': 'CODEX', 'dir-schema': 'codex-v1', 'tbl-schema': 'codex-v'+version.to_str, 'vitessce-hints': [], 'contains-pii': false, 'primary': true, 'description': 'CODEX', 'dataset-type': 'CODEX' }",
      "rule_description": "non-DCWG primary CODEX"
    }

  2. The UBKG should organize dataset types per the schema described in Figure 2 of the Nature perspective paper. Details are here.
  3. The UBKG should support annotation of datasets in terms of measurement assays that can be cross-referenced to OBI terms (Column F in the spreadsheet).

Solution

The UBKG SimpleKnowledge source for HuBMAP should be modified to support the model expressed in this diagram.

This diagram models only CODEX assay information. We think that the CODEX rules are the most complex and contain all of the appropriate metadata.

How to interpret the model diagram

All nodes are encoded; the diagram shows each node's term rather than its code. For example, the node with term DCWG CODEX would actually have a code like HUBMAP:C000099. Relationships are isa unless indicated otherwise. For example, DCWG CODEX isa rule-based dataset and DCWG CODEX has_dataset_type CODEX.
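
The reading rules above can be sketched as a small set of triples. This is only an illustration: HUBMAP:C000099 is the example code from the text, while the other codes, the `Node` class, and the `relationships_of` helper are hypothetical and not part of the UBKG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    code: str  # what the graph stores
    term: str  # what the diagram displays


# HUBMAP:C000099 is the example from the text; the other codes are placeholders.
dcwg_codex = Node("HUBMAP:C000099", "DCWG CODEX")
rule_based_dataset = Node("HUBMAP:PLACEHOLDER_1", "rule-based dataset")
codex = Node("HUBMAP:PLACEHOLDER_2", "CODEX")

# (subject, predicate, object); an unlabeled edge in the diagram means isa.
triples = [
    (dcwg_codex, "isa", rule_based_dataset),
    (dcwg_codex, "has_dataset_type", codex),
]


def relationships_of(node, triples):
    """List (predicate, object term) pairs for edges leaving the given node."""
    return [(pred, obj.term) for subj, pred, obj in triples if subj == node]


print(relationships_of(dcwg_codex, triples))
```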

[image: UBKG model diagram for CODEX assay information]

API

The hs-ontology-api endpoints for datasets, assaytype, and assayname will need to be refactored or replaced entirely.

CEDAR integration

The CEDAR/HMFIELD ingestions will likely need to be updated to match the new model.

AlanSimmons commented 1 week ago

Inconsistencies and missing data in testing_rule_chain.json

There are inconsistencies or gaps in the content of the testing_rule_chain.json file. The UBKG addresses these inconsistencies or gaps using business rules.

  1. Derived classifications do not contain a dataset_type key. Other keys, such as vitessce_hints, are present even if there are no values. The UBKG assigns to a derived classification the dataset_type for the associated primary classification.
  2. Assays not listed in the Fig 2 table and/or the Pipeline Decision Rules document:
    • DESI
    • IMC3D
    • Labeled LC-MS
    • Label-free LC-MS/MS
    • Labeled LC-MS/MS
    • Untargeted LC-MS
  3. Primary assays are always assigned a measurement assay (mapped to OBI).
  4. NanoPOTS has a dataset_type of UNKNOWN, which is not coded in HRAVS. It is mapped to Unknown (HUBMAP:C015001, cross-referenced to UMLS:C0439673).
  5. MxIF has dataset_type of UNKNOWN.
  6. Publication has no dataset_type.
  7. SIMS-IMS does not have an OBI code in the Fig 2 mapping. The measurement assay was mapped to UMLS:C0242851.
  8. WGS mapped to OBI:0002117.
  9. MS mapped to OBI:0000470.
  10. GeoMx mapped to NCIT:C181933.
  11. 10x Multiome mapped to EFO:0030059.
  12. PhenoCycler mapped to EFO:0700002.
  13. CyCIF mapped to NCIT:C181929.
  14. MERFISH mapped to EFO:0008992.
  15. The non-DCWG primary DESI rule-based dataset has DESI as assaytype; DCWG DESI-IMS has DESI-IMS.
  16. nanoSPLITS mapped to OBI:0003102, which is technically nanoPOTS.
  17. Confocal Microscopy mapped to NCIT:C17753.
  18. Enhanced Stimulated Raman Spectroscopy mapped to NCIT:C17157.
  19. No measurement assay for Molecular Cartography.
  20. DCWG 10x-multiome has assaytype=10x-multiome and derived multiome 10x has assaytype=multiome-10x. The UBKG uses 10x-multiome for both.
  21. No measurement assay for MUSIC.
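
The business rule in item 1 (a derived classification takes the dataset type of its associated primary classification) could look roughly like the sketch below. The key name `dataset-type` follows the CODEX example rule; the `primary_assaytype` link field and the sample dictionaries are hypothetical, because the issue does not say how a derived classification points at its primary.

```python
# Key name as it appears in the CODEX example rule.
DATASET_TYPE_KEY = "dataset-type"


def resolve_dataset_type(classification, primaries_by_assaytype):
    """Return the classification's own dataset type, or inherit it from its primary."""
    if DATASET_TYPE_KEY in classification:
        return classification[DATASET_TYPE_KEY]
    # Hypothetical link field: the assaytype of the associated primary classification.
    primary = primaries_by_assaytype[classification["primary_assaytype"]]
    return primary[DATASET_TYPE_KEY]


# Illustrative data only; not copied from testing_rule_chain.json.
primaries = {
    "CODEX": {"assaytype": "CODEX", "primary": True, "dataset-type": "CODEX"},
}
derived = {
    "assaytype": "derived-codex-example",
    "primary": False,
    "primary_assaytype": "CODEX",
}

print(resolve_dataset_type(derived, primaries))
```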

AlanSimmons commented 1 week ago

Change to model: contains_pii is at assay level

In the current rule engine, the contains_pii metadata is attached at the level of the rule-based dataset. For example, for the rule-based dataset non-DCWG primary SNARE-ATACseq2, contains_pii = true.

This works only for HuBMAP, where the source of all genetic information is human and is therefore subject to Common Rule privacy restrictions.

SenNet will include samples from a variety of sources, including murine and organoid. The source will need to be considered when evaluating whether a rule-based dataset contains PII.

To account for this, we will make the following model changes:

  1. The measurement assay will have an assertion of contains full_genetic_sequences. For the example of non-DCWG primary SNARE-ATACseq2, the measurement assay is SNARE-ATACseq.
  2. Business logic in the rule will consider sample source.
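
A minimal sketch of how the two changes might combine, assuming the sample source arrives as a plain string and that only human-sourced samples count toward PII. The function name, parameter names, and the source labels are all hypothetical; in particular, the issue does not say how organoid sources should be treated.

```python
def contains_pii(assay_has_full_genetic_sequences: bool, sample_source: str) -> bool:
    """Combine the assay-level assertion with the sample source.

    Assumption: a dataset contains PII only when the measurement assay yields
    full genetic sequences AND the sample source is human.
    """
    return assay_has_full_genetic_sequences and sample_source.strip().lower() == "human"


# SNARE-ATACseq is the measurement assay from the example in the text.
print(contains_pii(True, "human"))   # human source, sequencing assay
print(contains_pii(True, "murine"))  # same assay, non-human source
```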

AlanSimmons commented 5 days ago

Fig 2 mappings for new dataset types

New assays have been integrated into HuBMAP since the publication of the Nature paper. Following are Fig 2 mappings for dataset types that were not in the original paper.

| Dataset type | Fig 2 Aggregated assay type | Fig 2 Modality | Fig 2 Category |
| --- | --- | --- | --- |
| Histology | Histology | Brightfield microscopy | imaging |
| Molecular Cartography | | Spatial Transcriptomics | imaging |
| 10x Multiome | | Single-cell multiomics | single-cell |
| confocal | | Label free imaging | imaging |
| CosMx | | Spatial Transcriptomics | imaging |
| CyCIF | | Antibody-based imaging | imaging |
| DBiT | | Single-cell multiomics | single-cell |
| DESI | | MS-based imaging | imaging |
| Enhanced Stimulated Raman Spectroscopy (SRS) | | Label-free imaging | imaging |
| GeoMx (nCounter) | | Single-cell multiomics | single-cell |
| GeoMx (NGS) | | Single-cell multiomics | single-cell |
| HiFi-Slide | | Spatial Transcriptomics | imaging |
| nanoSPLITS | | MS-based imaging | imaging |
| PhenoCycler | | Antibody-based imaging | imaging |
| RNAseq (with probes) | RNASeq | Transcriptomics | bulk |
| Second Harmonic Generation | | Label free imaging | imaging |
| Thick section Multiphoton MxIF | | Antibody-based imaging | imaging |
| Visium (no probes) | | Spatial Transcriptomics | imaging |
| Visium (with probes) | | Spatial Transcriptomics | imaging |
| Xenium | | Spatial Transcriptomics | imaging |
| Visium (used in SenNet) | | Spatial Transcriptomics | imaging |
| MUSIC | | Transcriptomics | bulk |
| DART-Fish | | Spatial Transcriptomics | imaging |
| Slideseq | | Single-cell omics | single-cell |
| MERFISH | | Spatial Transcriptomics | imaging |
| 3D Imaging Mass Cytometry | LC-MS | Proteomics | bulk |

AlanSimmons commented 5 days ago

Case differences in assaytype, dataset_type, and description

Each rule-based dataset has three properties that are similar:

  1. assaytype, which corresponds to the workflow key
  2. dataset_type, which is a categorization
  3. description, which is the display text for datasets of the type indicated by the rule-based dataset.

In the testing rules, the properties can use the same text string, but with different cases. For example, in non-DCWG primary seqFish, assaytype is "seqFish" and dataset_type and description are both "seqFISH".

The SimpleKnowledge spreadsheet that is the source of the UBKG ontology requires unique terms for codes. In addition, the SimpleKnowledge spreadsheet rules are case-insensitive. This means, for example, that there cannot be codes with terms "seqFISH" and "seqFish".

The case issue is a factor in a number of rule-based datasets. The usual manifestation of the issue involves one of the three terms differing from the other two.
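
One way to find the affected rules is to group candidate terms by their case-folded form and report any group with more than one spelling. The sketch below is illustrative only; `case_collisions` is a hypothetical helper and is not part of the SimpleKnowledge tooling.

```python
from collections import defaultdict


def case_collisions(terms):
    """Return groups of terms that differ only by case, keyed by the folded form."""
    groups = defaultdict(set)
    for term in terms:
        groups[term.casefold()].add(term)
    # Keep only folded forms that have more than one distinct spelling.
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


# The seqFish / seqFISH pair is the example from the text.
print(case_collisions(["seqFish", "seqFISH", "CODEX"]))
```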

Workaround

To allow for multiple terms with the same text but different case, an appendix (a suffix naming the property) is added to the term that needs a different case. Any query that works with the term will need to strip the appendix.

For example, for the seqFISH rule, the assaytype term is set to "seqFish_assaytype". The query that returns the assaytype term uses REPLACE to strip the _assaytype appendix.

The other two possible appendices are _datasettype and _description.
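
For a client that receives the raw term, stripping could look like the sketch below. It assumes the three appendices are `_assaytype`, `_datasettype`, and `_description`, and it removes a trailing suffix only; this is a stand-in for the REPLACE call in the query, not a copy of it. `strip_appendix` is a hypothetical helper.

```python
# Assumed appendix spellings, based on the examples in this comment.
APPENDICES = ("_assaytype", "_datasettype", "_description")


def strip_appendix(term: str) -> str:
    """Remove a trailing workaround appendix from a term, if one is present."""
    for appendix in APPENDICES:
        if term.endswith(appendix):
            return term[: -len(appendix)]
    return term


print(strip_appendix("seqFish_assaytype"))
print(strip_appendix("10x Multiome_description"))
```

Terms without an appendix pass through unchanged, so the helper can be applied to every term a query returns.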

Rules where the appendices are used:

| Rule | Term with appendix |
| --- | --- |
| non-DCWG primary seqFish | seqFish_assaytype |
| DCWG phenocycler | phenocycler_assaytype |
| DCWG cycif | cycif_assaytype |
| DCWG merfish | merfish_assaytype |
| DCWG confocal | confocal_assaytype |
| DCWG 10x-multiome | 10x Multiome_description |