x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)

MIT License

Standardize descriptions of HuBMAP/SenNet dataset types via UBKG #138

AlanSimmons closed 2 months ago

AlanSimmons commented 3 months ago

Statement of problem

In some cases a dataset type has multiple alternative descriptions associated with it, i.e., strings that differ in spelling, capitalization, etc. As a result, a single dataset type appears as multiple facets in faceted search in the UI. Following is an example for seqFISH datasets:

This spreadsheet summarizes a gap analysis that identifies descriptions that need to be standardized. There are at least two Slack threads in the multi-assay private channel that discuss the issue, too.
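
To illustrate the kind of variation described above, here is a minimal sketch that groups description strings differing only in case or punctuation. The strings in `observed` and the helper names `normalize` and `group_variants` are hypothetical, made up for illustration; they are not taken from the spreadsheet or from any HuBMAP/SenNet code.

```python
import re
from collections import defaultdict


def normalize(description: str) -> str:
    """Case-fold and drop non-alphanumeric characters so that
    spelling/case variants of a description collapse to one key."""
    return re.sub(r"[^a-z0-9]+", "", description.casefold())


def group_variants(descriptions):
    """Return {normalized key: sorted distinct variants} for every key
    that has more than one distinct description string."""
    groups = defaultdict(set)
    for description in descriptions:
        groups[normalize(description)].add(description)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


# Hypothetical description strings, for illustration only.
observed = ["seqFISH", "seqFish", "SeqFISH", "Example Assay"]
variants = group_variants(observed)
print(variants)
```

Each group with more than one entry would show up as separate facets for what is really one dataset type. This normalization only catches case and punctuation differences; genuinely different spellings would still need manual review, as in the gap analysis.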

Proposed solution

Update the UBKG data to model the new set of dataset types, along with agreed standard descriptions.
Update endpoints in the UBKG API related to datasets to reflect the changes in the UBKG data.

AlanSimmons commented 3 months ago

Reason for problem: Analysis

tl;dr

The deployment of "soft assays" and the assay classifier introduced multiple sources of truth for dataset descriptions.

There are currently multiple sources of description for dataset types, including:

the Entity API
the assay classifier (aka Rules Engine)

Before the deployment of "soft assays", dataset metadata specific to the UI (including description) were managed in assay_types.yaml in the search-api repo. The YAML file assumed a static "data type" that corresponded to a key for processing workflows (also known as "assay type").

The HuBMAP/SenNet UBKG modeled and extended assay_types.yaml, as can be seen in the current datasets endpoint.

With the deployment of the Rules Engine, the UBKG stopped being a reliable source of truth for dataset metadata. Metadata such as descriptions for dataset types became a product of the Rules Engine.

AlanSimmons commented 3 months ago

Plan

I believe that this is mainly a task of adding new content within the existing UBKG data model, rather than changing the data model itself.

Compare current UBKG content against the gap analysis spreadsheet, with reference to the Rules Engine.
Update UBKG content.
Enhance UBKG API endpoints.
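
The comparison step in the plan above could be sketched roughly as follows. This assumes both the current UBKG content and the agreed standard descriptions have been exported to simple mappings of dataset type to description; that layout, the function name `diff_descriptions`, and all sample values are assumptions for illustration, not the actual UBKG schema or spreadsheet contents.

```python
def diff_descriptions(current, standard):
    """Compare current descriptions against the agreed standard ones.

    Both arguments are assumed to be dicts of dataset type -> description.
    Returns (to_update, to_add):
      to_update: {dataset_type: (current_description, standard_description)}
      to_add:    {dataset_type: standard_description}
    """
    to_update = {}
    to_add = {}
    for dataset_type, agreed in standard.items():
        if dataset_type not in current:
            to_add[dataset_type] = agreed
        elif current[dataset_type] != agreed:
            to_update[dataset_type] = (current[dataset_type], agreed)
    return to_update, to_add


# Hypothetical values, for illustration only.
current = {"type_a": "Type a", "type_b": "Type B"}
standard = {"type_a": "Type A", "type_b": "Type B", "type_c": "Type C"}

to_update, to_add = diff_descriptions(current, standard)
print(to_update)
print(to_add)
```

Entries in `to_update` correspond to descriptions that need standardizing, and entries in `to_add` to dataset types not yet modeled. Dataset types present only in the current content are deliberately ignored here; whether those should be retired is a separate decision.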

AlanSimmons commented 2 months ago

https://docs.google.com/spreadsheets/d/1PYe1NYZ8RXJ1_sp6C7Rjtock09x7mjppERpzIZU6S58/edit?usp=sharing

Analysis and plan