AlanSimmons closed 2 months ago
tl;dr
The deployment of "soft assays" and the assay classifier introduced multiple sources of truth for dataset descriptions.
There are currently multiple sources of description for dataset types, including:
Before the deployment of "soft assays", dataset metadata specific to the UI (including description) were managed in assay_types.yaml in the search-api repo. The YAML file assumed a static "data type" that corresponded to a key for processing workflows (also known as "assay type").
The HuBMAP/SenNet UBKG modeled and extended assay_types.yaml, as can be seen in the current datasets endpoint.
With the deployment of the Rules Engine, the UBKG stopped being a reliable source of truth for dataset metadata. Metadata such as descriptions for dataset types became a product of the Rules Engine.
I believe that this is mainly a task of adding new content to the existing UBKG data model, as opposed to enhancing the existing data model.
Statement of problem
In some cases, a dataset type has multiple alternative descriptions associated with it, i.e., strings that differ in spelling, capitalization, etc. As a result, a single dataset type appears as multiple facets in faceted search in the UI. Following is an example for seqFISH datasets:
This spreadsheet summarizes a gap analysis that identifies descriptions that need to be standardized. At least two Slack threads in the multi-assay private channel also discuss the issue.
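To make the problem concrete, here is a minimal sketch of how variant descriptions could be detected. It is not the method used for the gap analysis above. The function names and the sample strings are hypothetical, and the normalization rule (case-fold, then drop everything that is not a letter or digit) is an assumption. It catches differences in case, whitespace, and punctuation, but not genuine misspellings.

```python
from collections import defaultdict


def normalize(description: str) -> str:
    """Reduce a description to a comparison key.

    Assumed rule: case-fold, then keep only letters and digits, so that
    differences in case, spacing, and punctuation are ignored.
    """
    return "".join(ch for ch in description.casefold() if ch.isalnum())


def find_variant_groups(descriptions):
    """Group descriptions that share a key; return groups with more than one distinct string."""
    groups = defaultdict(set)
    for description in descriptions:
        groups[normalize(description)].add(description)
    # Only keys with more than one distinct spelling would yield duplicate facets.
    return {key: sorted(values) for key, values in groups.items() if len(values) > 1}


# Hypothetical sample values, not taken from the actual index.
sample = ["seqFISH", "seqFish", "SeqFISH", "CODEX", "codex ", "MALDI"]
variants = find_variant_groups(sample)

for key, spellings in variants.items():
    print(key, spellings)
```

Each group returned corresponds to a dataset type that would appear as more than one facet until its descriptions are standardized to a single string.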
Proposed solution