General

The overall ingestion logic differs between the old and new methods.

old method: each denormalized dataset data type (assay_type) is associated with a workflow, e.g., assay_1 uses workflow_1
new method: a data type derives from a base type plus conditions determined from the metadata, e.g., assay_1 is based on assay_0 + {metadata 1 value = X and metadata 2 value = Y}
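
The contrast between the two methods can be sketched in Python. This is only an illustration: the names `WORKFLOW_BY_ASSAY`, `DERIVED_TYPES`, and `derive_data_type`, along with the metadata keys and values, are hypothetical and are not taken from the rules engine.

```python
# Hypothetical illustration; all names and values here are assumptions.

# Old method: each assay type maps directly to a workflow.
WORKFLOW_BY_ASSAY = {"assay_1": "workflow_1"}

# New method: a data type is a base type plus metadata conditions.
DERIVED_TYPES = [
    {
        "name": "assay_1",
        "base": "assay_0",
        "conditions": {"metadata_1": "X", "metadata_2": "Y"},
    },
]


def derive_data_type(base_type, metadata):
    """Return the first derived type whose base and conditions all match.

    Falls back to the base type when no derived type applies.
    """
    for rule in DERIVED_TYPES:
        if rule["base"] != base_type:
            continue
        if all(metadata.get(k) == v for k, v in rule["conditions"].items()):
            return rule["name"]
    return base_type
```

With these assumed values, `derive_data_type("assay_0", {"metadata_1": "X", "metadata_2": "Y"})` resolves to `"assay_1"`, while metadata that misses either condition leaves the dataset at the base type.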

Tests are specified by a rules engine, which is a Python package.

Goals:

Store rules logic to the degree possible in UBKG
Obtain rule logic from UBKG

Rules engine

The rules engine implements logic via a set of chained tests. Rules are of two types:

"Get" - returns the result of a query
"Set" - modifies state for the rule

Very high-level example of a rule

Test 1: Is this dataset from the new (CEDAR) schema, or the old (pre-CEDAR) schema?
- test 1 logic
- test 1 result that sets state—e.g., old_style=true
Test 2: Old style assay type
- test 2 logic - uses state (if old_style=true...)
- test 2 result: old assay type
Test 3: New style template
- test 3 logic - uses combination of state (if old_style=false) and metadata (e.g., template name = x)
- test 3 result: new assay type
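
The three tests above could be chained roughly as follows. This is a minimal sketch, not the rules engine's actual API: the function names, the use of an `assay_type` field to detect old-style metadata, the `template_name` key, and the choice to stop at the first test that returns a result are all assumptions made for illustration.

```python
# Hypothetical sketch of chained tests; not the rules engine's actual API.

def check_schema(metadata, state):
    # "Set" step: record whether the dataset uses the old (pre-CEDAR) schema.
    # Assumption: old-style metadata carries an "assay_type" field.
    state["old_style"] = "assay_type" in metadata
    return None  # no final result yet; continue to the next test


def check_old_assay(metadata, state):
    # "Get" step that uses state: old-style datasets report assay_type directly.
    if state["old_style"]:
        return {"assay_type": metadata["assay_type"]}
    return None


def check_new_template(metadata, state):
    # "Get" step that combines state with metadata.
    if not state["old_style"] and metadata.get("template_name") == "x":
        return {"assay_type": "new_assay_x"}
    return None


def run_chain(tests, metadata):
    """Run tests in order with shared state; return the first non-None result."""
    state = {}
    for test in tests:
        result = test(metadata, state)
        if result is not None:
            return result
    return None


CHAIN = [check_schema, check_old_assay, check_new_template]
```

Because `state` is shared across the chain, later tests can branch on what earlier tests recorded without re-inspecting the metadata.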

Tests run in order. Test results can be in various formats, including JSON. Rule logic is expressed in a rule syntax. Some of the returns from rules may require value sets of some sort; the example that we discussed was the set of Vitessce hints.

UBKG - ETL

Rule configuration should be in a resource external to the rules engine. The expressed desire is to represent the rule logic as a graph, decomposed to the resolution of individual elements. For example, if a rule can be expressed as X = A AND (B OR C), then we would want nodes for X, A, B, and C, along with edges between X and A, A and B, etc. However, initially, we may have to store at lower resolution, e.g., a single node with "X = A AND (B OR C)".

The graph design must wait for more information. We need examples of what we would be representing, i.e., the results that rules return. The examples should span the possible range of returns, which means that we need to know more about the set of new datasets.

UBKG ETL would parse returns from the rules engine into edge (assertion) and node metadata files.

Potential issue: We discussed storing some results logic information as properties of nodes. UBKG ETL assumes a certain structure for node properties, i.e., a node can only have value, lowerbound, upperbound, and unit properties. If we define new properties for nodes related to rules logic, we might need to represent these as "property nodes", e.g., instead of a node property "color = blue", we define a blue node that isa color node and then link to the node with a "has_color" edge.
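
One possible shape for the decomposition and for "property nodes" is sketched below in Python. Since the graph design is still open, everything here is an assumption: the nested-tuple representation of a rule, the intermediate operator nodes (`AND_1`, `OR_2`), the edge labels `defined_by` and `has_operand`, and the function names. It is meant only to make the X = A AND (B OR C) and "color = blue" examples concrete.

```python
# Hypothetical sketch; node and edge labels are assumptions, not the UBKG schema.

def decompose(rule_name, expr):
    """Flatten a nested rule expression into node and edge lists.

    `expr` is either a string leaf such as "A" or a tuple of the form
    (operator, operand, operand, ...), e.g. ("AND", "A", ("OR", "B", "C")).
    """
    nodes, edges = [rule_name], []
    counter = 0

    def visit(parent, label, node_expr):
        nonlocal counter
        if isinstance(node_expr, str):
            # Leaf element: add the node once, then link it to its parent.
            if node_expr not in nodes:
                nodes.append(node_expr)
            edges.append((parent, label, node_expr))
            return
        # Operator: create a numbered intermediate node, then recurse.
        operator, *operands = node_expr
        counter += 1
        op_node = f"{operator}_{counter}"
        nodes.append(op_node)
        edges.append((parent, label, op_node))
        for operand in operands:
            visit(op_node, "has_operand", operand)

    visit(rule_name, "defined_by", expr)
    return nodes, edges


def property_to_edges(node, prop, value):
    """Express `prop = value` on `node` as a property node plus two edges."""
    return [
        (value, "isa", prop),             # e.g. blue isa color
        (node, f"has_{prop}", value),     # e.g. X has_color blue
    ]
```

Under these assumptions, `decompose("X", ("AND", "A", ("OR", "B", "C")))` yields six nodes (X, the two operator nodes, and A, B, C) and five edges, which could then be written to the node and edge files. The lower-resolution alternative would be a single node holding the whole expression string.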

UBKG-API

The UBKG-API will need endpoints that return results logic. At this time, we think that the primary consumer of these endpoints would be the rules engine. The UI would query the rules engine directly.

x-atlas-consortia / ubkg-neo4j

HuBMAP: UBKG support for new "soft" assay types #32
