x-atlas-consortia / ubkg-etl

A framework that combines data from the UMLS with assertions from other data sources into a set of CSV files that can be imported into neo4j to build a Unified Biomedical Knowledge Graph (UBKG)
MIT License

Turtle to OWLNETS issue: @prefix and namespaces #36

Closed AlanSimmons closed 1 year ago

AlanSimmons commented 1 year ago

Statement of Problem

Using PheKnowLator to process OWL files in Turtle serialization can introduce issues with namespaces (SABs) in the resulting OWLNETS files. This is an issue for any OWL file that is available in Turtle, including:

Details

PheKnowLator assumes RDF/XML as input. To work with an OWL file that is in another serialization, it is necessary first to convert to RDF/XML. The generation framework does this using the rdflib package.

For files in Turtle format, the generation framework parses the file in TTL and then serializes to XML.

import os
from rdflib import Graph

# Parse the Turtle file, write it out as RDF/XML, then reload the converted file.
graph = Graph().parse(owl_file, format='ttl')
convertedpath = os.path.join(owl_dir, 'converted.owl')
graph.serialize(destination=convertedpath, format='xml')
graph = Graph().parse(convertedpath, format='xml')

Turtle files contain a prefix section that associates portions of URIs with namespaces. Following are examples of prefixes from the NPO Turtle file:

@prefix AllenTransgenicLine: <http://api.brain-map.org/api/v2/data/TransgenicLine/> .
@prefix BFO: <http://purl.obolibrary.org/obo/BFO_> .
@prefix ILX: <http://uri.interlex.org/base/ilx_> .
@prefix ilxr: <http://uri.interlex.org/base/readable/> .
@prefix ilxtr: <http://uri.interlex.org/tgbugs/uris/readable/> .

When the Turtle file is serialized to XML, prefixed names are expanded to full IRIs, so the namespace prefixes are lost to PheKnowLator.

Example

Turtle

@prefix ILX: <http://uri.interlex.org/base/ilx_>
.
.
.
ILX:0101528 a owl:Class ;
    rdfs:label "CA2 alveus" ;
    rdfs:subClassOf UBERON:0002305,
        [ a owl:Restriction ;
            owl:onProperty ilx.partOf: ;
            owl:someValuesFrom UBERON:0007639 ] .

Translated RDF/XML

<rdf:Description rdf:about="http://uri.interlex.org/base/ilx_0101528">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
    <rdfs:label>CA2 alveus</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/UBERON_0002305"/>
    <rdfs:subClassOf rdf:nodeID="na9ff5cff0e4544b79582c69e889226d4b13"/>
</rdf:Description>

OWLNETS_edgelist.txt:

http://uri.interlex.org/base/ilx_0101528    http://www.w3.org/2000/01/rdf-schema#subClassOf http://purl.obolibrary.org/obo/UBERON_0002305

(PheKnowLator only translates OWL information relevant to knowledge graphs.)

If the Turtle prefix is an IRI that is similar to an OBO IRI (such as the Interlex IRI above), then it may be possible to define a namespace. However, prefixes such as @prefix ilxtr: <http://uri.interlex.org/tgbugs/uris/readable/> do not translate to an OBO equivalent.
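As a rough sketch of that distinction, the following Python checks whether the last path segment of an IRI has the OBO-like shape SAB_code and, if so, derives a namespace from it. The function name obo_style_code, the upper-casing of the namespace, the SAB:code output shape, and the local name exampleTerm are all assumptions made for illustration; none of them come from the framework.

```python
from typing import Optional


def obo_style_code(iri: str) -> Optional[str]:
    """Return 'SAB:code' if the last path segment looks like SAB_code, else None.

    Hypothetical helper: the output shape is an assumption, not the framework's format.
    """
    # Take the text after the final slash, e.g. 'UBERON_0002305'.
    last_segment = iri.rstrip('/').rsplit('/', 1)[-1]
    # Split on the first underscore into namespace and code.
    sab, sep, code = last_segment.partition('_')
    if not sep or not sab or not code:
        return None
    return f"{sab.upper()}:{code}"


obo_result = obo_style_code("http://purl.obolibrary.org/obo/UBERON_0002305")
ilx_result = obo_style_code("http://uri.interlex.org/base/ilx_0101528")
# 'exampleTerm' is a made-up local name under the ilxtr namespace.
ilxtr_result = obo_style_code("http://uri.interlex.org/tgbugs/uris/readable/exampleTerm")
```

The first two IRIs yield a namespace because they end in an underscore-delimited segment; the ilxtr IRI does not, which is why it needs an explicit mapping.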

Solution Options

We need to recover the original namespace prefixes from the Turtle file; in effect, to translate from the full IRIs in the OWLNETS files back to the Turtle prefixes.

The most straightforward way would be simply to add more "special cases" to the existing codeReplacements function. This would be justified in that:

  1. We could select only those prefixes that relate to the nodes of interest.
  2. The set of possible cases is likely to be small. We're only dealing with a handful of Turtle files (fewer than 5) for the initial round.
  3. The Turtle files are already published.
  4. Authors of Turtle files can argue that the Turtle files are in a legitimate format, and it's up to us to translate them correctly to OWLNETS. The issues actually arise from the need to serialize Turtle to RDF/XML before running PheKnowLator.

We could automate this to a degree by having the framework read the original Turtle files and extract namespaces from the prefixes. However, we would need to provide a list of Turtle files to read, and some of the prefixes are actually for relationships. This does not seem to be a much better solution than adding special cases manually.
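For reference, the automated option might look roughly like the sketch below. It uses a regular expression rather than a full Turtle parser (rdflib's Graph.namespaces() would be the more robust route), and the names extract_prefixes and compact are hypothetical. It collects every @prefix declaration, including the ones that stand for relationships, so a manual filter would still be needed.

```python
import re
from typing import Dict

# Matches lines such as: @prefix ILX: <http://uri.interlex.org/base/ilx_> .
PREFIX_PATTERN = re.compile(r'^\s*@prefix\s+([^\s:]*):\s*<([^>]*)>\s*\.', re.MULTILINE)


def extract_prefixes(turtle_text: str) -> Dict[str, str]:
    """Map each namespace IRI to its Turtle prefix (hypothetical helper)."""
    return {iri: prefix for prefix, iri in PREFIX_PATTERN.findall(turtle_text)}


def compact(iri: str, namespaces: Dict[str, str]) -> str:
    """Rewrite a full IRI as prefix:local using the longest matching namespace."""
    matches = [ns for ns in namespaces if iri.startswith(ns)]
    if not matches:
        return iri  # No known namespace: leave the IRI unchanged.
    best = max(matches, key=len)
    return f"{namespaces[best]}:{iri[len(best):]}"


# Prefix declarations copied from the NPO examples earlier in this issue.
sample_turtle = """\
@prefix BFO: <http://purl.obolibrary.org/obo/BFO_> .
@prefix ILX: <http://uri.interlex.org/base/ilx_> .
@prefix ilxr: <http://uri.interlex.org/base/readable/> .
@prefix ilxtr: <http://uri.interlex.org/tgbugs/uris/readable/> .
"""

namespaces = extract_prefixes(sample_turtle)
compacted = compact("http://uri.interlex.org/base/ilx_0101528", namespaces)
```

Choosing the longest matching namespace avoids a wrong match when one namespace IRI is a leading substring of another.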

AlanSimmons commented 1 year ago

Solution

I developed a solution for Turtle files that also addresses a larger code-maintenance issue in the codeReplacements function.

There are three basic types of conversions required in codeReplacements:

  1. Codes from UMLS SABs
  2. Codes with IRIs that have prefixes that are not in the expected format of prefix/SAB_code, but that can be handled with a simple replacement of the prefix with a SAB. Most of these originate with Turtle conversions, but there are some from other sources.
  3. Codes with IRIs that require more complicated formatting, such as those from EDAM.

The codeReplacements function continues to contain the logic for handling the UMLS nodes and the truly special cases (1 and 3 above). However, for the case of simple prefix-SAB mappings, the function now reads a CSV file in the application directory named prefixes.csv.

The use of the prefixes.csv resource file should make it easier to respond to new sets of assertions without significant modification of the logic in the codeReplacements function.
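A minimal sketch of how a simple prefix-to-SAB lookup of this kind could work is shown below. The column names (prefix, SAB), the sample rows, the SAB value ILXTR, the local name exampleTerm, and the SAB:code output shape are assumptions for illustration; the actual layout of prefixes.csv and the logic in codeReplacements may differ.

```python
import csv
import io
from typing import Dict, TextIO

# Stand-in for the contents of prefixes.csv; columns and rows are assumed.
SAMPLE_PREFIXES_CSV = """\
prefix,SAB
http://uri.interlex.org/base/ilx_,ILX
http://uri.interlex.org/tgbugs/uris/readable/,ILXTR
"""


def load_prefix_map(csv_file: TextIO) -> Dict[str, str]:
    """Read prefix-to-SAB rows into a dictionary (hypothetical helper)."""
    return {row['prefix']: row['SAB'] for row in csv.DictReader(csv_file)}


def replace_prefix(iri: str, prefix_map: Dict[str, str]) -> str:
    """Replace a known IRI prefix with its SAB; return the IRI unchanged otherwise."""
    # Try longer prefixes first so the most specific mapping wins.
    for prefix in sorted(prefix_map, key=len, reverse=True):
        if iri.startswith(prefix):
            return f"{prefix_map[prefix]}:{iri[len(prefix):]}"
    return iri


prefix_map = load_prefix_map(io.StringIO(SAMPLE_PREFIXES_CSV))
ilx_code = replace_prefix("http://uri.interlex.org/base/ilx_0101528", prefix_map)
ilxtr_code = replace_prefix("http://uri.interlex.org/tgbugs/uris/readable/exampleTerm", prefix_map)
```

With this arrangement, supporting a new set of assertions means adding a row to the CSV file rather than another branch in the function.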