# The integrated DIetary Supplements Knowledge base (iDISK)
iDISK is a comprehensive knowledge base of dietary supplement ingredients, products, and related concepts. This repository contains the source code required for building the knowledge base. The current version of the knowledge base, in both Neo4j and RRF formats, is available for download at https://doi.org/10.13020/d6bm3v or from the releases.
iDISK is intended to be used for educational purposes only and nothing contained therein should take the place of professional medical advice. The information provided in iDISK is obtained from secondary resources and does not necessarily reflect the views or opinions of those resources. You are responsible for checking the accuracy of relevant facts and opinions given in iDISK before entering into any commitment based upon them.
Rizvi RF, Vasilakes J, Adam TJ, Melton GB, Bishop JR, Bian J, Tao C, Zhang R. iDISK: the integrated DIetary Supplements Knowledge base. Journal of the American Medical Informatics Association. 2020 Apr;27(4):539-48.
iDISK/ # Top-level directory containing all iDISK related files. Set as the $IDISK_HOME environment variable.
README.md # This file.
lib/ # Functions and/or scripts common to every version of iDISK.
sources/ # Contains the source databases. See below.
versions/ # Contains directories containing different iDISK versions.
lib/
idlib/ # The iDISK API library. A Python package. See corresponding documentation for details.
config/ # The iDISK config files.
kb.ini # Defines basic information about each iDISK version.
linkers.conf # Instructions on how to instantiate various entity linkers.
prodigy.json # The prodigy recipe for synonymy annotation.
schemas/ # Where schema versions are stored, as Cypher files.
schemas.ini # Metadata about schema versions.
annotation/ # Scripts related to annotation of iDISK concepts and related data.
check_content.py # Checks source data against a schema to ensure consistency.
count_data_elements.py # Script for counting iDISK data elements in a given version.
filter_connections*.py # Scripts for determining which concepts to merge.
sources/
README.md # File describing the general process for importing source databases into iDISK.
NHP/ # The name of the source database.
08_01_2018/ # The date (MM_DD_YYYY) when the source files were downloaded.
README.md # Documentation for this download, including download URL, data version (if applicable), caveats, etc.
download/ # Contains the downloaded data files in their original raw format.
import/ # Contains the data files in a standard format for importing into iDISK.
preprocess/ # Files containing any intermediary preprocessing moving from download/ to import/.
concepts.jsonl # The file containing the concepts to import into iDISK.
scripts/ # Scripts for converting the source data in its raw format to the iDISK format.
12_01_2017/
.../
DSLD/
.../
versions/
1.0.0/
CHANGELOG.md # Changelog for this version of iDISK.
VERSION_INFO.txt # Config log for this verion.
concepts/ # Contains the intermediate files generated to build iDISK.
build/ # Contains the final iDISK data files.
Neo4j/ # The Neo4j files corresponding to this version of iDISK.
RRF/ # The RRF files corresponding to this version of iDISK.
.../ # iDISK in other output formats as required.
After cloning this repository, install idlib
, the iDISK API library.
idlib
is bundled with iDISK. Go to lib/idlib
and follow the instructions in the README.
The API documentation is available at lib/idlib/docs/_build/html/index.html
.
Before getting started, run
source activate idisk
Copy Makefile.example
to a nice, project-specific name like Makefile.myproject
. Then, in the new Makefile,
edit the variables under the PROJECT CONFIGURATION
section as necessary for your project and your working environment.
Then run,
make version
which will set up a new project folder at $(PROJECT_HOME)/versions/$(PROJECT_VERSION)
as specified in your Makefile.
N.B. If this is not the first ever version of the database, you are encouraged to fill out the changelog location at ${PROJECT_HOME}/versions/${PROJECT_VERSION}/CHANGELOG.md
with any changes or additions over the previous version.
iDISK is built using Neo4j, so the first step is to install the latest version of Neo4j.
Neo4j is used both to define the iDISK schema as well as to hold the final database. Here, we discuss creating and using
a schema. Neo4j graphs are specified using a query language called Cypher and we've supplied Cypher scripts for the
available iDISK versions at lib/config/schemas/schema_${SCHEMA_VERSION}.cypher
.
N.B. Schema versions are specified using semantic versioning (major.minor.patch). The schema versions are distinct from the iDISK versions. That is, if iDISK updates to a new version, the schema will not necessarily require updating. However, a new schema version always necessitates a new iDISK version.
Once you've installed Neo4j, open Neo4j Desktop, create a new graph, and start it (leaving it empty for now).
Take note of the username, password, and BOLT URI for this graph and enter them into the schema configuration file at
lib/schemas/schemas.ini
. Finally, run the following to create the schema graph:
make schema
Now, if you go back to Neo4j Desktop and open your graph in the Neo4j browser, you'll see it has been populated. You can
view the entire schema with the Cypher query MATCH(n) RETURN(n)
.
This project is specified by a few config files in lib/config/
. The main config file to consider is
kb.ini
, which, in addition to the schema, specifies certain constraints that apply to the data in each version iDISK.
The constraints are: 1) accepted data sources, 2) accepted term types, and 3) the concept types
schemas.ini
works alongside the Cypher schemas in lib/config/schemas/
to manage schema instances in Neo4j.
All source data for iDISK resides in the sources/
directory.
The rest of this guide assumes you have properly populated the sources/
directory.
You can find a detailed example of how to do this at sources/example_src
.
We'll check that the source data (i.e. the files belonging to the SOURCE_FILES
variable
in the Makefile
) is formatted properly. With the schema graph running,
make check_contents
This will check each file against the schema to make sure all concepts, attributes, relationships, etc.
are properly formatted. This script will print out the number of found issues for each source data file
and write any issues to a a log file at ${source_file}.error
.
Any concept or relationship object that we want to link to an existing terminology must have
a links_to
attribute as part of its corresponding type in the schema, the value of which is an key in lib/config/linkers.conf
.
Each entry in linkers.conf
specifies the required parameters to instantiate an
idlib.entity_linking.linkers.EntityLinker
, the name of which is given by the "class_name"
attribute.
For example, in schema 1.0.0 the links_to
attribute of the PD (pharmaceutical drug) concepts is
umls.quickumls.pd
. The entry under umls.quickumls.pd
in lib/config/linkers.conf
has the
"class_name": QuickUMLSDriver
, which is the corresponding class name, while the remaining attributes
specify the arguments to instantiate a QuickUMLSDriver
.
First, ensure that the Neo4j schema graph is running. Then, create or edit linkers.conf
as necessary to properly instantiate the corresponding EntityLinker
classes. Run entity linking by
make link_entities
This will likely take a few mintues. Check the progress in the log file at
$(PROJECT_VERSION)/concepts/concepts_linked.jsonl.log
.
The next step is to generate candidate synonymy connections between the concepts in the source files. This is implemented in two steps. The first is focused on recall, generating a candidate connection between two concepts if they share one or more atom terms. This step is run by,
make connections
Note that if there are any concept types that you do not want to consider matching, this can
be specified in the Makefile via the --ignore_concept_types
option in the connections
recipe.
Currently, only dietary supplement products (concept type DSP) are ignored.
These connections can be used directly, but it is advisable to filter them to improve precision. Two methods of filtering connections are implemented: The first removes connections based on some simple rules (specifically, the two concepts must be linked to the same entry in an external terminology in the previous step); the second removes connections using human annotations.
To run the first method:
make filter_connections
See below for filtering according to human annotation.
iDISK implements the Prodigy annotation tool for classifying connected pairs as one of the following labels:
If you are qualified to use Prodigy (it's not free) and have it installed in the idisk
virtual environment,
you can run the annotation task with:
make run_annotation
Once the annotation is complete, filter the connections according to the annotations with
make filter_connections_ann
Finally, now that we're confident in our connected concepts, we can merge them with
make merge
The output of make merge
is a JSON file in the iDISK format that is perfectly usable as a knowledge base within Python using the idlib
API.
>>> import idlib
>>> kb = idlib.load_kb("path/to/version/directory")
>>>
>>> # Print the first relationship of the first Concept and the first relationship of its object Concept.
... for rel in kb[1].get_relationships():
... print(rel)
... for rel2 in rel.object.get_relationships():
... print(rel2)
... break
... break
DSLD0001355: Met-Rx - Pure Protein Shake Vanilla Cream **has_ingredient** NHPID_DSLD_MSKCC0478268: Vitamin A
NHPID_DSLD_MSKCC0478268: Vitamin A **interacts_with** MSKCC0481800: Retinoids
Neo4j is a much better option for storing and using the knowledge base. To create the Neo4j version of iDISK, first create an emtpy Neo4j graph. Then run
make neo4j
This command will populate the graph with the iDISK data elements. It will take a few minutes. Once the database is populated it can be exported by running
bin/neo4j-admin dump --database=graph.db \
--to=/absolute/path/to/destination/idisk-neo4j-<version>-<year>-<month>-<date>.dump
In the Neo4j Terminal tab.
We have provided example Neo4j queries at doc/iDISK_Neo4j_example_queries.pdf
.
The RRF data file format is alternative output format for iDISK. Used by the Unified Medical Language System (UMLS), it consists of a set of pipe-delimited
flat files: MRSTY.RRF
(the types of the concepts), MRCONSO.RRF
(the atoms for each concept), MRSAT.RRF
(the attributes), and MRREL.RRF
(the relationships).
Create these files by running
make rrf
While the resulting files of the make rrf
command are technically of the same format as the files used by the UMLS, they are not strictly compatible. It is possible
to extend the UMLS Metathesaurus files with the iDISK terminology (in a very crude fashion). While there is no Makefile recipe for this yet, it can be done by running
python lib/idlib/idlib/formatters/umls.py --idisk_version_dir versions/$(PROJECT_VERSION) \
--outdir versions/$(PROJECT_VERSION)/build/UMLS/$(METATHESAURUS_VERSION)/ \
--umls_mth_dir /path/to/umls/rrf/files \
--use_semtypes lib/idlib/idlib/formatters/semantic_types_2016.txt