The whole PEDIA workflow uses snakemake to run the pipeline, together with conda/bioconda to install the necessary programs. Please get familiar with both BEFORE starting the workflow. A good starting point is the snakemake tutorial.
Have a look at the miniconda website and be sure to choose the right installer for your Python version. To find out which Python version you have, type:
python --version
Let's go into the data folder to download external files.
cd data
Now we will create a shell environment with all programs necessary for downloading and processing files. The required software is listed in the environment.yaml file, which conda can read:
conda env create -f environment.yaml
We have now created an environment called pedia_download. We can activate it, and snakemake should be installed:
source activate pedia_download
snakemake -h
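To verify the installation, you can also print the installed version:
snakemake --version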
We can deactivate the environment using source deactivate. The command conda env list lists all environments.
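The active environment is marked with an asterisk. A listing might look like this (environment paths are illustrative):
# conda environments:
#
# base                     /home/user/miniconda3
# pedia_download        *  /home/user/miniconda3/envs/pedia_download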
Now let's run the PEDIA download workflow. We can make a "dry run" using snakemake -n to see what snakemake would do.
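Adding -p to the dry run additionally prints the shell commands each rule would execute:
snakemake -n -p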
Set up a Python virtual environment for the preprocessing scripts:
python3 -m venv <dir of choice, eg env>
source <env>/bin/activate (command can vary depending on the shell used)
pip install -r requirements.txt
Optional: run the tests. The tests require the following files:
tests/data/cases/51702.json
tests/data/genomics_entries/2669.json
tests/data/config.ini -- necessary for API keys
Run the test suite from the project base directory:
python3 -m unittest discover
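For more detailed per-test output, pass the verbose flag:
python3 -m unittest discover -v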
Obtain additional data files and put them into the specified locations:
data/mim2gene.txt -- publicly available on omim.org
data/morbidmap.txt -- requires API-key access to download
Files inside scripts/ can be used as inspiration for your own use cases. All scripts should be run from the project base directory to automatically include the lib package containing the actual program code.
Most configuration options are in a config.ini file, with options commented. A config.ini.SAMPLE in the project directory can be used as a reference for creating your own configuration.
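For example, you can start your own configuration by copying the sample and filling in your API keys:
cp config.ini.SAMPLE config.ini
# now edit config.ini and add your API keys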
HGVS variant overrides are specified in hgvs_errors.json, which is by default searched for in the project root. The required version of the hgvs errors file is specified in lib/constants.py; an error is raised if no hgvs errors file of at least the specified version is found. The number can be lowered manually to accept older hgvs error files. A version of 0 means no hgvs_errors file is required.
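To check which version your checkout expects, you can inspect the constants file (the exact constant name may differ):
grep -i hgvs lib/constants.py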
Since some steps depend on the existence of API keys, running the preprocess.py script without a configuration file will not work. Keep in mind that the virtual environment needs to be activated for script execution. The preprocess.py script contains most of the functionality needed to convert new JSON files into the old format required by the subsequent pipeline steps.
# do not forget to activate the previously created virtual environment
# get a list of usable options
./preprocess.py -h
# run complete process with AWS synchronization
./preprocess.py
# run for a single file (specifying output folder is beneficial)
./preprocess.py -s PATH_TO_FILE -o OUTPUT_FOLDER
There are three steps to run the pipeline.
Environment setup
source activate pedia_download
snakemake all
Download cases and perform preprocessing
python3 preprocess.py
"vcf": [
"28827.vcf.gz"
],
Get JSON files of simulated cases and real cases
To obtain the CADD scores of the variants, we need to annotate the VCF files, retrieve the CADD scores, and append them to the geneList in the JSON file. Now we go to the 3_simulation folder and activate the simulation environment.
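Assuming the environment for this step is named simulation (check the environment yaml in 3_simulation for the actual name), this looks like:
cd 3_simulation
source activate simulation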
Note: you can skip this step by running the experiment in classifier. The classifier will trigger this subworkflow to generate the JSON files.
Before we start, we would like to explain the two experiments we want to conduct in this study. The first is to perform cross-validation on all cases to evaluate the performance across the three simulation samples (1KG, ExAC and IRAN). The second is to train the model with simulated cases and test it on the real cases. To achieve these two goals, we use the following commands to perform the simulation and generate the final JSON files.
To perform the CV experiment, run the following command to obtain the JSON files simulated from the 1KG data. You can replace 1KG with ExAC or IRAN:
snakemake performanceEvaluation/data/CV/1KG.csv
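The corresponding targets for the other two simulation samples are:
snakemake performanceEvaluation/data/CV/ExAC.csv
snakemake performanceEvaluation/data/CV/IRAN.csv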
To perform the second experiment, run the following commands to obtain the training and testing data sets. Generate the JSON files of the real cases; the output will be in 3_simulation/json_simulation/real/test:
snakemake createCsvRealTest
Generate the JSON files of the simulated cases; the output will be in 3_simulation/json_simulation/real/train/1KG. You can replace 1KG with ExAC or IRAN:
snakemake performanceEvaluation/data/Real/train_1KG.csv
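Again, the corresponding targets for the other samples are:
snakemake performanceEvaluation/data/Real/train_ExAC.csv
snakemake performanceEvaluation/data/Real/train_IRAN.csv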
The final JSON files are in the 3_simulation/json_simulation folder.
Cross-validation evaluation
snakemake -p --cores 3 CV_all
snakemake ../output/cv/CV_1KG/run.log
Train and test evaluation
snakemake ../output/real_test/1KG/run.log
Train with all cases and test on a patient with unknown diagnosis
snakemake ../output/test/1KG/21147/21147.log
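To score a different patient, substitute the case ID (21147 above) in the target path:
# CASE_ID is a placeholder for your patient's case ID
snakemake ../output/test/1KG/CASE_ID/CASE_ID.log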
How to read the PEDIA results?