Codebase associated with Reusability report: Prostate cancer stratification with diverse biologically-informed neural architectures. We re-implement the neural network architecture from Biologically informed deep neural network for prostate cancer discovery in PyTorch.
Additionally, we implement three kinds of graph architectures: a simple graph convolutional network, a graph attention network, and MetaLayer. Graphs are constructed using gene connectivity patterns from the HumanBase database.
Start by setting up the environment: switch to the repo folder and run
conda env create -f environment.yml
This requires Conda to be installed first. Choose yes when the installer asks whether to run conda init. You can disable automatic activation of the base Conda environment by running conda config --set auto_activate_base false. This is useful if you work with both conda and venv environments on the same machine.
Next activate the environment and install the package using
conda activate cancerenv
pip install .
For development, make an editable install:
conda activate cancerenv
pip install -e .
NB: the data files involved are ~20GB. If you want to store these outside the repo, we suggest making cancer-net/data a symlink to the location on your system where you would like to store the data. Then proceed with the following steps:
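The symlink suggestion above can be sketched in Python as follows. This is only an illustration, not part of the repository: the function name link_data_dir and the example storage path are hypothetical, and it assumes a platform where creating symlinks is permitted.

```python
from pathlib import Path


def link_data_dir(repo_data: Path, storage: Path) -> Path:
    """Make `repo_data` a symlink to `storage` and return the resolved target.

    `link_data_dir` is a hypothetical helper written for this example.
    """
    # Create the external storage directory if it does not exist yet.
    storage.mkdir(parents=True, exist_ok=True)
    if repo_data.is_symlink():
        # Already linked: leave it alone and report where it points.
        return repo_data.resolve()
    if repo_data.exists():
        # Refuse to replace a real directory that may already hold data.
        raise FileExistsError(f"{repo_data} exists and is not a symlink")
    repo_data.parent.mkdir(parents=True, exist_ok=True)
    repo_data.symlink_to(storage.resolve(), target_is_directory=True)
    return repo_data.resolve()
```

For instance, link_data_dir(Path("cancer-net/data"), Path("/scratch/me/cancer-data")) would point the repo's data folder at a scratch location; the scratch path here is a placeholder to replace with your own.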
1. Run bash pull_data.sh to download the data files.
2. Run python3 01-network_id_conversion.py inside the conda environment. This converts the HumanBase genes from Entrez IDs to the HGNC gene symbol identifiers used in TCGA.
3. When a PnetDataSet is first initialised, the code will use these HGNC gene symbol identifiers to construct a graph with connections between genes meeting a certain connection threshold (we have set 0.5 in our results). This can take ~20 minutes, so the graph is cached and saved as a pickle object. NB: this process requires a high-memory node, ~128GB. The next time a PnetDataSet is initialised, it will load the cached graph if it can find one, instead of reconstructing it.
Example notebooks are in the demos folder. These notebooks load a dataset, split it into train/validation/test sets, and train a neural network on the data. They also include calculations of various performance metrics.
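The threshold-then-cache behaviour described for graph construction can be illustrated with a minimal sketch. This is not the PnetDataSet implementation: the function names, the (gene_a, gene_b, weight) edge format, and the choice of >= for "meeting" the threshold are all assumptions made for the example.

```python
import pickle
from pathlib import Path


def build_edges(weighted_edges, threshold=0.5):
    """Keep gene pairs whose connection weight is at least `threshold`.

    Assumes each item is a (gene_a, gene_b, weight) tuple; the real
    HumanBase file layout may differ.
    """
    return [(a, b) for a, b, w in weighted_edges if w >= threshold]


def load_or_build_graph(weighted_edges, cache_path, threshold=0.5):
    """Return the thresholded edge list, reusing a pickle cache if one exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the (potentially slow) construction step.
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    edges = build_edges(weighted_edges, threshold)
    # Cache miss: save the result so later calls can load it directly.
    with cache_path.open("wb") as fh:
        pickle.dump(edges, fh)
    return edges
```

Note that this sketch does not key the cache on the threshold, so changing the threshold would require deleting the cached file or writing to a different cache path.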
To produce the results in Reusability report: Prostate cancer stratification with diverse biologically-informed neural architectures, we used the scripts in reprod_report. This folder contains scripts for both the hyperparameter sweeps and the initialisation variance tests. We use Weights & Biases to monitor model training and performance, so these scripts rely on wandb. Additionally, model weights and performance metrics are saved locally in the wandb save directory. NB: wandb results for the runs in abs/2309.16645 can be viewed at the following links: