xomicsdatascience / transcriptome-proteome-nas-manubot

Creative Commons Attribution 4.0 International

Adequacy of NAS proteome prediction outside of cancer studies. #5

Open Gscorreia89 opened 1 month ago

Gscorreia89 commented 1 month ago

Hi,

I am currently analysing a human transcriptome (both RNA-seq and miRNA-seq) dataset from myometrium samples, for which paired proteomic data will be acquired very soon. I was looking for approaches to predict the proteome from transcriptome data and just came across the NAS transcriptome-proteome preprint, which looks very interesting, and I am keen to try it on our data.

Is there any reason why this model shouldn't be applicable to non-cancer datasets? If this seems feasible, what is the best way to get started? I was thinking of using a NAS model trained on CPTAC to predict a proteome and then comparing it with the real measurements. Does the code provided in https://github.com/xomicsdatascience/RnaToProteinDataModule easily support external datasets?

Gonçalo

CCranney commented 1 month ago

Hi @Gscorreia89! Thank you for your interest!

This model can definitely be applied to non-cancer datasets. In fact, when we were starting the project, we actually played around with including Alzheimer's Disease (AD) datasets, though we ultimately opted to stick to CPTAC data. However, the framework was built to accommodate additional datasets.

What we did was create a class called Dataset (found in src/RnaToProteinDataModule/Dataset.py) that can be extended for specific data types. We made two child classes of it, CptacDataset and AdDataset. The idea is that, provided the correct functions are implemented in a child class, the NAS model can load the data in a consistent way regardless of the data source.

Generally when implementing a child class of Dataset, the following should be kept in mind:

  1. You will need to load the proteome and transcriptome values in the init function. These should ultimately be pandas dataframes, where the row indices are sample IDs and the column names are proteins/transcripts.
  2. There are two main abstract functions you will need to implement: match_patient_ids_between_omic_layers and deal_with_isoforms. The first ensures that the proteome and transcriptome share the same set of IDs; the second specifies how identified isoforms should be handled.
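
To make the two points above concrete, here is a rough sketch of what such a child class might look like. The class name, the constructor taking dataframes directly (the real class would load them from files), and the isoform-averaging policy are all my assumptions for illustration; check src/RnaToProteinDataModule/Dataset.py for the real interface.

```python
import pandas as pd

class Dataset:
    """Stand-in for the real base class in src/RnaToProteinDataModule/Dataset.py."""
    pass

class MyometriumDataset(Dataset):  # hypothetical child class
    def __init__(self, transcriptome, proteome):
        # In a real implementation you would load these from files here.
        # Rows are sample IDs, columns are transcripts/proteins.
        self.transcriptome = transcriptome
        self.proteome = proteome

    def match_patient_ids_between_omic_layers(self):
        # Keep only the samples measured in both omic layers.
        shared = self.transcriptome.index.intersection(self.proteome.index)
        self.transcriptome = self.transcriptome.loc[shared]
        self.proteome = self.proteome.loc[shared]

    def deal_with_isoforms(self):
        # One simple policy (an assumption, not the repo's): average
        # proteome columns that share a base identifier after stripping
        # an isoform suffix such as "P123-2".
        base = self.proteome.columns.str.split("-").str[0]
        self.proteome = self.proteome.T.groupby(base).mean().T
```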

That is a quick introduction, but I'd be happy to help work through making a child class for a new dataset.

Some extra info:

Datasets are processed by the DatasetProcessor class, which synchronizes them with each other before loading them into the NAS (for example, making sure the same proteins are shared between proteome datasets). See the prepare_data function for the order in which these steps are called.
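
The protein-sharing step could be sketched like this; the function name is hypothetical and the real logic lives in DatasetProcessor.prepare_data, but the idea is simply an intersection over the column sets:

```python
from functools import reduce

import pandas as pd

def restrict_to_shared_proteins(proteomes):
    # Keep only the protein columns that appear in every dataset,
    # so all proteome dataframes line up column-for-column.
    shared = reduce(lambda a, b: a.intersection(b),
                    (df.columns for df in proteomes))
    return [df[shared] for df in proteomes]
```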

I should also note the existence of the DatasetSplitter classes. In a nutshell, how you perform and normalize the train/validation split can change depending on the program you are trying to run. The DataProcessor instance used in the NAS uses a 90:10 train:validation split (StandardDataSplitter). Other options include not splitting the data at all (NoSplitJustNormalizer), which you may want for datasets that supplement your training, should your target dataset be too small for adequate training on its own. The FiveByTwoDataSplitter class is specific to running a 5x2 cross-validation experiment and is not used in a NAS.
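
For intuition, a StandardDataSplitter-style 90:10 split with normalization might look like the sketch below. This is an illustration of the general pattern, not the repo's implementation; the function name and the choice of z-score normalization fit on the training split are my assumptions.

```python
import numpy as np

def split_and_normalize(X, val_fraction=0.1, seed=0):
    # Shuffle sample indices and hold out val_fraction for validation.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    X_train, X_val = X[train_idx], X[val_idx]
    # Fit normalization statistics on the training split only, then
    # apply them to both splits to avoid leakage into validation.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8
    return (X_train - mu) / sigma, (X_val - mu) / sigma
```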

jessegmeyerlab commented 1 month ago

Thanks for your interest @Gscorreia89.

It should also be noted that you will need at least hundreds of paired proteomes/transcriptomes from the same samples to train models from scratch; at least 1,000 pairs would be better. If that scale is not possible, you could alternatively tune the model with part of your data. Note that tuning would require complete overlap in the measured transcripts.
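
A quick way to check that overlap requirement ahead of time (a hypothetical helper, not part of the repo):

```python
def transcript_overlap(model_transcripts, sample_transcripts):
    # Tuning requires every transcript the model was trained on to be
    # present in your measurements; report any that are missing.
    missing = set(model_transcripts) - set(sample_transcripts)
    return len(missing) == 0, sorted(missing)
```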

Gscorreia89 commented 1 month ago

Hi @CCranney and @jessegmeyerlab,

Thanks for the quick and detailed reply. I don't have samples at that scale; what I had in mind is simply to use the "best model" trained on CPTAC data, as per your instructions and examples, to obtain predictions that I can compare against the real measurements once the proteome data is acquired.

By the way, what type of computational resources do you require/suggest for fitting the NAS model on CPTAC as per your examples?

CCranney commented 1 month ago

Do you mean for running the NAS itself, or for the optimal model that we found using the NAS? If the former, we ran it on our HPC GPU with 40-50 GB of memory. That was likely overkill, but better to overshoot than undershoot in those instances.

CCranney commented 1 month ago

Here's an example of a script I ran on the HPC for a Neural Architectural Search:

#!/bin/bash
# 
#SBATCH -p gpu # partition (queue)
#SBATCH -c 1 # number of cores
#SBATCH --gpus-per-node=1
#SBATCH --mem 50G # memory pool for all cores
#SBATCH -t 11-12:00 # time (D-HH:MM)
#SBATCH --job-name=NAS_hier
#SBATCH -o <path>/axBatchRuns/slurmOutput/slurm.%N.%j.out # STDOUT
#SBATCH -e <path>/axBatchRuns/slurmOutput/slurm.%N.%j.err # STDERR
#SBATCH --mail-type=ALL

eval "$(conda shell.bash hook)"
conda activate ax2
cd <path>/axBatchRuns/
python run_nas_on_sample_dataset.py

Gscorreia89 commented 1 month ago

@CCranney thanks for sharing the HPC script header with the resource info. I will first have a go at using the NAS best model on my samples and will get back to you if I run into any issues at this stage.