SR-TWAS stands for Stacked Regression-Transcriptome-Wide Association Study, which is developed using Python and BASH scripts for creating improved Genetically Regulated gene eXpression (GReX) prediction models by using an Ensemble Machine Learning technique of Stacked Regression to form optimal linear combinations of previously trained gene expression imputation models. SR-TWAS allows users to leverage multiple reference panels of the same tissue type in order to improve GReX prediction accuracy and TWAS power with increased effective training sample sizes.
SR-TWAS: Leveraging Multiple Reference Panels to Improve Transcriptome-Wide Association Study Power by Ensemble Machine Learning. Randy L. Parrish, Aron S. Buchman, Shinya Tasaki, Yanling Wang, Denis Avey, Jishu Xu, Philip L. De Jager, David A. Bennett, Michael P. Epstein, Jingjing Yang. Nature Communications 15, 6646 (2024). doi: https://doi.org/10.1038/s41467-024-50983-w
*.sh
files in the SR-TWAS directory executable
SR_TWAS_dir="/home/jyang/GIT/SR-TWAS"
chmod 755 ${SR_TWAS_dir}/*.sh
Example input files provided under ./ExampleData/
are generated artificially. All input files are Tab Delimited Text Files.
--genofile_type vcf
--format GT
--format DS
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | Sample1 | Sample... |
---|---|---|---|---|---|---|---|---|---|---|
1 | 100 | rs1 | C | T | . | PASS | . | GT:DS | 0/0:0.01 | ... |
bgzip
, and tabixed by tabix
tabix -f -p vcf *.vcf.gz
../ExampleData/genotype.vcf.gz
--genofile_type dosage
CHROM | POS | ID | REF | ALT | Sample1 | Sample... |
---|---|---|---|---|---|---|
1 | 100 | rs** | C | T | 0.01 | ... |
./ExampleData/train_sampleID.txt
--train_sampleID
CHROM | GeneStart | GeneEnd | TargetID | GeneName | Sample1 | Sample... |
---|---|---|---|---|---|---|
1 | 100 | 200 | ENSG0000 | X | 0.2 | ... |
./ExampleData/gene_expression.txt
--gene_exp
--gene_anno
CHROM | POS | REF | ALT | TargetID | ES | MAF |
---|---|---|---|---|---|---|
1 | 100 | C | T | ENSG0000 | 0.2 | 0.02 |
*_eQTLweights.txt
files obtained from the TIGAR tool may be used directly.bgzip -f my_eQTLweights.txt && tabix -f -b2 -e2 -S1 my_eQTLweights.txt.gz
.ES
to denote variant weights (estimated eQTL effect sizes from a reference dataset) with for the gene (TargetID).--maf_diff
to exclude exclude eQTL weights obtained from data in which the absolute value between the MAF substantially different from that of the validation data.CHROM:POS:REF:ALT
snpID./ExampleData/eQTLweights.txt
--weights
--chr
: Chromosome number need to be specified with respect to the genotype input data--weights
: Space delimited list of paths to trained eQTL weight files (bgzipped and tabixed). --weights_names
: Space delimited list of names corresponding to base models specified by --weights
; if user-submitted list is not the same length as number of base models or names are non-unique will use default (default: [W0, W1, W2, ..., W{K}]
)--gene_exp
: Path to Gene annotation and Expression file--genofile
: Path to the training genotype file (bgzipped and tabixed)--genofile_type
: Genotype file type: vcf
or dosage
--format
: (Required if genofile_type
is vcf
) Genotype data format that should be used: GT
or DS
GT
: genotype dataDS
: dosage data--train_sampleID
: Path to a file with sampleIDs that will be used for training--maf_diff
: Minor Allele Frequency threshold (ranges from 0 to 1) to exclude eQTL weights obtained from data in which the absolute value between the MAF substantially different from that of the validation data. (ie, if the absolute value of the difference between the validation genotype data and the weight file MAF exceeds maf_diff
, then the effect size from that weight file is excluded). To use this option all weight files MUST contain a MAF
column. (default: 0
[no filtering])--missing rate
: Missing rate threshold. If the rate of missing values for a SNP exceeds this value the SNP will be excluded. Otherwise, missing values for non-excluded SNPs will be imputed as the mean. (default: 0.2
)--hwe
: Hardy Weinberg Equilibrium p-value threshold to exclude variants that violated HWE (default: 0.00001
)weight_threshold
: Exclude eQTL weights with a magnitude less than this threshold (default: 0
[no filtering])--window
: Window size (in base pairs) around gene region from which to include SNPs (default: 1000000
[+- 1MB
region around gene region])--cvR2
: Whether to perform 5-fold cross validation by average R2: 0
or 1
(default: 1
)
0
: Skip 5-fold cross validation evaluation1
: Do 5-fold cross validation and only do final training if the average R2 of the 5-fold validations is at least >= cvR2_threshold--cvR2_threshold
: Threshold for 5-fold CV. (default: 0.005
)--parallel
: Number of simultaneous processes to use for parallel computation (default: 1
)--SR_TWAS_dir
: Specify the directory of SR-TWAS source code--out_dir
: Output directory (will be created if it does not exist)--sub_dir
: Whether to create a subdirectory for output files within the specified output directory. (default: 1
)
0
: Output files will be placed in the output directory1
: Output files will be placed into subdirectory called SR_CHR${chr}
within output directory--out_prefix
: Filenames for the output weight, info, and log files begin with this string if filenames are not otherwise specified using --out_weight_file
, --out_info_file
, --log_file
. (default: CHR${chr}_SR_train
)--out_weight_file
: Filename for output eQTL weights file. (default: ${out_prefix}_eQTLweights.txt
, CHR${chr}_SR_train_eQTLweights.txt
)--out_info_file
: Filename for output training info file. (default: ${out_prefix}_GeneInfo.txt
, CHR${chr}_SR_train_GeneInfo.txt.txt
)--log_file
: (default: ${out_prefix}_log.txt
, CHR${chr}_SR_train_log.txt
)gene_exp="${SR_TWAS_dir}/ExampleData/gene_expression.txt"
train_sampleID="${SR_TWAS_dir}/ExampleData/train_sampleID.txt"
genofile="${SR_TWAS_dir}/ExampleData/genotype.vcf.gz"
out_dir="${SR_TWAS_dir}/ExampleData/output"
weight0="${SR_TWAS_dir}/ExampleData/CHR1_DPR_cohort0_eQTLweights.txt.gz"
weight_name0=cohort0
weight1="${SR_TWAS_dir}/ExampleData/CHR1_DPR_cohort1_eQTLweights.txt.gz"
weight_name1=cohort1
weight2="${SR_TWAS_dir}/ExampleData/CHR1_DPR_cohort2_eQTLweights.txt.gz"
weight_name2=cohort2
${SR_TWAS_dir}/SR-TWAS.sh \
--chr 1 \
--cvR2 1 \
--format GT \
--gene_exp ${gene_exp} \
--genofile ${genofile} \
--genofile_type vcf \
--hwe 0.0001 \
--maf 0.01 \
--out_dir ${out_dir} \
--parallel 2 \
--SR_TWAS_dir ${SR_TWAS_dir} \
--train_sampleID ${train_sampleID} \
--weights ${weight0} ${weight1} ${weight2} \
--weights_names ${weight_name0} ${weight_name1} ${weight_name2}
${out_dir}/SR_CHR${chr}/CHR${chr}_SR_train_eQTLweights.txt
is the file storing all eQTL effect sizes (ES
) estimated from the gene expression imputation model per gene (TargetID
)${out_dir}/SR_CHR${chr}/CHR${chr}_SR_train_GeneInfo.txt
is the file storing information about the fitted gene expression imputation model per gene (per row), including gene annotation (CHROM GeneStart GeneEnd TargetID GeneName
), sample size (sample_size
), number of SNPs used in the model training (N_SNP
), number of SNPs with non-zero eQTL effect sizes (N_EFFECT_SNP
), imputation R2 by 5-fold cross validation (CVR2
), imputation R2 using all given training samples (R2
), p-value of the training R2 (PVAL
), and the zeta weight used for each component weight model (Z0
,Z1
,...), and 4 columns for each component weight model describing performance on the validation data (W0_N_SNP
,W0_CVR2
,W0_R2
,W0_PVAL
,W1_N_SNP
,W1_CVR2
,W1_R2
,W1_PVAL
,...)${out_dir}/logs/CHR${chr}_SR_train_log.txt
is the file storing all log messages for model training.Arguments are the same as for SR-TWAS.
gene_exp="${SR_TWAS_dir}/ExampleData/gene_expression.txt"
train_sampleID="${SR_TWAS_dir}/ExampleData/train_sampleID.txt"
genofile="${SR_TWAS_dir}/ExampleData/genotype.vcf.gz"
out_dir="${SR_TWAS_dir}/ExampleData/output"
weight0="${SR_TWAS_dir}/ExampleData/CHR1_DPR_cohort0_eQTLweights.txt.gz"
weight_name0=cohort0
weight1="${SR_TWAS_dir}/ExampleData/CHR1_DPR_cohort1_eQTLweights.txt.gz"
weight_name1=cohort1
weight2="${SR_TWAS_dir}/ExampleData/CHR1_DPR_cohort2_eQTLweights.txt.gz"
weight_name2=cohort2
${SR_TWAS_dir}/Naive.sh \
--chr 1 \
--cvR2 1 \
--format GT \
--gene_exp ${gene_exp} \
--genofile ${genofile} \
--genofile_type vcf \
--hwe 0.0001 \
--maf 0.01 \
--out_dir ${out_dir} \
--parallel 2 \
--SR_TWAS_dir ${SR_TWAS_dir} \
--train_sampleID ${train_sampleID} \
--weights ${weight0} ${weight1} ${weight2} \
--weights_names ${weight_name0} ${weight_name1} ${weight_name2}
${out_dir}/Naive_CHR${chr}/CHR${chr}_naive_train_eQTLweights.txt
is the file storing all eQTL effect sizes (ES
) estimated from the gene expression imputation model per gene (TargetID
)${out_dir}/Naive_CHR${chr}/CHR${chr}_naive_train_GeneInfo.txt
is the file storing information about the fitted gene expression imputation model per gene (per row), including gene annotation (CHROM GeneStart GeneEnd TargetID GeneName
), sample size (sample_size
), number of SNPs used in the model training (N_SNP
), number of SNPs with non-zero eQTL effect sizes (N_EFFECT_SNP
), imputation R2 by 5-fold cross validation (CVR2
), imputation R2 using all given training samples (R2
), p-value of the training R2 (PVAL
), and the zeta weight used for each component weight model (Z0
,Z1
,...), and 4 columns for each component weight model describing performance on the validation data (W0_N_SNP
,W0_CVR2
,W0_R2
,W0_PVAL
,W1_N_SNP
,W1_CVR2
,W1_R2
,W1_PVAL
,...)${out_dir}/logs/CHR${chr}_naive_train_log.txt
is the file storing all log messages for model training.--chr
: Chromosome number need to be specified with respect to the genotype input data--weights
: Space delimited list of paths to trained eQTL weight files (bgzipped and tabixed). --weights_names
: Space delimited list of names corresponding to base models specified by --weights
; if user-submitted list is not the same length as number of base models or names are non-unique will use default (default: [W0, W1, W2, ..., W{K}]
)--gene_anno
: Path to Gene annotation file--window
: Window size (in base pairs) around gene region from which to include SNPs (default: 1000000
[+- 1MB
region around gene region])--parallel
: Number of simultaneous processes to use for parallel computation (default: 1
)--SR_TWAS_dir
: Specify the directory of SR-TWAS source code--out_dir
: Output directory (will be created if it does not exist)--sub_dir
: Whether to create a subdirectory for output files within the specified output directory. (default: 1
)
0
: Output files will be placed in the output directory1
: Output files will be placed into subdirectory called Avg_CHR${chr}
within output directory--out_prefix
: Filenames for the output weight, info, and log files begin with this string if filenames are not otherwise specified using --out_weight_file
, --out_info_file
, --log_file
. (default: CHR${chr}_AvgvalidSR_train
)--out_weight_file
: Filename for output eQTL weights file. (default: ${out_prefix}_eQTLweights.txt
, CHR${chr}_AvgvalidSR_train_eQTLweights.txt
)--out_info_file
: Filename for output training info file. (default: ${out_prefix}_GeneInfo.txt
, CHR${chr}_AvgvalidSR_train_GeneInfo.txt.txt
)--log_file
: (default: ${out_prefix}_log.txt
, CHR${chr}_AvgvalidSR_train_log.txt
)
gene_anno="${SR_TWAS_dir}/ExampleData/gene_expression.txt"
out_dir="${SR_TWAS_dir}/ExampleData/output"
valid_weight0="${SR_TWAS_dir}/ExampleData/CHR1_DPR_cohort0_eQTLweights.txt.gz"
valid_weight_name0=valid_model0
valid_weight1="${SR_TWAS_dir}/ExampleData/CHR1_DPR_cohort1_eQTLweights.txt.gz"
valid_weight_name1=valid_model1
## run example SR command to get output weight file
SR_weight="${SR_TWAS_dir}/ExampleData/output/SR_CHR1/CHR1_SR_train_eQTLweights.txt.gz"
SR_weight_name=SR
${SR_TWAS_dir}/Avg-valid_SR.sh \
--gene_anno ${gene_exp} \
--chr 1 \
--parallel 2 \
--out_dir ${out_dir} \
--SR_TWAS_dir ${SR_TWAS_dir} \
--weights ${valid_weight0} ${valid_weight1} ${SR_weight} \
--weights_names ${valid_weight_name0} ${valid_weight_name1} ${SR_weight_name}
${out_dir}/Avg_CHR${chr}/CHR${chr}_AvgvalidSR_train_eQTLweights.txt
is the file storing all eQTL effect sizes (ES
) estimated from the gene expression imputation model per gene (TargetID
)${out_dir}/Avg_CHR${chr}/CHR${chr}_AvgvalidSR_train_GeneInfo.txt
is the file storing information about the fitted gene expression imputation model per gene (per row), including gene annotation (CHROM GeneStart GeneEnd TargetID GeneName
), sample size (sample_size
), number of SNPs used in the model training (N_SNP
), number of SNPs with non-zero eQTL effect sizes (N_EFFECT_SNP
), and an column for component weight model with the number of SNPs that were used from that model (W0_N_SNP
,W1_N_SNP
,...)${out_dir}/logs/CHR${chr}_AvgvalidSR_train_log.txt
is the file storing all log messages for model training.