weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0
192 stars 73 forks source link

We have released a new version 1.0.0 (on March 15, 2022). It has substantial computation efficiency improvements for both Step 1 and Step 2 for single-variant and set-based tests. We have created a new program github page https://github.com/saigegit/SAIGE with the documentation provided https://saigegit.github.io/SAIGE-doc/ The program will be maintained by multiple SAIGE developers there. The docker image has been updated. Please feel free to try the version 1.0.0 and report issues if any.

Thanks!

Table of Contents

Introduction

Manuscript for SAIGE-GENE+: https://www.medrxiv.org/content/10.1101/2021.07.12.21260400v1

Current version is 0.45 (Updated on January 24, 2022) - comment out the part to estimate the effective sample sizes, which may not convert and take very long; put <= instead of < for maxMAF in the gene-based tests

Current version is 0.44.6.5 (Updated on August 18, 2021) - 0.44.6.2 add extdata/extractNglmm.R to extract the effective sample size without running Step 1. extdata/cmd_extractNeff.sh has the pipeline. The effective sample size (Nglmm) is differently calculated than the previous versions.

Previous version is 0.44.6.1 (Updated on July 16, 2021) - SAIGE-GENE+: for group tests, collpasing ultra-rare variants with MAC <= 10. Set --method_to_CollapseUltraRare="absence_or_presence" as default to collpase ultra-rare varaints with MAC <= 10. SAIGE-GENE+ has well controlled type I error rates when the maximum MAF cutoff (maxMAFforGroupTest) is lower than 1%, e.g. 0.01% or 0.1%. Tests with multiple MAF cutoffs and variant annotations can be combined using the Cauchy combination (function CCT)

Please re-install 0.44.2 if you installed this verion on March 31.

For BGEN input, 8 bits are required.

For BGEN input in step 2 with missing dosages, Please use version 0.38 or later.

SAIGE is an R package with Scalable and Accurate Implementation of Generalized mixed model (Chen, H. et al. 2016). It accounts for sample relatedness and is feasible for genetic association tests in large cohorts and biobanks (N > 400,000).

SAIGE performs single-variant association tests for binary traits and quantitative taits. For binary traits, SAIGE uses the saddlepoint approximation (SPA)(mhof, J. P. , 1961; Kuonen, D. 1999; Dey, R. et.al 2017) to account for case-control imbalance.

SAIGE-GENE (implemented in the SAIGE R package) performs gene- or region-based association tests (Burde, SKAT, SKAT-O) for binary traits and quantitative traits. Note: SAIGE-GENE accounts for case-control imbalance in gene-based tests (>= 0.35.8.5)

Citation

The SAIGE manuscript: Wei Zhou, Jonas B. Nielsen, Lars G. Fritsche, Maiken B. Elvestad, Brooke Wolford, Maoxuan Lin, Kristian Hveem, Hyun Min Kang, Goncalo R. Abecasis, Cristen J. Willer, Seunggeun Lee “Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.” Nature Genetics 50, 1335–1341 (2018)

The SAIGE-GENE pre-print: https://www.biorxiv.org/content/10.1101/583278v2

How to install and run SAIGE and SAIGE-GENE

Install SAIGE/SAIGE-GENE

List of dependencies:

Install SAIGE from conda

Warning: please do not use this bioconda version for bgen input. We are working on the issue.

r-saige latest_update

To install saige from conda simply create environment with latest version of R and saige:

conda create -n saige -c conda-forge -c bioconda "r-base>=4.0" r-saige
conda activate saige

More info on r-saige conda package and available versions can be found in the issue #272.

Install SAIGE using the conda environment

  1. Create a conda environment using (conda environment file) Here is a link to download the conda environment file

    After downloading environment-RSAIGE.yml, run following command

       conda env create -f environment-RSAIGE.yml
  2. Activate the conda environment RSAIGE

       conda activate RSAIGE
       FLAGPATH=`which python | sed 's|/bin/python$||'`
       export LDFLAGS="-L${FLAGPATH}/lib"
       export CPPFLAGS="-I${FLAGPATH}/include"

    Please make sure to set up the LDFLAGS and CPPFLAGS using export (the last two command lines), so libraries can be linked correctly when the SAIGE source code is compiled. Note: Here are the steps to create the conda environment file

  3. Open R, run following script to install the MetaSKAT R library.

       devtools::install_github("leeshawn/MetaSKAT") 
  4. Install SAIGE from the source code.

    Method 1:

       src_branch=master
       repo_src_url=https://github.com/weizhouUMICH/SAIGE
       git clone --depth 1 -b $src_branch $repo_src_url
    
       R CMD INSTALL --library=path_to_final_SAIGE_library SAIGE

    When call SAIGE in R, set lib.loc=path_to_final_SAIGE_library

       library(SAIGE, lib.loc=path_to_final_SAIGE_library)

    Method 2:

    Open R. Run

      devtools::install_github("weizhouUMICH/SAIGE")

Run SAIGE using a docker image

Thanks to Juha Karjalainen for sharing the Dockerfile. The docker image can be pulled

docker pull wzhou88/saige:0.45

Functions can be called

step1_fitNULLGLMM.R --help
step2_SPAtests.R --help
createSparseGRM.R --help

Run SAIGE for single-variant association tests and SAIGE-GENE for gene- or region-based tests

Here is a wiki page containg tutorial to run SAIGE and SAIGE-GENE https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE

Examples

Example data and script can be found in ./extdata. Run

bash cmd.sh

to run single-variant and gene-based association tests

extract effectize sample size v0.44.6.2)

  SAIGE_extractNeff.R --help
  bash cmd_extractNeff.sh

Notes before running jobs

FAQ can be found here

More notes

  1. Since the SPA test always provides close to 0 p-values for variants with MAC < 3, please use at least minMAC = 3 to filter out the results
  2. When query is used for bgen files, please make sure there are no duplicate SNP ids in the list
  3. If the error message "Error in setgeno(genofile, subSampleInGeno, memoryChunk) : vector::_M_range_check", try use a smaller memeoryChunk, such as 2
  4. IMPORTANT:In version <= 0.26, for binary traits, BETA is for alt allele and for quantitative traits, BETA is for minor allele
  5. Please note that LOCO only works for autosomal genetic variants. For non-autosomal genetic variants, please leave LOCO=FALSE in step 2.
  6. SAIGE-GENE 0.36.3 and 0.36.3.1 now output an effect size for burden tests with the option IsOutputBETASEinBurdenTest in step2. Please note that the magnitude of the effect size is difficult to interpret.
  7. We haven't throughly tested the program on a small sample size. All simulation studies were done using 10,000 samples. Similar to BOLT-LMM, SAIGE uses asymptotic approaches to for feasibility on large samples. Based on our previous real-data analysis, we saw the performance on 3,000 samples were fine.

UK Biobank GWAS Results

  1. The GWAS results for binary phenotypes in UK Biobank (1,283 phenotypes) using SAIGE are currently available for public download at

https://www.leelabsg.org/resources

Pheweb browser for the UK Biobank results

http://pheweb.sph.umich.edu/SAIGE-UKB/

*This research has been conducted using the UK Biobank Resource under application number 24460.

  1. The exome-wide gene-based association results for quantitative traits in UK Biobank (53 traits) using SAIGE-GENE are currently available for public download at

https://www.leelabsg.org/resources

*This research has been conducted using the UK Biobank Resource under application number 45227.

Log for fixing bugs

Bugs fixed: 1. The option weights.beta.common is not fully correctly developed, so we make weights.beta.common equal to weights.beta.rare for now. 2. Instead of output NA for SKAT-O p values when the function SKAT:::Met_SKAT_Get_Pvalue failed, output 2*min(SKAT p, Burden p, 0.5).

** Note: in v0.36.1, plain text dosage files are no longer allowed as input in step 2 to get rid of the dependence of the boost_iostream library

Bugs fixed: 1. fixed the freq calculation for mean impute for missing genotypes in plinkFile 2. Diagonal elements of GRM are now estimated using markers in plinkFile with MAF >= minMAFforGRM 3. Conditional analysis for gene- or region-based test for binary traits is now accounting for case-control imbalance 4. plain dosage files are no longer supported for step 2 so no external boost_iostream library is needed

minMAFforGRM is added as a parameter in step 0 and 1, so only markers in the plinkFile with MAF >= minMAFforGRM will be used for GRM weights.beta.rare, weights.beta.common, weightMAFcutoff, dosageZerodCutoff, IsOutputPvalueNAinGroupTestforBinary, IsAccountforCasecontrolImbalanceinGroupTest are added as new parameters in step 2