weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0
188 stars 73 forks source link

SAIGE (linear mixed model, quantitative) and BOLT-LMM yield different results when run on the same dataset #303

Closed futurologist closed 2 years ago

futurologist commented 3 years ago

Hello,

As a validation experiment, I have run the same GWAS of a quantitative phenotype derived from the UKBiobank, alongside the genomic data from the UKBiobank, once using the program BOLT-LMM and once using SAIGE linear mixed model (with selected quantitative trait tag). I wanted to see if the results would be comparable.

I however encountered a significantly lower p-values from the SAIGE summary statistics than from the corresponding BOLT ones. In particular, I had many snp loci along the genome that were significant (with 10^(-8) level of significance) according to BOLT but were not significant (relative to the same level) according to SAIGE.

My question is, why might there be such discrepancy (e.g. what have I done wrong) and what is the proper way to set up a liner mixed model GWAS run in SAIGE (or alternatively how to set up its BOLT counterpart) in order for the results of both BOLT and SAIGE to be comparable?

I have included the input tags that I have used in my runs with BOLT and SAIGE (Step1 and Step2), in case this is useful. The (bed, bim, fam), (bgen, bgen-index, sample) as well as (phenotype and covariates table, phenotype and covariate columns) used in both scripts are the same.

BOLT:

/BOLT-LMM_v2.3.4/bolt 
--bed=  --bim=  --fam=  --remove= 
--phenoFile=  --phenoCol=  --covarFile=  --qCovarCol=  --qCovarCol=  --covarCol= \
--LDscoresMatchBp --maxMissingPerIndiv 1 --lmm \
--LDscoresFile=LDSCORE.1000G_EUR.tab.gz --geneticMapFile=genetic_map_hg19.txt.gz \
--numThreads=32 --bgenFile=1.bgen --sampleFile=1.sample \
--statsFile=   --statsFileBgenSnps=

SAIGE Step 1:

/SAIGE/SAIGE-0.35.8.3/extdata/step1_fitNULLGLMM.R \
--plinkFile=  --phenoFile=  --sampleIDColinphenoFile= --phenoCol=  \
--traitType=quantitative --invNormalize=TRUE \
--covarColList=  --outputPrefix=  --nThreads=32 --LOCO=FALSE --tauInit=1,0

SAIGE Step 2:

/SAIGE/SAIGE-0.35.8.3/extdata/step2_SPAtests.R --minMAF=  --minMAC=  \ 
--bgenFile=  --bgenFileIndex=  --sampleFile= \
--GMMATmodelFile=  --varianceRatioFile= \
--SAIGEOutputFile=  --numLinesOutput=2 --IsDropMissingDosages=FALSE --LOCO=FALSE
ConnieXuhm commented 3 years ago

hi @futurologist , I contact you for advice about the data conversion, since my SPAGMMAT step always meet the error. I want to know which conversion method do you use for the bed/bim/fam to bgen, or bgen to bed/bim/fam maybe ? So much appreciate for your reply, thanks!

futurologist commented 3 years ago

Hi @ConnieXuhm , I use the genomic data provided to me by the UKBiobank. The data contains a set of bed/bim/fam files of genotyped snps for each chromosome and a set of bgen/sample/bgen.bgi files of imputed snps. The number of all snps in the bed/bim/fam is a bit less than 800 000, while the number of snps in the bgen/sample/bgn.bgi is over 90 000 000.

If you have genotyped (bed) and imputed data (bgen, dosage file), do not convert anything, simply use genotyped for step 1 and imputed for step 2. If you have only genotyped data, then you can use PLINK2 to convert your bed files into bgen files. See for example

http://www.cog-genomics.org/plink/2.0/data#export

The syntax is something like

plink2 --bfile bed_file_prefix --export bgen-1.2 --out bgen_file_prefix

This will create create a bgen and a sample file. You would need a bgen.bgi file too, which can be created by another tool, bgenix:

https://enkre.net/cgi-bin/code/bgen/wiki?name=bgenix

ConnieXuhm commented 3 years ago

Many thanks to @futurologist !! I received the genotyped data by the company, and did the imputation myself using SHAPEIT and IMPUTE2. Previously I used the IMPUTE2 output (.gen format) for step 1 and step2. By your suggestion, I will try only put the imputed file into step 2. And maybe bgenix can convert .gen to .bgen? Thanks again!

futurologist commented 3 years ago

hi @ConnieXuhm,

You may now this, but just in case, SAIGE has a tutorial that explains what data to use with step 1 and step 2. Here is the link.

https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE

I am not sure if bgenix can convert gen to bgen format. From what I understand, it generates mainly the bgen.bgi index files from bgens (you need bgen.bgi in step 2). But there is another program, called qctool v2

https://www.well.ox.ac.uk/~gav/qctool_v2/

that I think can convert gen to bgen. You can also if plink2 can convert gen to bgen, although I am not sure about that.

The whole idea is that step 1 of SAIGE calculates the genetic relationship matrix of the mixed model and so you use the smaller genotyped files (you could even filter them a bit to speed up the calculation) for step 1, while the actual calculation of the association coefficients and the significance of each individual snp from the imputed data is performed during step 2, where the files used are the imputed bgen files.

ConnieXuhm commented 3 years ago

Thanks @futurologist! I know deeper about SAIGE by your explanation, and I will try the conversion in step2 using qctool. You help me a lot ! >..<

weizhouUMICH commented 3 years ago

Hi @futurologist,

Based on our previous experiments, for quantitative traits, the p-values from SAIGE and BOLT are more consistent than you observed. My doubt is that the sample file formats used in BOLT and SAIGE are different. In SAIGE, the sample file does not need a header. Can you please check? This can mess up the sample ID reading and generate random signals.

Thanks, Wei

futurologist commented 3 years ago

Hi @weizhouUMICH ,

Yes, I am aware of the no-header requirement for the bgen sample file when used with SAIGE. I confirm that the sample file I use for SAIGE does not have a header. Moreover, I have run several other runs with binary phenotypes using SAIGE already. What I did however, was to set up a test, using a selected list of snps comparing BOLT, SAIGE and a simple linear regerssion (i.e. no random effects included) with the same phenotype. The results from BOLT and the simple linear regression were fairly close but the the results from SAIGE were a bit different. In particular, the p-values of SAIGE were somewhat higher than the p-value from BOLT and the simple linear regression. Basically with BOLT I get more snps with higher significance than with SAIGE, i.e. the latter seems to be more conservative than the former (at least for my phenotype).

I guess, my question was, are there specific input flags I have to add to my SAIGE script in order to get relatively close results to the ones from BOLT.

Best

weizhouUMICH commented 3 years ago

Hi @futurologist,

The main difference we've noticed between BOLT-LMM and SAIGE is the leave-one-chromosome-out setting. In BOLT, LOCO=TRUE by default, while in previous SAIGE version, LOCO=FALSE by default. If you try LOCO=TRUE in SAIGE, the association results for quantitative traits should be more closer to BOLT.

Thanks, Wei