weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0
187 stars 72 forks source link

Unexpected behavior of SAIGE-GENE+ in step 2 (a possible bug?) #363

Closed Ojami closed 2 years ago

Ojami commented 3 years ago

I was trying to perform SAIGE-GENE (0.44.6.1) step 2 with VCF files with the same samples used in constructing the GRM and null fitting. The results were not consistent with what I already knew from running SKAT (in R). Strictly speaking, in step 2 number of markers were lower than expected. So, I repeated the same analysis this time using a VCF file with all available samples (not QCed). This time the results were comparable. I post the log of these two scenarios (truncated) below.

scenario 1: QCed VCF (same samples used for steps 0 and 1)

SPAGMMATtest(vcfFile="QCed.vcf.gz",vcfFileIndex="QCed.vcf.gz.csi",vcfField="GT",sampleFile="step2.sampleFile.txt",chrom="4",GMMATmodelFile="step1_nullmodel.rda",varianceRatioFile="step1_nullmodel.varianceRatio.txt",SAIGEOutputFile="QCed.SPA.txt",sparseSigmaFile="step1_nullmodel.varianceRatio.txt_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx",groupFile="geneset.1.tmpg.txt",IsDropMissingDosages=FALSE,IsOutputAFinCaseCtrl=TRUE,IsOutputNinCaseCtrl=TRUE,LOCO=FALSE,IsSingleVarinGroupTest=FALSE,IsOutputBETASEinBurdenTest=TRUE,minMAC=0.5,minMAF=0,maxMAFforGroupTest=0.01,minInfo=0.78,MACCutoff_to_CollapseUltraRare=10)
group-based test will be performed
Any dosages <=  0.2  for genetic variants with MAC <= 10 are set to be 0 in group tests
Garbage collection 13 = 7+2+4 (level 2) ... 
75.1 Mbytes of cons cells used (56%)
30.8 Mbytes of vectors used (48%)
21241  samples have been used to fit the glmm null model
[1] "Leave-one-chromosome-out is not applied"
variance Ratio is  1.005161 1.005213 1.005098 1.004881 1.004657 1.004939 1.004496 0.982269 
177690  sample IDs are found in sample file
isCondition is  FALSE 
Open VCF done
To read the field GT
Number of meta lines in the vcf file (lines starting with ##): 6
Number of samples in the vcf file: 177690
[1] 177690      4
[1] "IID"          "IndexInModel" "IndexDose.x"  "IndexDose.y" 
21241  samples were used in fitting the NULL glmm model and are found in sample file
sparse kinship matrix is going to be used
sparseSigmaFile:  step1_nullmodel.varianceRatio.txt_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx 
Missing dosages will be mean imputed for the analysis
Analysis started at  1628418386 Seconds
minMAC:  0.5 
minMAF:  0 
Minimum MAF of markers to be tested is  1.176969e-05 
It is a quantitative trait
isCondition is  FALSE 
Analysis started at  1628418386 Seconds
genetic variants with  1.176969e-05 <= MAF <=  0.01 are included for gene-based tests
It is a quantitative trait
Ultra rare variants with MAC <=  10  will be collpased for set-based tests in the 'absence or presence' way.  For the resulted collpased marker, any individual having  0.5 <= dosage <  1.5  for any ultra rare variant has 1 in the genotype vector, having dosage >=  1.5  for any ultra rare variant has 2 in the genotype vector, otherwise 0. 
isCondition is  FALSE 
geneID:  MTTP_ENSG00000138823 
std::size_t sample_size = marker_file.samples().size();177690
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
cntMarker:  6 
isCondition is  FALSE 
weights:  24.83107 
[1] "DEBUG1"
[1] "DEBUG2"
time for SAIGE_SKAT_withRatioVec
   user  system elapsed 
  0.031   0.063   0.105 
[1] "OK1"
closed the genofile!

and results:

Gene Pvalue Nmarker_MACCate_1 Nmarker_MACCate_2 Nmarker_MACCate_3 Nmarker_MACCate_4 Nmarker_MACCate_5 Nmarker_MACCate_6 Nmarker_MACCate_7 Nmarker_MACCate_8 markerIDs markerAFs Pvalue_Burden Pvalue_SKAT BETA_Burden SE_Burden
MTTP_ENSG00000138823 0.728893298645835 0 0 0 0 0 0 1 0 ultra_rare_collpase_4:99581983_GC/G_4:99589664_C/CA_4:99591635_CTATT/C_4:99591650_G/C_4:99591734_AAC/A_4:99611241_G/A 0.000282472576620686 NA NA NA NA

scenario 2: Full VCF (all samples)

SPAGMMATtest(vcfFile="full.vcf.gz",vcfFileIndex="full.vcf.gz.csi",vcfField="GT",sampleFile="step2.sampleFile.txt",chrom="4",GMMATmodelFile="step1_nullmodel.rda",varianceRatioFile="step1_nullmodel.varianceRatio.txt",SAIGEOutputFile="full.SPA.txt",sparseSigmaFile="step1_nullmodel.varianceRatio.txt_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx",groupFile="geneset.1.tmpg.txt",IsDropMissingDosages=FALSE,IsOutputAFinCaseCtrl=TRUE,IsOutputNinCaseCtrl=TRUE,LOCO=FALSE,IsSingleVarinGroupTest=FALSE,IsOutputBETASEinBurdenTest=TRUE,minMAC=0.5,minMAF=0,maxMAFforGroupTest=0.01,minInfo=0.78,MACCutoff_to_CollapseUltraRare=10)
group-based test will be performed
Any dosages <=  0.2  for genetic variants with MAC <= 10 are set to be 0 in group tests
Garbage collection 13 = 7+2+4 (level 2) ... 
75.1 Mbytes of cons cells used (56%)
30.8 Mbytes of vectors used (48%)
21241  samples have been used to fit the glmm null model
[1] "Leave-one-chromosome-out is not applied"
variance Ratio is  1.005161 1.005213 1.005098 1.004881 1.004657 1.004939 1.004496 0.982269 
200643  sample IDs are found in sample file
isCondition is  FALSE 
Open VCF done
To read the field GT
Number of meta lines in the vcf file (lines starting with ##): 6
Number of samples in the vcf file: 200643
[1] 200643      4
[1] "IID"          "IndexInModel" "IndexDose.x"  "IndexDose.y" 
21241  samples were used in fitting the NULL glmm model and are found in sample file
sparse kinship matrix is going to be used
sparseSigmaFile:  step1_nullmodel.varianceRatio.txt_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx 
Missing dosages will be mean imputed for the analysis
Analysis started at  1628418167 Seconds
minMAC:  0.5 
minMAF:  0 
Minimum MAF of markers to be tested is  1.176969e-05 
It is a quantitative trait
isCondition is  FALSE 
Analysis started at  1628418167 Seconds
genetic variants with  1.176969e-05 <= MAF <=  0.01 are included for gene-based tests
It is a quantitative trait
Ultra rare variants with MAC <=  10  will be collpased for set-based tests in the 'absence or presence' way.  For the resulted collpased marker, any individual having  0.5 <= dosage <  1.5  for any ultra rare variant has 1 in the genotype vector, having dosage >=  1.5  for any ultra rare variant has 2 in the genotype vector, otherwise 0. 
isCondition is  FALSE 
geneID:  MTTP_ENSG00000138823 
std::size_t sample_size = marker_file.samples().size();200643
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
missing_cnt > 0!
cntMarker:  13 
isCondition is  FALSE 
weights:  24.62144 
[1] "DEBUG1"
[1] "DEBUG2"
time for SAIGE_SKAT_withRatioVec
   user  system elapsed 
  0.047   0.047   0.111 
[1] "OK1"
closed the genofile!

and results:

Gene Pvalue Nmarker_MACCate_1 Nmarker_MACCate_2 Nmarker_MACCate_3 Nmarker_MACCate_4 Nmarker_MACCate_5 Nmarker_MACCate_6 Nmarker_MACCate_7 Nmarker_MACCate_8 markerIDs markerAFs Pvalue_Burden Pvalue_SKAT BETA_Burden SE_Burden
MTTP_ENSG00000138823 1.57717482806028e-05 0 0 0 0 0 0 0 1 ultra_rare_collpase_4:99564186_CG/C_4:99564195_A/ACC_4:99574972_T/C_4:99581983_GC/G_4:99589664_C/CA_4:99591353_T/C_4:99591635_CTATT/C_4:99591650_G/C_4:99597081_G/A_4:99600734_G/A_4:99601691_C/T_4:99601716_T/A_4:99611241_G/A 0.000635563297396544 NA NA NA NA

Please note that the difference between number of markers in two scenarios is not due to 0 MAF, meaning those 13 markers in scenario 2 also have a non zero MAF in scenario 1. On a different note, SAIGE-GENE+ does not generate summary stats for SKAT and burden (all NaN). This is a direct consequence of MACCutoff_to_CollapseUltraRare cutoff value (lowering this value would force SAIGE to generate summary stats for SKAT and burden as well).

Appreciate any thoughts in advance. OJ

weizhouUMICH commented 2 years ago

Hi @Ojami,

Thanks for all the information! When only QC'ed samples were used, are all 13 markers having MAF > 0. Do they have MAF > 1.176969e-05 (MAC > 0.5)? It seems only 6 of them passed the cutoff 1.176969e-05 <= MAF <= 0.01 and all of them have MAC <= 10.

When there is only 1 marker, only one p-value is output because it does not make sense to conduct different set-based tests. When lowering MACCutoff_to_CollapseUltraRare, not all ultra-rare markers are collapsed and there will be more than one marker, so you will see p-values for SKAT and Burden tests.

Thanks, Wei

Ojami commented 2 years ago

Hi @weizhouUMICH

Thanks for your response. Please note that

21241 samples have been used to fit the glmm null model

and 21241 is a subset of both full (200643) and QCed (177690) samples, so both analyses should end up in the same results, simply because 21241 samples have been used for null model. But SAIGE computes MAFs using VCF samples: std::size_t sample_size = marker_file.samples().size();177690 Shouldn't MAFs correspond for samples used in step 1 (null)? because after all, these samples ( 21241 in my example) will be used for gene-/region-based tests.

Thanks! Oveis

weizhouUMICH commented 2 years ago

Hi @Ojami,

Yes, You are right. The MAFs should always be calculated for samples used in step 1. Let me try to replicate the issue.

Thanks, Wei

weizhouUMICH commented 2 years ago

Hi @Ojami,

Sorry for my very late reply! We have just released a new version 1.0.0. It has computational efficiency improvements for both Step 1 and Step 2 for single-variant and set-based tests. We have created a new program github page https://github.com/saigegit/SAIGE with the documentation provided https://saigegit.github.io/SAIGE-doc/ The program will be maintained by multiple SAIGE developers there. The docker image has been updated. Please feel free to try the version 1.0.0 and report issues if any.

Thanks! Wei