weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0
187 stars 72 forks source link

step2_SPAtests.R doesn't analyze SNPs that are expected to be analyzed #400

Closed gurinovich closed 2 years ago

gurinovich commented 2 years ago

Hello, We are performing a SVT on a binary trait using SAIGE 0.44.6.4 using whole-genome sequencing data. In the second step (step2_SPAtests.R) we use a VCF file as an input and run analyses per chromosome. SAIGE parses all the SNPs in the VCF file (722186 SNPs) as you can see in the snippet of the log below, but it excludes some SNPs (not just the ones with low MAF/MAC) and we cannot identify why. We are working on the comparison of SAIGE and GENESIS and GENESIS does analyze the SNPs that are missing from SAIGE’s output as expected. For the SNPs that are being analyzed and returned in the SAIGEOutputFile, the association results appear to be correct. Here is the script used to run the analyses: Rscript step2_SPAtests.R \ --vcfFile=LLFS.WGS.freeze4.chr21.BI-SNP.vcf.gz \ --vcfFileIndex=LLFS.WGS.freeze4.chr$chr.BI-SNP.vcf.gz.csi \ --vcfField=GT \ --chrom=chr21 \ --LOCO=FALSE \ --GMMATmodelFile=output/GWAS_EL.rda \ --varianceRatioFile=output/GWAS_EL.varianceRatio.txt \ --SAIGEOutputFile=GWAS_EL_SAIGE_per_chr/chr21.vcf.genotype.txt \ --numLinesOutput=2 \ --IsOutputAFinCaseCtrl=TRUE \ --IsOutputNinCaseCtrl=TRUE \ --IsOutputHetHomCountsinCaseCtrl=TRUE \ --IsDropMissingDosages=TRUE

Here is the beginning (and end) of the log for step2_SPAtests.R run on chromosome 21: R version 4.0.5 (2021-03-31) Platform: x86_64-conda-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /rprojectnb2/necs/agurinov/.conda/envs/saige/lib/libopenblasp-r0.3.18.so

locale: [1] C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] SAIGE_0.44.6.4

loaded via a namespace (and not attached): [1] compiler_4.0.5 Matrix_1.3-4 Rcpp_1.0.7 grid_4.0.5
[5] RcppParallel_5.1.4 lattice_0.20-45
$vcfFile [1] "/rprojectnb2/llfs/LLFS_WGS_04.2021/analysis/Ana/VCF_by_chr_id/LLFS.WGS.freeze4.chr21.BI-SNP.vcf.gz"

$vcfFileIndex [1] "/rprojectnb2/llfs/LLFS_WGS_04.2021/analysis/Ana/VCF_by_chr_id/LLFS.WGS.freeze4.chr21.BI-SNP.vcf.gz.csi"

$vcfField [1] "GT"

$bgenFile [1] ""

$bgenFileIndex [1] ""

$savFile [1] ""

$savFileIndex [1] ""

$idstoExcludeFile [1] ""

$idstoIncludeFile [1] ""

$rangestoExcludeFile [1] ""

$rangestoIncludeFile [1] ""

$chrom [1] "chr21"

$start [1] 1

$end [1] 2.5e+08

$IsDropMissingDosages [1] TRUE

$minMAF [1] 0

$minMAC [1] 0

$maxMAFforGroupTest [1] 0.5

$minInfo [1] 0

$sampleFile [1] ""

$GMMATmodelFile [1] "/restricted/projectnb/llfs/LLFS_WGS_04.2021/analysis/Nastia/SVT/output/GWAS_EL.rda"

$varianceRatioFile [1] "/restricted/projectnb/llfs/LLFS_WGS_04.2021/analysis/Nastia/SVT/output/GWAS_EL.varianceRatio.txt"

$SAIGEOutputFile [1] "/restricted/projectnb/llfs/LLFS_WGS_04.2021/analysis/Nastia/SVT/GWAS_EL_SAIGE_per_chr_updated/chr21.vcf.genotype.txt"

$numLinesOutput [1] 2

$IsSparse [1] TRUE

$SPAcutoff [1] 2

$IsOutputAFinCaseCtrl [1] TRUE

$IsOutputNinCaseCtrl [1] TRUE

$IsOutputHetHomCountsinCaseCtrl [1] TRUE

$LOCO [1] FALSE

$condition [1] ""

$sparseSigmaFile [1] ""

$groupFile [1] ""

$kernel [1] "linear.weighted"

$method [1] "optimal.adj"

$weights.beta.rare [1] "1,25"

$weights.beta.common [1] "1,25"

$weightMAFcutoff [1] 0.01

$r.corr [1] "0"

$IsSingleVarinGroupTest [1] FALSE

$IsOutputMAFinCaseCtrlinGroupTest [1] FALSE

$cateVarRatioMinMACVecExclude [1] "0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5"

$cateVarRatioMaxMACVecInclude [1] "1.5,2.5,3.5,4.5,5.5,10.5,20.5"

$dosageZerodCutoff [1] 0.2

$IsOutputPvalueNAinGroupTestforBinary [1] FALSE

$IsAccountforCasecontrolImbalanceinGroupTest [1] TRUE

$weightsIncludeinGroupFile [1] FALSE

$IsOutputBETASEinBurdenTest [1] FALSE

$IsOutputlogPforSingle [1] FALSE

$sampleFile_male [1] ""

$X_PARregion [1] ""

$is_rewrite_XnonPAR_forMales [1] FALSE

$method_to_CollapseUltraRare [1] "absence_or_presence"

$MACCutoff_to_CollapseUltraRare [1] 10

$DosageCutoff_for_UltraRarePresence [1] 0.5

$help [1] FALSE

weights.beta.rare is 1 25 weights.beta.common is 1 25 cateVarRatioMinMACVecExclude is 0.5 1.5 2.5 3.5 4.5 5.5 10.5 20.5 cateVarRatioMaxMACVecInclude is 1.5 2.5 3.5 4.5 5.5 10.5 20.5 single-variant association test will be performed Garbage collection 14 = 8+2+4 (level 2) ... 80.3 Mbytes of cons cells used (54%) 20.0 Mbytes of vectors used (31%) 1164 samples have been used to fit the glmm null model [1] "Leave-one-chromosome-out is not applied" Single variance ratio is provided, so categorical variance ratio won't be used! variance Ratio is 0.8627349 isCondition is FALSE Setting genomic region chr21:1:250000000 Open VCF done To read the field GT Number of meta lines in the vcf file (lines starting with ##): 3426 Number of samples in in the vcf file: 3626 3626 sample IDs are found in the vcf file [1] 3626 4 [1] "IID" "IndexInModel" "IndexDose.x" "IndexDose.y" 1164 samples were used in fitting the NULL glmm model and are found in sample file sparse kinship matrix is not used Samples with missing dosages will be dropped from the analysis Analysis started at 1643432216 Seconds As minMAC is set to be 0, minMAC = 0.5 will be used minMAC: 0.5 minMAF: 0 Minimum MAF of markers to be tested is 0.0002147766 It is a binary trait Analyzing 377 cases and 787 controls isCondition is FALSE Analysis started at 1643432216 Seconds Removing 12 samples with missing dosages/genotypes Removing 14 samples with missing dosages/genotypes user system elapsed 1.473 0.195 2.222 [1] 2 numPassMarker: 2 Removing 66 samples with missing dosages/genotypes Removing 55 samples with missing dosages/genotypes user system elapsed 1.482 0.196 2.233 [1] 4 numPassMarker: 4 …. …. Removing 76 samples with missing dosages/genotypes Removing 89 samples with missing dosages/genotypes user system elapsed 2195.631 41.701 2439.301 [1] 722186 numPassMarker: 349330 closed the genofile! Analysis ended at 1643434653 Seconds Analysis took 2437.103 Seconds

gurinovich commented 2 years ago

We also re-ran it with the newest SAIGE version (SAIGE_0.45) and the results are the same. Please, advise. Thank you

weizhouUMICH commented 2 years ago

Hi @gurinovich,

Sorry fo the late reply! We have just released a new version 1.0.0. It has computational efficiency improvements for both Step 1 and Step 2 for single-variant and set-based tests. We have created a new program github page https://github.com/saigegit/SAIGE with the documentation provided https://saigegit.github.io/SAIGE-doc/ The program will be maintained by multiple SAIGE developers there. Please feel free to try the version 1.0.0 and report issues if any.

Thanks! Wei