weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0
187 stars 72 forks source link

issue with SAIGE-GENE region based test step 2 #191

Closed laisktrn closed 4 years ago

laisktrn commented 4 years ago

Dear Wei,

I really appreciate your work with SAIGE, it's a great tool!

I'm trying to run gene-based analysis (missense variants for each gene filtered for imputation INFO scores), but receive an error for some specific chromosomes/specific genes. For example, the error for chr19 is:

geneID: FUT6 std::size_t sample_size = marker_file.samples().size();154201 cntMarker: 4 indexforMissing:
isCondition is FALSE weights: 24.9981 24.9981 24.9981 24.8652 Error in GratioMatrixall[1:m, 1:m] : subscript out of bounds Calls: SPAGMMATtest ... groupTest -> system.time -> SAIGE_SKAT_withRatioVec Timing stopped at: 0.8 0.346 0.746 Execution halted

So it seems to me that there is something with the markers in this gene that causes the program to stop at this point (no problems with previous genes). I have the following options in use:

module load broadwell/SAIGE/0.38

Rscript /gpfs/hpc/home/a30698/SCRIPTS/for_saige_038/step2_SPAtests.R \ --vcfFile=chr19.vcf.gz \ --vcfFileIndex=chr19.vcf.gz.tbi \ --chrom=19 \ --vcfField=DS \ --maxMAFforGroupTest=0.01 \ --sampleFile=imputed_samples_on_correct_order_for_saige.txt \ --GMMATmodelFile=rheuma_gene_nullglmm.rda \ --groupFile=chr19.missense.info_0.8.grp \ --varianceRatioFile=rheuma_gene_cate.varianceRatio.txt \ --sparseSigmaFile=rheuma_gene_cate.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx \ --cateVarRatioMinMACVecExclude=1,3,5,10,15,20 \ --cateVarRatioMaxMACVecInclude=3,5,10,15,20 \ --SAIGEOutputFile=chr19_missense_info08_test180520.txt \ --IsSingleVarinGroupTest=FALSE \

Some other things probably worth mentioning: had the same error with version 0.36.3 other chromosomes run without problems the same problem appears with several phenotypes the group file for this specific gene looks like this: FUT6 19:5831525_G/A 19:5831576_C/T 19:5831602_C/T 19:5831840_C/T 19:5832209_G/A 19:5832214_C/G 19:5832238_T/G 19:5832245_C/T 19:5832562_G/A all works fine if instead I use --vcfField=GT (Unfortunately we would like to use DS field since it's imputed data)

Here's the beginning of the log: R version 3.6.1 (2019-07-05) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /gpfs/software/soft/manual/broadwell/SAIGE/0.38/lib/R/lib/libRblas.so

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] SAIGE_0.38

loaded via a namespace (and not attached): [1] compiler_3.6.1 Matrix_1.2-17 Rcpp_1.0.1 grid_3.6.1
[5] RcppParallel_4.4.4 lattice_0.20-41
$vcfFile [1] "chr19.vcf.gz"

$vcfFileIndex [1] "chr19.vcf.gz.tbi"

$vcfField [1] "DS"

$bgenFile [1] ""

$bgenFileIndex [1] ""

$savFile [1] ""

$savFileIndex [1] ""

$idstoExcludeFile [1] ""

$idstoIncludeFile [1] ""

$rangestoExcludeFile [1] ""

$rangestoIncludeFile [1] ""

$chrom [1] "19"

$start [1] 1

$end [1] 2.5e+08

$IsDropMissingDosages [1] FALSE

$minMAF [1] 0

$minMAC [1] 0

$maxMAFforGroupTest [1] 0.01

$minInfo [1] 0

$sampleFile [1] "imputed_samples_on_correct_order_for_saige.txt"

$GMMATmodelFile [1] "rheuma_gene_nullglmm.rda"

$varianceRatioFile [1] "rheuma_gene_cate.varianceRatio.txt"

$SAIGEOutputFile [1] "chr19_missense_info08_test180520.txt"

$numLinesOutput [1] 10000

$IsSparse [1] TRUE

$SPAcutoff [1] 2

$IsOutputAFinCaseCtrl [1] FALSE

$IsOutputNinCaseCtrl [1] FALSE

$IsOutputHetHomCountsinCaseCtrl [1] FALSE

$LOCO [1] FALSE

$condition [1] ""

$sparseSigmaFile [1] "rheuma_gene_cate.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx"

$groupFile [1] "chr19.missense.info_0.8.grp"

$kernel [1] "linear.weighted"

$method [1] "optimal.adj"

$weights.beta.rare [1] "1,25"

$weights.beta.common [1] "1,25"

$weightMAFcutoff [1] 0.01

$r.corr [1] "0"

$IsSingleVarinGroupTest [1] FALSE

$cateVarRatioMinMACVecExclude [1] "1,3,5,10,15,20"

$cateVarRatioMaxMACVecInclude [1] "3,5,10,15,20"

$dosageZerodCutoff [1] 0.2

$IsOutputPvalueNAinGroupTestforBinary [1] FALSE

$IsAccountforCasecontrolImbalanceinGroupTest [1] TRUE

$weightsIncludeinGroupFile [1] FALSE

$IsOutputBETASEinBurdenTest [1] FALSE

$help [1] FALSE

weights.beta.rare is 1 25 weights.beta.common is 1 25 cateVarRatioMinMACVecExclude is 1 3 5 10 15 20 cateVarRatioMaxMACVecInclude is 3 5 10 15 20 group-based test will be performed Any dosages <= 0.2 for genetic variants with MAC <= 10 are set to be 0 in group tests 138584 samples have been used to fit the glmm null model obj.glmm.null$LOCO: FALSE Leave-one-chromosome-out option is not applied variance Ratio is 1.001349 1.001309 1.001213 1.001075 1.000969 0.9722558 isCondition is FALSE Open VCF done To read the field DS Number of meta lines in the vcf file (lines starting with ##): 13 Number of samples in the vcf file: 154201 154201 sample IDs are found in the vcf file [1] 154201 4 [1] "IID" "IndexInModel" "IndexDose.x" "IndexDose.y" 138584 samples were used in fitting the NULL glmm model and are found in sample file sparse kinship matrix is going to be used sparseSigmaFile: rheuma_gene_cate.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx Missing dosages will be mean imputed for the analysis Analysis started at 1589805190 Seconds As minMAC is set to be 0, minMAC = 0.5 will be used minMAC: 0.5 minMAF: 0 Minimum MAF of markers to be tested is 1.80396e-06 It is a binary trait Analyzing 8061 cases and 130523 controls isCondition is FALSE Analysis started at 1589805190 Seconds genetic variants with 1.80396e-06 <= MAF <= 0.01 are included for gene-based tests It is a binary trait Case-control imbalance is adjusted for binary traits. isCondition is FALSE

Please let me know if you need some other information!

Best wishes Triin

weizhouUMICH commented 4 years ago

Hi Trin,

Thank you for using SAIGE and reporting this issue! I think the error is because of some extremely rare variants with 0.5 <= MAC <= 1. The variance ratios were only estimated for variants with MAC > 1 based on the parameters used below

--cateVarRatioMinMACVecExclude=1,3,5,10,15,20 --cateVarRatioMaxMACVecInclude=3,5,10,15,20

For those variants with 0.5 <= MAC <= 1, there is not variance ratio for them to use.

You may increase in the minMAC or minMAF to exclude those markers from the tests or use --cateVarRatioMinMACVecExclude=0.5,3,5,10,15,20 instead to estimate the variance ratios.

Thanks, Wei

laisktrn commented 4 years ago

Hi Wei,

thanks, that worked nicely!

Best wishes Triin