weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0
188 stars 73 forks source link

Empty Outputs for Continuous Trait Analysis #316

Closed bensesbg closed 2 years ago

bensesbg commented 3 years ago

Greetings! I was wondering if you might be able to help resolve an issue where we are attempting a continuous trait analysis in a large scale population (200k+) WES VCF for a single chromosome. The analysis appears to generate an empty output VCF that contains only the file header. Here is the output associated with the Step 2 of the analysis:

R version 3.6.3 (2020-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.1 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale: [1] C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] SAIGE_0.43.3

loaded via a namespace (and not attached): [1] compiler_3.6.3 Matrix_1.2-18 Rcpp_1.0.5 grid_3.6.3
[5] RcppParallel_5.0.2 lattice_0.20-40

$vcfFile [1] "chr19.vcf.gz"

$vcfFileIndex [1] "chr19.vcf.gz.tbi" [1] "sampleID"

$GMMATmodelFile [1] "chr19.rda"

$varianceRatioFile [1] "chr19.varianceRatio.txt"

$SAIGEOutputFile [1] "chr19_SingleVar.txt"

$numLinesOutput [1] 10000

$IsSparse [1] TRUE

$SPAcutoff [1] 2

$IsOutputAFinCaseCtrl [1] FALSE

$IsOutputNinCaseCtrl [1] FALSE

$IsOutputHetHomCountsinCaseCtrl [1] FALSE

$LOCO [1] FALSE

$condition [1] ""

$sparseSigmaFile [1] "chr19.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx" 2021-02-06T01:54:38.109013502Z [W::hts_idx_load2] The index file is older than the data file: chr19.vcf.gz.tbi 2021-02-06T01:54:38.218467163Z [W::hts_idx_load2] The index file is older than the data file: chr19.vcf.gz.tbi Open VCF done To read the field GT Number of meta lines in the vcf file (lines starting with ##): 46 Number of samples in in the vcf file: 200643 [1] 200643 4 [1] "IID" "IndexInModel" "IndexDose.x" "IndexDose.y" 148149 samples were used in fitting the NULL glmm model and are found in sample file sparse kinship matrix is going to be used sparseSigmaFile: chr19.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx Missing dosages will be mean imputed for the analysis Analysis started at 1612576480 Seconds As minMAC is set to be 0, minMAC = 0.5 will be used minMAC: 0.5 minMAF: 0 Minimum MAF of markers to be tested is 1.68749e-06 It is a quantitative trait isCondition is FALSE Analysis started at 1612576480 Seconds user system elapsed 4.660 0.424 4.743 [1] 1 numPassMarker: 0 closed the genofile! Analysis ended at 1612576480 Seconds Analysis took 0.005542517 Seconds

The commands for generating the GLMM model were as follows:

Rscript step1_fitNULLGLMM.R --plinkFile=chr19 --phenoFile=pheno.csv --phenoCol=pheno --traitType=quantitative --invNormalize=TRUE --covarColList=sex,age,PC1-10 --sampleIDColinphenoFile=EID --nThreads=64 --LOCO=FALSE --outputPrefix=./chr19 --IsSparseKin=TRUE --sparseGRMFile=relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx --sparseGRMSampleIDFile=relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt --isCateVarianceRatio=TRUE

If it would be helpful, I can try to obtain the logs from the first step. However, the model worked successfully for analyzing the same VCF with a binary trait.

We did encounter a similar issue with a test dataset that we thought might be related to missing genotypes, however we were able to isolate it as related to discrepancies between the REF/ALT coding between the VCF file and the group information file. In that case the VCF had the following format: GENE1 1:4_A/C 1:9_A/C 1:10_A/C 1:11_A/C

And the group file was coded as follows: GENE1 1:4_1/2 1:9_1/2 1:10_1/2 1:11_1/2

However, we don't seem to be observing anything like this in the run described above.

The VCF was produced by preprocessing a WES VCF into chromosome chunks with Bcftools and PLINK2.For this, genotypes were masked (set to missing, “./.“) based on GQ and DP FORMAT field values, all FORMAT fields other than GT were stripped, multi-allelic sites were split and realigned (which appears to have introduced a few duplicate variants), some variants were removed based on GT missingness/HWE cutoffs, and the INFO fields were recalculated.

There were 13 variants created in the realignment throughout the QCed VCF chunks, but none of them were in the problematic chr19 chunk.

We are wondering if there is any information we might be missing here, or something you see in the logs that could provide a better clue as to what is causing the issue? Thank you very much for your assistance.

weizhouUMICH commented 3 years ago

Hi @bensesbg,

Can you please check if you have specified the --chr for the vcf file? the string needs to be matched to the CHROM column in the vcf file.
Thanks, Wei

bensesbg commented 3 years ago

@weizhouUMICH I don't think this is the issue since we were able to obtain successful results in other runs. Is my understanding correct that if the --chrom= is not specified then the entire VCF will be used? Because the input VCF just contains variants from chr19.

weizhouUMICH commented 2 years ago

Sorry fo the late reply! We have just released a new version 1.0.0. It has substantial computational efficiency improvements for both Step 1 and Step 2 for single-variant and set-based tests and clearer log output. We have created a new program github page https://github.com/saigegit/SAIGE with the documentation provided https://saigegit.github.io/SAIGE-doc/ The program will be maintained by multiple SAIGE developers there. The docker image has been updated. Please feel free to try the version 1.0.0 and report issues if any.

Thanks! Wei