Closed IoannaTach closed 4 years ago
In the end the vcf file with hard called genotypes did not work either, as SAIGE again does not like missing data. What would you suggest please?
Hi @IoannaTach, Thank you for your interest in using SAIGE! Sorry I haven't implemented the function to handle missing genotypes/dosages for step 2. I will do this soon. But to quickly solve your issue, one option is that you may use the PLINK option --fill-missing-a2 to impute the missing genotypes.
Thanks, Wei
Thanks for the advice Wei. I did impute the missing genotypes with the plink option, but I am still getting the same error for missingness. I presume this is because I have missing phenotypes, does that sound right to you? I can down-sample the genotype data to non-missing phenotype values for one of my traits for testing. But I have a large number of traits to run with different trait missingness, and having to produce multiple datasets for each of them is not a good practice. Do you think you can extend SAIGE to handle missing genotypes and phenotype values? How long do you think that would take? Thank you!
You are welcome. SAIGE is supposed to handle missing phenotypes, so you can use the same plink file for different phenotypes. Would you mind sharing the error message or log file so I have take a look?
Thanks, Wei
On Thu, Sep 13, 2018 at 11:18 AM, IoannaTach notifications@github.com wrote:
Thanks for the advice Wei. I did impute the missing genotypes with the plink option, but I am still getting the same error for missingness. I presume this is because I have missing phenotypes, does that sound right to you? I can down-sample the genotype data to non-missing phenotype values for one of my traits for testing. But I have a large number of traits to run with different trait missingness, and having to produce multiple datasets for each of them is not a good practice. Do you think you can extend SAIGE to handle missing genotypes and phenotype values? How long do you think that would take? Thank you!
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/weizhouUMICH/SAIGE/issues/46#issuecomment-421044584, or mute the thread https://github.com/notifications/unsubscribe-auth/AJfdL-T5Ef-RfYqCfSNpHCRipRXrkNL-ks5uanddgaJpZM4Whxjq .
Error in try(if (length(which(opt == "")) > 0) stop("Missing arguments")) : Missing arguments 48567 samples have been used to fit the glmm null model variance Ratio is 0.98 49998 sample IDs are found in sample file Error in SPAGMMATtest(dosageFile = opt$dosageFile, dosageFileNrowSkip = opt$dosageFileNrowSkip, : ERROR!48567 samples used in glmm model fit do not have dosages Execution halted
Thanks Wei for helping. In step1, I used the non-imputed plink files without any issues. For step2, I imputed the plink data and converted them into vcf.gz
I see. It seems the sample ids specified in the sample file (step 2) do not match the sample ids in the fam fam (plink from step 1). Could you please double check the ids?
Thanks, Wei
On Thu, Sep 13, 2018 at 11:30 AM, IoannaTach notifications@github.com wrote:
Thanks Wei for helping. In step1, I used the non-imputed plink files without any issues. For step2, I imputed the plink data and converted them into vcf.gz
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/weizhouUMICH/SAIGE/issues/46#issuecomment-421048967, or mute the thread https://github.com/notifications/unsubscribe-auth/AJfdL3ZngyN36qn7lRWEIYVPUSvpu24Oks5uanokgaJpZM4Whxjq .
You mean they don't match in terms of numbers?
I mean in terms of ids themselves. The number can be different and the all plink ids need to be contained in the dosage ids.
On Thu, Sep 13, 2018 at 11:40 AM, IoannaTach notifications@github.com wrote:
You mean they don't match in terms of numbers?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/weizhouUMICH/SAIGE/issues/46#issuecomment-421052371, or mute the thread https://github.com/notifications/unsubscribe-auth/AJfdL8_o_x4bZ38EC0X7DQ8hCy9ceOQCks5uanyFgaJpZM4Whxjq .
Hi Wei,
You were right. When you use link to export to vcf, plink combines the FID and ID into "FID_ID". I have now adjusted the sample IDs in the fam and phenotype files used for step1 and rerun step1 successfully. But when I try running step2, again I get the following error:
Rscript ${saigepath}/step2_SPAtests.R \
--vcfFile=/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.imputed.vcf.gz \ --vcfFileIndex=/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.imputed.vcf.gz.tbi \ --vcfField=GT \ --chrom=22 \ --minMAF=0 \ --minMAC=3 \ --sampleFile=/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.nonRGENids.col1.imputed.sample \ --GMMATmodelFile=${outpath}/${Trait}.rda \ --varianceRatioFile=${outpath}/${Trait}.varianceRatio.txt \ --SAIGEOutputFile=${outpath}/${Trait}.SAIGE.vcf.genotype.txt \ --numLinesOutput=2 \ --IsOutputAFinCaseCtrl=TRUE
Warning message: package ‘SAIGE’ was built under R version 3.4.3 $dosageFile [1] ""
$dosageFileNrowSkip [1] 0
$dosageFileNcolSkip [1] 5
$dosageFilecolnamesSkip [1] ""
$vcfFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.imputed.vcf.gz"
$vcfFileIndex [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.imputed.vcf.gz.tbi"
$vcfField [1] "GT"
$bgenFile [1] ""
$bgenFileIndex [1] ""
$savFile [1] ""
$savFileIndex [1] ""
$chrom [1] "22"
$start [1] 1
$end [1] 2.5e+08
$minMAF [1] 0
$minMAC [1] 3
$sampleFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.nonRGENids.col1.imputed.sample"
$GMMATmodelFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/specialRequest_BIN_osteoarthritis_all/AllEthnicities/SAIGE/specialRequest_BIN_osteoarthritis_all.rda"
$varianceRatioFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/specialRequest_BIN_osteoarthritis_all/AllEthnicities/SAIGE/specialRequest_BIN_osteoarthritis_all.varianceRatio.txt"
$SAIGEOutputFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/specialRequest_BIN_osteoarthritis_all/AllEthnicities/SAIGE/specialRequest_BIN_osteoarthritis_all.SAIGE.vcf.genotype.txt"
$numLinesOutput [1] 2
$IsOutputAFinCaseCtrl [1] TRUE
$help [1] FALSE
Error in try(if (length(which(opt == "")) > 0) stop("Missing arguments")) : Missing arguments 48567 samples have been used to fit the glmm null model variance Ratio is 0.98 49998 sample IDs are found in sample file [1] 49998 4 [1] "IID" "IndexInModel" "IndexDose.x" "IndexDose.y" 48567 samples were used in fitting the NULL glmm model and are found in sample file minMAC: 3 minMAF: 0 Minimum MAF of markers to be testd is 3.09e-05 Analysis started at 1.54e+09 Seconds Open VCF done To read the field GT Number of meta lines in the vcf file (lines starting with ##): 4 Number of samples in in the vcf file: 49998 isVariant: TRUE It is a binary trait Analyzing 6650 cases and 41917 controls user system elapsed 20.8 3.9 23.1 [1] 2 numPassMarker: 0 user system elapsed 20.83 3.91 23.18 [1] 4 numPassMarker: 1 user system elapsed 20.88 3.91 23.25 [1] 6 numPassMarker: 2 user system elapsed 20.93 3.91 23.30 [1] 8
Hi @IoannaTach,
It seems the association tests shown in your log file were run successfully. The error message is because that some arguments were not specified in the wrapper R script, which does not affect the actual SAIGE functions.
Thanks! Wei
@IoannaTach I have been having similar trouble when I try and use a subset of specimens (its works if I use everyone the GRM was fitted to). I corrected the IDs from my plink file but still get the same error. Out of interest did your .fam file have 0/1 for phenotype or 1/2?
Many thanks
On 2nd Step, we tried to run the below command with our data of index file and input file
Rscript step2_SPAtests.R --vcfFile=./input/plate_1_to_6.vcf.gz --vcfFileIndex=./input/plate_1_to_6_dec_rec_ped_bprob.vcf.gz.tbi -- vcfField=GT --chrom=chrX --minMAF=0.0001 --minMAC=1 --sampleFile=./input/plate_1_to_6_dec_rec_ped_bprob_samples.txt --GMMATmodelFile=./output/plate_1_to_6_dec_rec_ped_bprob_pheno_step1.rda --varianceRatioFile=./output/plate_1_to_6_dec_rec_ped_bprob_pheno_step1.varianceRatio.txt --SAIGEOutputFile=./output/plate_1_to_6_dec_rec_ped_bprob_pheno_step2.SAIGE.vcf.dosage.txt --numLinesOutput=2 --IsOutputAFinCaseCtrl=TRUE
We are getting below error:
weights.beta.rare is 1 25 weights.beta.common is 1 25 cateVarRatioMinMACVecExclude is 0.5 1.5 2.5 3.5 4.5 5.5 10.5 20.5 cateVarRatioMaxMACVecInclude is 1.5 2.5 3.5 4.5 5.5 10.5 20.5 single-variant association test will be performed Garbage collection 14 = 8+2+4 (level 2) ... 79.1 Mbytes of cons cells used (59%) 19.8 Mbytes of vectors used (31%) 414 samples have been used to fit the glmm null model [1] "Leave-one-chromosome-out is not applied" Single variance ratio is provided, so categorical variance ratio won't be used! variance Ratio is 0.9662084 isCondition is FALSE Open VCF done To read the field DS Number of meta lines in the vcf file (lines starting with ##): 61 Number of samples in in the vcf file: 414 Variant is missing requested data format type. 414 sample IDs are found in the vcf file Error in SPAGMMATtest(vcfFile = opt$vcfFile, vcfFileIndex = opt$vcfFileIndex, : ERROR!414 samples used in glmm model fit do not have dosages Execution halted
If anyone can help us to solve this error.
I would like to use SAIGE with whole-exome data. The data are in plink format and contain missing values. I performed step 1 with no issues. For step 2, I understand I cannot use plink files. I reformatted my files into bgen format, but SAIGE complains due to the missing values. I am now reformatting my data into vcf, as I believe it's the only option for hard call genotypes for step 2? Would vcf files with missing data work for step 2 please? If not, do you have any suggestions? Many thanks!