weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0
188 stars 73 forks source link

missing data in step 2 #46

Closed IoannaTach closed 4 years ago

IoannaTach commented 6 years ago

I would like to use SAIGE with whole-exome data. The data are in plink format and contain missing values. I performed step 1 with no issues. For step 2, I understand I cannot use plink files. I reformatted my files into bgen format, but SAIGE complains due to the missing values. I am now reformatting my data into vcf, as I believe it's the only option for hard call genotypes for step 2? Would vcf files with missing data work for step 2 please? If not, do you have any suggestions? Many thanks!

IoannaTach commented 6 years ago

In the end the vcf file with hard called genotypes did not work either, as SAIGE again does not like missing data. What would you suggest please?

weizhouUMICH commented 6 years ago

Hi @IoannaTach, Thank you for your interest in using SAIGE! Sorry I haven't implemented the function to handle missing genotypes/dosages for step 2. I will do this soon. But to quickly solve your issue, one option is that you may use the PLINK option --fill-missing-a2 to impute the missing genotypes.

Thanks, Wei

IoannaTach commented 6 years ago

Thanks for the advice Wei. I did impute the missing genotypes with the plink option, but I am still getting the same error for missingness. I presume this is because I have missing phenotypes, does that sound right to you? I can down-sample the genotype data to non-missing phenotype values for one of my traits for testing. But I have a large number of traits to run with different trait missingness, and having to produce multiple datasets for each of them is not a good practice. Do you think you can extend SAIGE to handle missing genotypes and phenotype values? How long do you think that would take? Thank you!

weizhouUMICH commented 6 years ago

You are welcome. SAIGE is supposed to handle missing phenotypes, so you can use the same plink file for different phenotypes. Would you mind sharing the error message or log file so I have take a look?

Thanks, Wei

On Thu, Sep 13, 2018 at 11:18 AM, IoannaTach notifications@github.com wrote:

Thanks for the advice Wei. I did impute the missing genotypes with the plink option, but I am still getting the same error for missingness. I presume this is because I have missing phenotypes, does that sound right to you? I can down-sample the genotype data to non-missing phenotype values for one of my traits for testing. But I have a large number of traits to run with different trait missingness, and having to produce multiple datasets for each of them is not a good practice. Do you think you can extend SAIGE to handle missing genotypes and phenotype values? How long do you think that would take? Thank you!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/weizhouUMICH/SAIGE/issues/46#issuecomment-421044584, or mute the thread https://github.com/notifications/unsubscribe-auth/AJfdL-T5Ef-RfYqCfSNpHCRipRXrkNL-ks5uanddgaJpZM4Whxjq .

IoannaTach commented 6 years ago

Error in try(if (length(which(opt == "")) > 0) stop("Missing arguments")) : Missing arguments 48567 samples have been used to fit the glmm null model variance Ratio is 0.98 49998 sample IDs are found in sample file Error in SPAGMMATtest(dosageFile = opt$dosageFile, dosageFileNrowSkip = opt$dosageFileNrowSkip, : ERROR!48567 samples used in glmm model fit do not have dosages Execution halted

IoannaTach commented 6 years ago

Thanks Wei for helping. In step1, I used the non-imputed plink files without any issues. For step2, I imputed the plink data and converted them into vcf.gz

weizhouUMICH commented 6 years ago

I see. It seems the sample ids specified in the sample file (step 2) do not match the sample ids in the fam fam (plink from step 1). Could you please double check the ids?

Thanks, Wei

On Thu, Sep 13, 2018 at 11:30 AM, IoannaTach notifications@github.com wrote:

Thanks Wei for helping. In step1, I used the non-imputed plink files without any issues. For step2, I imputed the plink data and converted them into vcf.gz

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/weizhouUMICH/SAIGE/issues/46#issuecomment-421048967, or mute the thread https://github.com/notifications/unsubscribe-auth/AJfdL3ZngyN36qn7lRWEIYVPUSvpu24Oks5uanokgaJpZM4Whxjq .

IoannaTach commented 6 years ago

You mean they don't match in terms of numbers?

weizhouUMICH commented 6 years ago

I mean in terms of ids themselves. The number can be different and the all plink ids need to be contained in the dosage ids.

On Thu, Sep 13, 2018 at 11:40 AM, IoannaTach notifications@github.com wrote:

You mean they don't match in terms of numbers?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/weizhouUMICH/SAIGE/issues/46#issuecomment-421052371, or mute the thread https://github.com/notifications/unsubscribe-auth/AJfdL8_o_x4bZ38EC0X7DQ8hCy9ceOQCks5uanyFgaJpZM4Whxjq .

IoannaTach commented 6 years ago

Hi Wei,

You were right. When you use link to export to vcf, plink combines the FID and ID into "FID_ID". I have now adjusted the sample IDs in the fam and phenotype files used for step1 and rerun step1 successfully. But when I try running step2, again I get the following error:

Rscript ${saigepath}/step2_SPAtests.R \

   --vcfFile=/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.imputed.vcf.gz \
   --vcfFileIndex=/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.imputed.vcf.gz.tbi \
   --vcfField=GT \
   --chrom=22 \
   --minMAF=0 \
   --minMAC=3 \
   --sampleFile=/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.nonRGENids.col1.imputed.sample \
   --GMMATmodelFile=${outpath}/${Trait}.rda \
   --varianceRatioFile=${outpath}/${Trait}.varianceRatio.txt \
   --SAIGEOutputFile=${outpath}/${Trait}.SAIGE.vcf.genotype.txt \
   --numLinesOutput=2 \
   --IsOutputAFinCaseCtrl=TRUE

Warning message: package ‘SAIGE’ was built under R version 3.4.3 $dosageFile [1] ""

$dosageFileNrowSkip [1] 0

$dosageFileNcolSkip [1] 5

$dosageFilecolnamesSkip [1] ""

$vcfFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.imputed.vcf.gz"

$vcfFileIndex [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.imputed.vcf.gz.tbi"

$vcfField [1] "GT"

$bgenFile [1] ""

$bgenFileIndex [1] ""

$savFile [1] ""

$savFileIndex [1] ""

$chrom [1] "22"

$start [1] 1

$end [1] 2.5e+08

$minMAF [1] 0

$minMAC [1] 3

$sampleFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/UKB_Freeze_Three.GL.splitmulti.nonRGENids.col1.imputed.sample"

$GMMATmodelFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/specialRequest_BIN_osteoarthritis_all/AllEthnicities/SAIGE/specialRequest_BIN_osteoarthritis_all.rda"

$varianceRatioFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/specialRequest_BIN_osteoarthritis_all/AllEthnicities/SAIGE/specialRequest_BIN_osteoarthritis_all.varianceRatio.txt"

$SAIGEOutputFile [1] "/GWD/appbase/projects/RD-TSci-UKB/UKBB_exomesdownload/Ioanna/specialRequest_BIN_osteoarthritis_all/AllEthnicities/SAIGE/specialRequest_BIN_osteoarthritis_all.SAIGE.vcf.genotype.txt"

$numLinesOutput [1] 2

$IsOutputAFinCaseCtrl [1] TRUE

$help [1] FALSE

Error in try(if (length(which(opt == "")) > 0) stop("Missing arguments")) : Missing arguments 48567 samples have been used to fit the glmm null model variance Ratio is 0.98 49998 sample IDs are found in sample file [1] 49998 4 [1] "IID" "IndexInModel" "IndexDose.x" "IndexDose.y" 48567 samples were used in fitting the NULL glmm model and are found in sample file minMAC: 3 minMAF: 0 Minimum MAF of markers to be testd is 3.09e-05 Analysis started at 1.54e+09 Seconds Open VCF done To read the field GT Number of meta lines in the vcf file (lines starting with ##): 4 Number of samples in in the vcf file: 49998 isVariant: TRUE It is a binary trait Analyzing 6650 cases and 41917 controls user system elapsed 20.8 3.9 23.1 [1] 2 numPassMarker: 0 user system elapsed 20.83 3.91 23.18 [1] 4 numPassMarker: 1 user system elapsed 20.88 3.91 23.25 [1] 6 numPassMarker: 2 user system elapsed 20.93 3.91 23.30 [1] 8

weizhouUMICH commented 4 years ago

Hi @IoannaTach,

It seems the association tests shown in your log file were run successfully. The error message is because that some arguments were not specified in the wrapper R script, which does not affect the actual SAIGE functions.

Thanks! Wei

oalavijeh commented 4 years ago

@IoannaTach I have been having similar trouble when I try and use a subset of specimens (its works if I use everyone the GRM was fitted to). I corrected the IDs from my plink file but still get the same error. Out of interest did your .fam file have 0/1 for phenotype or 1/2?

Many thanks

AfzalBioinfo commented 3 years ago

On 2nd Step, we tried to run the below command with our data of index file and input file

Rscript step2_SPAtests.R --vcfFile=./input/plate_1_to_6.vcf.gz --vcfFileIndex=./input/plate_1_to_6_dec_rec_ped_bprob.vcf.gz.tbi -- vcfField=GT --chrom=chrX --minMAF=0.0001 --minMAC=1 --sampleFile=./input/plate_1_to_6_dec_rec_ped_bprob_samples.txt --GMMATmodelFile=./output/plate_1_to_6_dec_rec_ped_bprob_pheno_step1.rda --varianceRatioFile=./output/plate_1_to_6_dec_rec_ped_bprob_pheno_step1.varianceRatio.txt --SAIGEOutputFile=./output/plate_1_to_6_dec_rec_ped_bprob_pheno_step2.SAIGE.vcf.dosage.txt --numLinesOutput=2 --IsOutputAFinCaseCtrl=TRUE

We are getting below error:

weights.beta.rare is 1 25 weights.beta.common is 1 25 cateVarRatioMinMACVecExclude is 0.5 1.5 2.5 3.5 4.5 5.5 10.5 20.5 cateVarRatioMaxMACVecInclude is 1.5 2.5 3.5 4.5 5.5 10.5 20.5 single-variant association test will be performed Garbage collection 14 = 8+2+4 (level 2) ... 79.1 Mbytes of cons cells used (59%) 19.8 Mbytes of vectors used (31%) 414 samples have been used to fit the glmm null model [1] "Leave-one-chromosome-out is not applied" Single variance ratio is provided, so categorical variance ratio won't be used! variance Ratio is 0.9662084 isCondition is FALSE Open VCF done To read the field DS Number of meta lines in the vcf file (lines starting with ##): 61 Number of samples in in the vcf file: 414 Variant is missing requested data format type. 414 sample IDs are found in the vcf file Error in SPAGMMATtest(vcfFile = opt$vcfFile, vcfFileIndex = opt$vcfFileIndex, : ERROR!414 samples used in glmm model fit do not have dosages Execution halted

If anyone can help us to solve this error.