weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0
191 stars 73 forks source link

Fractional part in AC from step 2? #142

Closed Ojami closed 4 years ago

Ojami commented 4 years ago

Dear Wei,

I'm using latest version of SAIGE (0.36.3.1) for gen-based analysis:

SPAGMMATtest(vcfFile = "myvcf.vcf.gz", vcfFileIndex = "myvcf.vcf.gz.tbi", vcfField = "GT",chrom = "0", minMAF = 0, minMAC = 3, LOCO = FALSE,maxMAFforGroupTest = 0.01,GMMATmodelFile = "step1_nullmodel.rda",varianceRatioFile = "step1_nullmodel.varianceRatio.txt",SAIGEOutputFile = "myresults.SAIGEspa.gene.txt",numLinesOutput = 1,sparseSigmaFile = "step1_nullmodel.varianceRatio.txt_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx",groupFile = "mygroup.groupFile.txt",IsSingleVarinGroupTest = TRUE,IsOutputAFinCaseCtrl = TRUE,IsOutputNinCaseCtrl = TRUE,IsOutputBETASEinBurdenTest = TRUE,condition = "",sampleFile = "sampleID.step2.txt")

In my single variant analysis results, I see some allele counts have fractional parts e.g. 66.0192 (my vcf file is 'GT' not dosage). Using PLINK I don't see such numbers in the allele counts. Can you please explain what could be the reason?

Thanks! Oveis

weizhouUMICH commented 4 years ago

Hi Oveis,

This is quite interesting! I haven't seen this issue before. The allele counts you get are from the single-variant association tests within the gene-based tests? If you run single-variant association tests only, did you see the same allele counts?

Thanks, Wei

Ojami commented 4 years ago

Hi Wei,

Thank for your response. Yes, they are from single variant tests within gene based test. When I run single-variant association, I don't see any fractional parts any more!

SPAGMMATtest(vcfFile = "myvcf.vcf.gz", vcfFileIndex = "myvcf.vcf.gz.tbi", vcfField = "GT",chrom = "8", minMAF = 0, minMAC = 3,GMMATmodelFile = "step1_nullmodel.rda",varianceRatioFile = "step1_nullmodel.varianceRatio.txt",SAIGEOutputFile = "SAIGEspa.gene.txt",numLinesOutput = 1,sparseSigmaFile = "step1_nullmodel.varianceRatio.txt_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx",IsOutputAFinCaseCtrl = TRUE,sampleFile = "sampleID.step2.txt")

I'm worried if this may affect burden/SKAT/SKAT-O p-values as well?

Thanks! Best/Oveis

weizhouUMICH commented 4 years ago

Dear Oveis,

I tried the script in the example extdata/cmd.sh to perform group tests using genotype and couldn't replicate the problem.

Rscript step2_SPAtests.R \
        --vcfFile=./input/genotype_10markers.vcf.gz \
        --vcfFileIndex=./input/genotype_10markers.vcf.gz.tbi \
        --vcfField=GT \
        --chrom=1 \
        --minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --sampleFile=./input/samplelist.txt \
        --GMMATmodelFile=./output/example_quantitative.rda \
        --varianceRatioFile=./output/example_quantitative_cate.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_quantitative.SAIGE.gene.txt \
        --numLinesOutput=1 \
        --groupFile=./input/groupFile_geneBasedtest_simulation.txt    \
        --sparseSigmaFile=./output/example_quantitative_cate.varianceRatio.txt_relatednessCutoff_0.125.sparseSigma.mtx       \
        --IsSingleVarinGroupTest=TRUE

Are there any missing genotypes in your input files? The option --IsDropMissingDosages can be set as TRUE to drop samples with missing genotypes or dosages, otherwise, the missing ones will be mean imputed. I wonder if it is possible that in the group test, --IsDropMissingDosage is set to FALSE? If not, could you please provide some example data that I can use to replicate the error and debug?

Thanks! Wei

Ojami commented 4 years ago

Dear Wei, Thanks for your response. I guess --IsDropMissingDosages flag was part of the problem, because now using the following code:

SPAGMMATtest(vcfFile = "ukb.efe.qc.lof.vcf.gz", vcfFileIndex = "ukb.efe.qc.lof.vcf.gz.tbi", vcfField = "GT",chrom = "0", minMAF = 0, minMAC = 0.5, LOCO = FALSE,maxMAFforGroupTest = 0.01,GMMATmodelFile = "step1_nullmodel.rda",varianceRatioFile = "step1_nullmodel.varianceRatio.txt",SAIGEOutputFile = "ukb.efe.qc.lof.SAIGEspa.gene.txt",numLinesOutput = 1,sparseSigmaFile = "step1_nullmodel.varianceRatio.txt_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx",groupFile = "ukb.efe.qc.lof.groupFile.txt",IsSingleVarinGroupTest = TRUE,IsOutputAFinCaseCtrl = TRUE,IsOutputNinCaseCtrl = TRUE,IsOutputBETASEinBurdenTest = TRUE,condition = "",sampleFile = "ukb.efe.qc.lof.sampleID.step2.txt",IsDropMissingDosages = TRUE)

I get the following error:

group-based test will be performed Any dosages <= 0.2 for genetic variants with MAC <= 10 are set to be 0 in group tests 44147 samples have been used to fit the glmm null model obj.glmm.null$LOCO: FALSE Leave-one-chromosome-out option is not applied variance Ratio is 1 1 1 1 1 1 1 1 44284 sample IDs are found in sample file [1] 44284 4 [1] ""IID"" ""IndexInModel"" ""IndexDose.x"" ""IndexDose.y"" 44147 samples were used in fitting the NULL glmm model and are found in sample file sparse kinship matrix is going to be used sparseSigmaFile: step1_nullmodel.varianceRatio.txt_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseSigma.mtx Samples with missing dosages will be dropped from the analysis Analysis started at 1582578030 Seconds minMAC: 0.5 minMAF: 0 Minimum MAF of markers to be tested is 5.662899e-06 isCondition is FALSE It is a binary trait Analyzing 138 cases and 44009 controls isCondition is FALSE Analysis started at 1582578030 Seconds genetic variants with 5.662899e-06 <= MAF <= 0.01 are included for gene-based tests Open VCF done To read the field GT Number of meta lines in the vcf file (lines starting with ##): 28 Number of samples in the vcf file: 44284 It is a binary trait Case-control imbalance is adjusted for binary traits. isCondition is FALSE geneID: PSD3 std::size_t sample_size = marker_file.samples().size();44284 missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! missing_cnt > 0! cntMarker: 240 isCondition is FALSE weights: 24.99252 24.99252 24.99252 24.99252 24.98505 0 24.96263 24.99252 24.98505 0 0 24.99252 24.99252 24.99252 0 24.99252 24.95517 24.99252 24.99252 24.99252 24.99252 24.99252 24.98505 24.98505 24.98505 24.99252 24.95517 24.99252 24.99252 24.99252 24.97757 24.99252 24.99252 24.98505 0 0 0 24.99252 24.99252 24.99252 24.99252 24.95517 24.99252 24.97757 24.99252 24.99252 24.98505 24.99252 24.99252 24.99252 24.99252 24.98505 24.99252 24.99252 24.9701 24.98505 24.98505 0 0 24.98505 24.98505 24.9701 24.98505 0 24.98505 24.91041 24.9701 24.97757 24.98505 24.99252 24.99252 24.83598 24.99252 24.98505 24.99252 0 0 24.9701 24.98505 24.99252 24.99252 24.99252 24.99252 24.98505 24.99252 24.98505 24.99252 24.98505 24.99252 24.99252 0 24.99252 24.89551 24.99252 24.96263 24.45239 24.97757 24.91041 24.95517 24.98505 24.97757 24.99252 24.99252 24.89551 24.97757 24.99252 0 24.99252 24.99252 0 0 24.99252 24.99252 24.99252 24.99252 24.9701 0 24.99252 24.9701 24.99252 24.99252 24.98505 24.99252 24.99252 0 24.96263 0 24.99252 24.99252 0 0 24.60662 24.97757 24.98505 24.56246 24.97757 24.9477 24.97757 0 24.99252 24.98505 24.9701 24.9701 24.99252 24.99252 24.99252 24.53306 24.98505 24.99252 24.99252 24.95517 24.98505 24.9701 24.98505 24.99252 23.74486 24.99252 24.99252 24.99252 24.99252 0 0 24.99252 24.95517 24.97757 24.99252 24.99252 24.99252 24.98505 24.99252 24.98505 24.99252 24.95517 24.98505 24.99252 24.78401 24.99252 24.68777 0 24.9477 0 24.7766 24.56246 24.97757 24.92532 24.95517 24.99252 24.91786 24.99252 24.99252 24.99252 24.99252 24.97757 0 0 0 24.97757 24.98505 24.99252 24.99252 24.99252 24.99252 24.99252 24.99252 24.99252 24.91786 0 0 24.98505 0 24.99252 24.98505 0 24.24821 24.99252 24.99252 24.97757 24.99252 24.99252 24.99252 24.99252 24.99252 24.99252 24.98505 24.99252 24.9477 24.98505 24.98505 24.9701 24.99252 24.99252 24.99252 24.99252 24.99252 0 0 24.98505 24.99252 24.99252 24.98505 Error in GratioMatrixall[1:m, 1:m] : subscript out of bounds Calls: SPAGMMATtest ... groupTest -> system.time -> SAIGE_SKAT_withRatioVec Timing stopped at: 2.953 0 2.951 Execution halted

I would like to share the data, but these data are from UK Biobank WES so, unfortunately, I cannot publicly share them. What do you think about the source of error?

UPDATE

I checked my groupFile, and realized that the error arises from variants on the same position. For instance in my example above, having these both variants generates error:

8:18535928_G/C 8:18535928_G/A

Removing any of these variants, would resolve the issue. So, given this, what do you think Wei to resolve the issue? Is this a bug?

Thanks in advance, Best/Oveis

weizhouUMICH commented 4 years ago

Hi Oveis,

Thank you very much for the update! Very useful for debugging. The two markers with the same position only cause the problem when --IsDropMissingDosage=T?

Thanks, Wei

Ojami commented 4 years ago

Hi Wei,

For that specific gene, it seems that variants with same positions generate that error. Nonetheless, I am going to check this more in details to see if there are other variants may cause that error, or only same positions causes this error. And yes, IsDropMissingDosage has been set to True in my example.

Best Oveis

weizhouUMICH commented 4 years ago

Thank you Oveis! In your group file, could you please try change the order of the two markers in the group file? Thanks! Wei

Ojami commented 4 years ago

I swapped the orders of those two markers (also another two markers with the same position on the same gene!), but still SAIGE throws the same error. Just to be sure: if the original groupFile was:

GENE marker1 marker2 marker3_A marker3_B marker4

I modified it in this way:

GENE marker1 marker2 marker3_B marker3_A marker4

Best/ Oveis

Ojami commented 4 years ago

Dear Wei,

I checked my example deeper, and the following is the summary of what I got:

  1. In my example, the gene has 10 pairs of markers (totally 20), having the same position on chromosome. Removing any of the markers, resolves the problem.

  2. Changing the order of these problematic markers resolves the issue only and only if, at least one marker comes between them. For instance, the following order doesn't generate any error:

GENE marker1 marker2 marker3_A marker4 marker3_B

But, it creates another problem: the results (p-values, etc.) of both SKAT and single variant tests change! So, the order of variant is greatly affects the results here (why is that?)

  1. The problem mentioned in 2 also exists when there is no overlapping variants as well. For isntance the results of both rare variant (SKAT, Burden, SKAT-O) and single variant tests greatly differ in the following two cases:

GENE marker1 marker2 marker3 marker4 GENE marker1 marker4 marker2 marker3

I can share the marker ids (my groupFile) for this specific gene as well.

I appreciate your help in advance, because now I am very confused 😞

Best/Oveis

GENE_NAME 8:18535748_T/G 8:18535780_C/A 8:18535781_G/A 8:18535811_T/C 8:18535826_T/C 8:18535864_G/T 8:18535887_G/T 8:18535893_A/T 8:18535908_C/G 8:18535928_G/C 8:18535928_G/A 8:18556213_A/G 8:18556247_C/T 8:18556270_T/C 8:18556274_C/T 8:18556304_C/T 8:18556343_G/T 8:18572563_G/A 8:18572570_A/C 8:18572574_T/A 8:18572575_T/A 8:18572577_T/C 8:18572580_T/C 8:18572587_C/T 8:18572590_T/C 8:18572598_G/A 8:18572605_G/C 8:18572629_C/T 8:18572631_C/T 8:18572636_G/C 8:18572652_C/T 8:18572664_T/C 8:18572670_C/T 8:18575219_C/T 8:18575234_C/T 8:18575234_C/G 8:18575237_C/T 8:18575252_C/T 8:18575257_G/C 8:18575264_C/T 8:18575267_T/C 8:18575278_T/G 8:18575280_T/G 8:18600396_C/T 8:18632664_T/C 8:18632685_C/T 8:18632687_T/A 8:18632688_G/A 8:18632690_G/A 8:18632697_C/A 8:18632712_T/C 8:18632726_C/T 8:18632763_C/T 8:18632773_TTC/T 8:18632780_G/A 8:18632781_G/A 8:18632783_G/C 8:18632785_CT/C 8:18632785_C/CT 8:18765465_G/C 8:18765475_C/T 8:18765477_C/T 8:18765505_T/C 8:18765516_C/A 8:18765517_A/G 8:18765534_A/G 8:18765535_T/C 8:18799350_C/T 8:18801284_G/A 8:18801286_A/C 8:18801288_C/G 8:18801311_T/C 8:18801324_T/G 8:18801339_G/A 8:18801363_A/G 8:18801364_G/C 8:18801364_G/T 8:18801368_G/T 8:18804540_G/A 8:18804558_A/G 8:18804579_G/A 8:18804586_G/C 8:18804595_A/G 8:18804598_C/T 8:18804600_T/C 8:18804737_T/C 8:18804739_G/C 8:18804746_T/C 8:18804756_A/G 8:18804762_G/A 8:18804772_C/A 8:18804782_G/A 8:18804809_C/T 8:18804818_T/G 8:18804826_C/A 8:18804845_G/A 8:18804849_T/A 8:18804860_G/A 8:18804875_C/G 8:18804876_G/A 8:18804888_C/T 8:18808761_C/T 8:18808764_T/C 8:18808773_C/A 8:18808776_G/A 8:18867694_CG/C 8:18867707_G/A 8:18867713_A/C 8:18867714_T/C 8:18867720_C/T 8:18867740_T/C 8:18867747_C/A 8:18867752_C/T 8:18867758_T/C 8:18867762_C/G 8:18867768_C/T 8:18867796_G/C 8:18867798_T/A 8:18867801_C/A 8:18867828_C/G 8:18867829_C/G 8:18867832_G/T 8:18867846_C/G 8:18867854_C/G 8:18867866_A/T 8:18867867_T/C 8:18867880_C/T 8:18867890_C/T 8:18867899_C/T 8:18867906_G/T 8:18867906_G/A 8:18867920_G/A 8:18867929_G/A 8:18867968_G/T 8:18867977_T/C 8:18867990_C/A 8:18867996_T/G 8:18868007_G/A 8:18868019_C/T 8:18868022_A/T 8:18868036_T/A 8:18868055_T/A 8:18868056_C/T 8:18868059_G/T 8:18868062_C/T 8:18868067_A/T 8:18871638_C/G 8:18871678_C/T 8:18871680_T/C 8:18871687_A/G 8:18871716_C/T 8:18871717_G/A 8:18871740_C/G 8:18871743_G/A 8:18871780_C/G 8:18871783_A/C 8:18871818_C/A 8:18871840_G/A 8:18871857_T/C 8:18871863_C/T 8:18871864_T/C 8:18871888_G/C 8:18871908_G/A 8:18871908_G/T 8:18871915_G/T 8:18871927_T/C 8:18871933_C/T 8:18871941_C/G 8:18871954_C/T 8:18871968_A/G 8:18871969_CAT/C 8:18871980_C/T 8:18871983_C/T 8:18871986_G/A 8:18871998_G/A 8:18872007_C/T 8:18872017_C/T 8:18872019_C/G 8:18872040_G/A 8:18872067_A/G 8:18872071_T/C 8:18872083_G/A 8:18872098_A/G 8:18872124_G/C 8:18872128_G/T 8:18872137_C/T 8:18872155_T/C 8:18872167_T/C 8:18872208_G/A 8:18872211_G/C 8:18872223_T/C 8:18872229_C/A 8:18872244_T/A 8:18872250_G/A 8:18872304_A/G 8:18872307_A/G 8:18872308_G/T 8:18872313_T/C 8:18872314_T/A 8:18872331_C/T 8:18872332_G/A 8:18872332_G/T 8:18872338_C/T 8:18872341_T/C 8:18872350_C/G 8:18872389_C/G 8:18872391_T/C 8:18872395_C/A 8:18872397_A/G 8:18872402_C/G 8:18872414_T/A 8:18872422_C/T 8:18872491_C/T 8:18872491_C/G 8:18872499_T/C 8:18872501_A/T 8:18872511_G/T 8:18872512_G/A 8:18872520_C/A 8:18872520_C/T 8:18872526_T/A 8:18872529_T/G 8:18872536_C/A 8:18872546_A/T 8:18872551_C/G 8:18872561_G/C 8:18872565_C/A 8:18872580_T/C 8:18872586_C/A 8:18872607_C/T 8:18872667_A/G 8:18872668_T/C 8:18872682_G/A 8:18872692_T/C 8:18872722_C/T 8:18872727_T/C 8:18872728_GAT/G 8:18872731_C/T 8:18936036_G/A 8:18936042_G/A 8:18936043_G/C 8:18936057_G/T 8:18936057_G/A 8:19013567_G/T 8:19013568_C/G 8:19013569_G/T 8:19013582_A/G

weizhouUMICH commented 4 years ago

Hi Ojami,

Sorry I forgot to put the note for the group file, which is the order of the variants in the group file needs to match the order of variants in the vcf file. If you put marker4 before marker2 and marker3 in the group file and marker4 appears after the two markers in the vcf file, only marker 1 and marker 4 will be tested and marker 2 and 3 will be ignored in the tests. GENE marker1 marker4 marker2 marker3 I've just put the note in the wiki. You may find out which markers are tested for the gene in the output for the gene-based tests.

For the group file GENE marker1 marker2 marker3_A marker4 marker3_B, marker3_B won't be tested.

Since any sample with missing genotypes in any of the markers tested in the gene will be dropped, the single-variant assoc tests for marker 1 will be different because more samples with missing genotypes for marker 2 and marker 3 are dropped, if those two markers are also tested in the gene.

Thanks for the marker list! Let me see what happened to those triallelic markers.

Thanks! Wei

weizhouUMICH commented 4 years ago

Hi Ojami,

I think I just figured out the reason causing the Error "Error in GratioMatrixall[1:m, 1:m] : subscript out of bounds" when you specified IsDropMissingDosages = TRUE. It is because after removing samples with missing genotypes, some markers in the gene became monomorphic. In the new version 0.36.3.2, this should have been fixed. Could you please give it a try? I don't think there is a problem in the code to handle triallelic markers. I've made example data to test the triallelic markers https://github.com/weizhouUMICH/SAIGE/blob/master/extdata/cmd.tri-allelic.sh The tests were successful.

Thanks! Wei

Ojami commented 4 years ago

Hi Wei,

Many thanks for the update. Yes, error was resolved! Also, no fractional part appears in AC anymore (IsDropMissingDosages = TRUE).

Best/Oveis

Ojami commented 4 years ago

Dear Wei,

There are two other (at least) situations in which the added update may cause another error.

  1. After removing missing variant calls, a rwo vector remains (instead of a matrix). For instance, if m is the sample size, m-1 samples have missing calls in at least one of the variants.

Removing 40386 samples with missing dosages/genotypes in the gene Error in base::colMeans(x, na.rm = na.rm, dims = dims, ...) : 'x' must be an array of at least two dimensions Calls: SPAGMMATtest -> colMeans -> colMeans ->

  1. All m samples have missing calls. In this case, the previous error Error in GratioMatrixall[1:m, 1:m] : subscript out of bounds happens.

Can you please also take a look at these? Of course there are some ways to tackle these errors before running SAIGE (from user perspective), and removing problematic genes before writing to groupFile.

Thanks/Oveis

weizhouUMICH commented 4 years ago

Hi Oveis,

Sorry for taking so long to get back to you. Thanks for testing the code out and figuring out the new issue! I've updated the code as 0.36.3.3. Would you mind trying the new version to see if the problem is solved?

Thanks! Wei

Ojami commented 4 years ago

Dear Wei,

Thanks for your update. I'm afraid that doesn't fix the issue. SAIGE now throws a new error:

Error in groupTest(Gmat = Gmat, obj.glmm.null = obj.glmm.null.sub, cateVarRatioMinMACVecExclude = cateVarRatioMinMACVecExclude, : object 'obj.glmm.null.sub' not found Calls: SPAGMMATtest ... groupTest -> system.time -> SAIGE_SKAT_withRatioVec

Isn't it better to remove these genes (as mentioned in my last comment) prior to SKAT? After all, rare variant association tests seem pointless for these genes.

Best/Oveis

SillySabertooth commented 4 years ago

Dear Wei, thanks for such a powerful tool!

I have a couple of questions, and I find to ask them in this topic more convenient.

  1. Is it right that always when I leave IsDropMissingDosages=false (default) I will have fractional part in AC? Since I have them both when using stand-alone and within the gene-based single-variant association tests.

  2. Could you add some information about how the imputing works here? Also I read about minInfo usage for filtering based on the imputation score. When i tried to use any numbers (0.000001..100) with this option to gene_based tests, i got "zero snps to analyse". For "stand_alone" single-variant association test I have got "1" in all rows in Imputationinfo column. Or i need to impute it by myself in order to filter that later, since "imputationInfo: imputation info. If not in dosage/genotype input file, will output 1"(c).?? Thus, I feel a bit confused in my attempts to figure out how it works.

  3. As I see, when i set IsSingleVarinGroupTest=TRUE & IsDropMissingDosages = TRUE in SAIGE-GENE, into a single-variant association tests fall those samples that have no missings for ALL markers in specific gene? This explains why I have different AF when I use "stand_alone" and "within gene_based tests", as well as p-values. It was not obvious for me.. meh.

  4. plink -assos make an assocation test for minor allele. I found in SAIGE-GENE output a line about "flipping to maf allele" the test for some snps, where ref allele was minor. But for single test, "Association results are with regard to Allele2."(c) And this allele is not a minor one.

here is one examples, one snp of interest bp | ref/alt | p.value.single_test | AF.Cases.single_test | AF.Controls.single_test | p.value.gene_based | AF.Cases.gene_based | AF.Controls.gene_based | a1/a2 | P | F_A | F_U chr1:169519049 | T/C | 0.4070926522054 | 0.959259259259259 | 0.983208758816718 | 0.9488943 | 0.96875 | 0.9818917 | C/T | 2.78e-05 | 0.04074 | 0.01679

p.s. I always used one dataset with genotypes for all steps: 270_exomes cases and 9525_microarray controls, that were beforehand analysis intersected and merged via bcftools (26k of markers).

Best, Danat