weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0

Genotypes and dosages in input #338

Closed danimatias93 closed 2 years ago

danimatias93 commented 3 years ago

Hi Wei, first of all, thank you for providing the community with this tool. I am finding it very useful in my analysis.

I have two questions regarding SAIGE.

1) We are working with imputed genotype probabilities from IMPUTE2, and we are wondering which could be the best approach to produce the input files for SAIGE. Is a hard-calling step always necessary (e.g., from genotype probabilities to genotypes, or dosages expressed as integers)? What happens if we provide .bgen files?

2) Is additivity always assumed in association testing? If so, how is this accommodated for the X chromosome (e.g., are males encoded as 0/1 or 0/2)?
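To make the distinction in question 1 concrete, here is a minimal Python sketch of the two ways a triple of genotype probabilities can be reduced to a single value per sample: an expected dosage (which keeps the uncertainty) and a hard call (which discards it). The function names, the choice of which allele is counted, and the 0.9 threshold are illustrative assumptions; this is not a description of what SAIGE does internally with .bgen input.

```python
from typing import Optional, Sequence


def expected_dosage(probs: Sequence[float]) -> float:
    """Expected count of the second allele from (P(AA), P(AB), P(BB)).

    Assumption: the counted allele is B; the result lies in [0, 2].
    """
    p_aa, p_ab, p_bb = probs
    return p_ab + 2.0 * p_bb


def hard_call(probs: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """Most likely genotype (0, 1 or 2), or None if it is below the threshold.

    Assumption: the 0.9 threshold is an arbitrary illustrative value.
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


if __name__ == "__main__":
    uncertain = (0.1, 0.7, 0.2)
    confident = (0.02, 0.03, 0.95)
    for probs in (uncertain, confident):
        print(probs, expected_dosage(probs), hard_call(probs))
```

For an uncertain sample the dosage is a non-integer value, while the hard call is either forced to an integer or set to missing, which is the information loss the question is about.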

Thank you

weizhouUMICH commented 3 years ago

Hi @danimatias93,

  1. SAIGE Step 2 (association tests) can take a bgen file containing genotype probabilities as input. Step 1 requires a plink file containing the markers used to estimate the genetic relationship matrix. These markers need to be LD-pruned, and the number of markers should be larger than the number of "independent samples" so that all sample relatedness in the data is captured. Therefore, we usually include hard-called markers in the plink file for Step 1.
  2. For chromosome X, if males are encoded as 0/1 in the dosage file but you would like SAIGE to recode them to 0/2 in the association tests, set the argument --is_rewrite_XnonPAR_forMales to TRUE, then use --X_PARregion to specify the regions for the recoding and --sampleFile_male to specify the male IDs.
    https://github.com/weizhouUMICH/SAIGE/blob/master/extdata/step2_SPAtests.R#L120
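
The 0/1 to 0/2 recoding described in point 2 can be sketched conceptually in Python. This is only an illustration of the idea (doubling male dosages for variants outside the pseudo-autosomal regions), not SAIGE's actual implementation; the function name, its arguments, and the toy coordinates are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def recode_male_x_dosages(
    dosages: Sequence[float],
    sample_ids: Sequence[str],
    male_ids: Iterable[str],
    position: int,
    par_regions: Sequence[Tuple[int, int]],
) -> List[float]:
    """Return dosages for one X-chromosome variant with males recoded 0/1 -> 0/2.

    Assumption: variants inside any (start, end) PAR interval are left as-is,
    because males are diploid there.
    """
    in_par = any(start <= position <= end for start, end in par_regions)
    if in_par:
        return list(dosages)
    males = set(male_ids)
    # Double the dosage for male samples only; females are left unchanged.
    return [2.0 * d if s in males else d for d, s in zip(dosages, sample_ids)]


if __name__ == "__main__":
    ids = ["f1", "m1", "m2"]
    toy_par = [(1, 1000)]  # hypothetical interval, not real PAR coordinates
    print(recode_male_x_dosages([1.0, 1.0, 0.0], ids, {"m1", "m2"}, 5000, toy_par))
```

With this coding, a male carrying the allele on his single X contributes the same dosage as a homozygous female.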

Thanks, Wei

danimatias93 commented 3 years ago

Hi Wei,

Thanks so much for your kind and quick reply. I still need a couple of clarifications, though.

1) If we provide genotype probabilities (bgen) for Step 2, would SAIGE hard-call the genotypes, as it does when we use the vcf format, or would it use the probabilities in the analysis?

2) I understand that the 0/1 encoding is the standard for males on the X chromosome. What about females? If we provide bgen files as input, will females be encoded as 0/0.5/1 or 0/1/2?

Thank you again for your time, Dani

danimatias93 commented 3 years ago

Hi Wei,

Thank you again for your first reply,

Sorry to insist, but we still need to understand the points raised above before we can continue working with SAIGE. As you can see, we are especially interested in how SAIGE behaves on the X chromosome: the differences between using genotype probabilities and hard calls in the association step, and how genotypes are encoded on this chromosome.

Thanks again,

Dani

weizhouUMICH commented 2 years ago

Sorry for the late reply! We have just released a new version, 1.0.0. It has substantial computational efficiency improvements in both Step 1 and Step 2 for single-variant and set-based tests, as well as clearer log output. We have created a new GitHub page for the program, https://github.com/saigegit/SAIGE, with documentation at https://saigegit.github.io/SAIGE-doc/. The program will be maintained there by multiple SAIGE developers. The docker image has been updated. Please feel free to try version 1.0.0 and report any issues.

Thanks! Wei