weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0

Questions about fitNULLGLMM #92

Closed apoliakov closed 4 years ago

apoliakov commented 5 years ago

Hello and thank you for your work on SAIGE! Looks like I have 1 likely bug and 2 questions about the fitNULLGLMM step.

1. Possible bug: crash with a particular Plink file

I have a plink file (bed, bim, fam) for a dataset. The file has ~150,000 individuals and ~19,000,000 SNPs. Many of the SNPs have missing calls and some are multi-allelic. The multi-allelics are encoded as separate variants, with "missing" genotypes where the specified alternate allele does not apply.

I tried using this file directly in fitNULLGLMM (SAIGE version 0.35.8.1). It crashed R with an arithmetic error at this point:

M: 18973975, N: 151028
size of genoVecofPointers: 359
setgeno mark1
setgeno mark2
[crash: arithmetic exception in R with process exit]

I then made a smaller Plink file without any missing genotypes, and that file worked fine. This suggests that the crash may have to do with missing genotypes, or maybe with the multi-allelics. I'd like to confirm what the root cause is. Is this a known issue?

2. Question: what is a "good" Plink file?

To run this step accurately, given the numbers above, and since I'm making a custom file anyway: what would you recommend? Should I make a file with 10,000 SNPs, 100,000, or more? Should they be rare SNPs or common SNPs? I noticed that a bigger file makes this step take longer, but does it also make the model more accurate? Any guidance you can offer would be very helpful.

3. Question: what does "non-genetic covariate" mean?

Your manual mentions "non-genetic covariates" here: https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE#input-files What exactly does that mean? For example genetic principal components are often used as covariates in single-variant GWAS. Does this mean that we should not include the principal components?

Thanks again for your time!

weizhouUMICH commented 5 years ago

Copying my reply from our email exchange here, as it may help other users as well.

To fit the null logistic mixed model using fitNULLGLMM (step 1), ideally the number of independent markers used to construct the GRM is larger than the sample size. In our previous GWAS for UK Biobank, we used the 93k markers that UKB had pruned for kinship estimation, while the sample size was ~400k (white British samples), and the results looked fine. We then compared the p-values from that analysis (93k markers for the GRM) against an analysis with 340k markers for the GRM, for four phenotypes with different prevalences, as shown in Supplementary Figure 17 of the SAIGE paper https://www.ncbi.nlm.nih.gov/pubmed/30104761. For phenotypes such as CAD, the inflation of p-values was further corrected when 340k markers were used for the GRM.

The step 1 error is not due to missing genotypes, since SAIGE mean-imputes the missing genotypes in the input plink file. It is due to the large sample size combined with the very large number of markers (18 million). The plink genotypes are stored in a binary vector, which is chunked into smaller vectors, each requiring a block of contiguous memory. By default, each chunk uses 2 Gb of contiguous memory (memoryChunk=2). According to the log, 359 chunks were allocated for your data.
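As a sanity check, the chunk count in the log can be reproduced with a little arithmetic (a sketch assuming plink's 2-bits-per-genotype packing and 1 Gb = 1e9 bytes; SAIGE's actual setgeno internals may differ):

```python
import math

M = 18973975   # markers, from the log above
N = 151028     # samples, from the log above

bytes_per_marker = math.ceil(N / 4)    # 2 bits per genotype -> 4 genotypes per byte
chunk_bytes = 2 * 10**9                # memoryChunk=2, assuming 1 Gb = 1e9 bytes
markers_per_chunk = chunk_bytes // bytes_per_marker
n_chunks = math.ceil(M / markers_per_chunk)

print(n_chunks)  # 359, matching "size of genoVecofPointers: 359" in the log
```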

Using 18 million markers is not necessary for step 1. The computational cost of step 1 is linear in the number of markers used for the GRM. I would recommend using pruned markers, and the number should be larger than the sample size. At least 100 markers for each MAC category (MAC = 1, 2, 3, 4, 5, 6-10, 10-20) are needed in the plink file for the GRM, so they can be used later to estimate the variance ratio (the second part of step 1) for testing ultra-rare variants in gene-based tests.
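The MAC-category requirement can be checked before running step 1 (a sketch; the bin labels follow the categories above, with MAC = 10 placed in the 6-10 bin so the bins stay disjoint, and the function names are my own):

```python
from collections import Counter

def mac_category(mac):
    """Bin a minor allele count into MAC = 1, 2, 3, 4, 5, 6-10, 10-20."""
    if mac <= 5:
        return str(mac)
    if mac <= 10:
        return "6-10"
    if mac <= 20:
        return "10-20"
    return ">20"

def enough_rare_markers(macs, min_per_bin=100):
    """Check that each rare MAC bin has at least min_per_bin markers."""
    counts = Counter(mac_category(m) for m in macs if 1 <= m <= 20)
    bins = ["1", "2", "3", "4", "5", "6-10", "10-20"]
    return {b: counts.get(b, 0) >= min_per_bin for b in bins}
```

If any bin comes back False, there are too few ultra-rare markers for the categorical variance ratio estimation used by the gene-based tests.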

A GRM is always needed for step 1. For gene-based tests, we use both a full GRM and a sparse GRM for variance ratio estimation, while for single-variant association tests we use only a full GRM for variance ratio estimation (the second part of step 1). The sparse GRM can be constructed using the step 0 script.

Non-genetic covariates are the variables to include in the null model, such as gender, age, and PCs.

apoliakov commented 4 years ago

@weizhouUMICH it's been a while but we've been using your guidance - thank you.

We've had to do this a few times now for different cases :)

I think it might be a good idea, in the future, to have SAIGE automatically do these steps. That is

  1. I tell SAIGE to do fitNULLGLMM and give it a large plink file. Let it be 18M markers if needed. Maybe it's 90M markers. Whatever.
  2. SAIGE automatically reads the SNPs it needs from the file (maybe it only needs 50K, so it reads that many) and automatically determines the MAC levels and so on.
  3. SAIGE just outputs the GRM.
  4. Maybe there's a flag on SAIGE so I can tell it "use 100K markers" or "use 400K markers", because the latter is definitely slower. But that's it.

And then I don't have to worry about the rest. :)

I originally used 18M markers because I imagined SAIGE would do that for me automatically :) But I imagined wrong.

Because right now I have to generate lots of plink files for the GRM (different subpopulations, cohorts, and so on), and then I also have to generate separate bgen files for the association tests themselves. The plink and bgen files contain a lot of the same data, which is a little messy. So (as a separate issue) having step 2 accept plink format would also be great.

Just as a thought... Thanks again!

weizhouUMICH commented 4 years ago

Hi @apoliakov,

Thanks for the suggestions! Since version 0.35.8.8, an argument minMAFforGRM can be used to specify the minimum MAF for markers in the plink file to be used for the GRM. For data sets with large sample sizes it is not feasible to store the GRM in memory, so instead of storing it, SAIGE computes elements of the GRM on the fly when the algorithm needs them. Therefore, SAIGE can't output the full GRM. It outputs the sparse GRM for gene- or region-based tests in step 1. We will consider developing an option in step 2 to take plink files as input.
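The on-demand computation can be illustrated with a toy sketch (not SAIGE's actual code, which works on packed genotype chunks): missing calls are mean-imputed per marker, and a single GRM entry for samples i and j is computed only when requested, using the usual standardized-genotype formula.

```python
def grm_element(G, i, j):
    """Toy on-the-fly GRM entry between samples i and j.

    G: list of per-marker genotype lists with values 0/1/2 or None (missing).
    Missing genotypes are mean-imputed per marker, as SAIGE does in step 1.
    """
    total, m_used = 0.0, 0
    for marker in G:
        observed = [g for g in marker if g is not None]
        mean = sum(observed) / len(observed)   # mean-imputation value
        p = mean / 2.0                         # allele frequency estimate
        if p <= 0.0 or p >= 1.0:
            continue                           # monomorphic marker: skip
        gi = marker[i] if marker[i] is not None else mean
        gj = marker[j] if marker[j] is not None else mean
        total += (gi - 2 * p) * (gj - 2 * p) / (2 * p * (1 - p))
        m_used += 1
    return total / m_used
```

Nothing sample-by-sample is ever materialized as an N x N matrix; memory stays proportional to the genotype data itself.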

Thanks, Wei

apoliakov commented 4 years ago

Ok, that's interesting. I have a couple of thoughts.

1. minMAFforGRM

I would recommend using pruned markers and the number should be larger than the sample size. At least 100 markers for each MAC category in MAC = 1, 2, 3, 4, 5, 6-10, 10-20 are needed in the plink file for GRM

So isn't the recommended minMAFforGRM always 1/(2N)? It's not clear when we'd use anything else...
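For concreteness (a sketch; the helper is mine, not a SAIGE function): with N samples, a marker with minor allele count MAC has MAF = MAC/(2N), so a threshold of minMAFforGRM = 1/(2N) keeps every polymorphic marker, including singletons, while a larger threshold starts dropping exactly the rare MAC categories needed for the variance ratio.

```python
def kept_for_grm(mac, n_samples, min_maf_for_grm):
    """Would a marker with this minor allele count pass the MAF filter?"""
    return mac / (2 * n_samples) >= min_maf_for_grm

N = 151028                              # sample size from the log above
print(kept_for_grm(1, N, 1 / (2 * N)))  # True: 1/(2N) keeps even singletons
print(kept_for_grm(1, N, 0.01))         # False: a 1% filter drops all rare MAC bins
```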

2. Dealing with the large files

What I was trying to say is: there's a built-in assumption that if I provide a plink file for step 1, then all SNPs in that file will be used for the GRM. But why does that have to be true? Why can't you have a setup where the user gives you a file with 10,000 SNPs and some code figures out "I only need 5,000 SNPs to make a good GRM" and then reads only those 5,000 into memory?

Put another way, I suggest separating "the size of the file" from "the size of the GRM", and figuring out the size of the GRM automatically regardless of how big the file is.

Yes, it means SAIGE has to do more work. But, to be honest, no one I talked to really understood how many SNPs to use, which SNPs to use, and so on. People need guidance on how to use the tool, and the best piece of guidance we have is your comment above! So why not take that knowledge and encode it into SAIGE itself?
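For what it's worth, the selection rule from the comment above could be encoded roughly like this (a sketch with invented names; it assumes a candidate pool of already LD-pruned markers with known minor allele counts):

```python
import random

def select_grm_markers(markers, n_common, per_rare_bin=100, seed=1):
    """Pick a GRM marker set from a larger candidate pool.

    markers: list of (marker_id, mac) pairs, assumed already LD-pruned.
    Keeps n_common markers with MAC > 20 (choose n_common larger than the
    sample size, per the guidance above) plus per_rare_bin markers from
    each rare MAC bin: 1, 2, 3, 4, 5, 6-10, 10-20.
    """
    rng = random.Random(seed)
    bins = {b: [] for b in ["1", "2", "3", "4", "5", "6-10", "10-20", "common"]}
    for marker_id, mac in markers:
        if mac < 1:
            continue                       # monomorphic: useless for the GRM
        elif mac <= 5:
            bins[str(mac)].append(marker_id)
        elif mac <= 10:
            bins["6-10"].append(marker_id)
        elif mac <= 20:
            bins["10-20"].append(marker_id)
        else:
            bins["common"].append(marker_id)
    chosen = []
    for name, ids in bins.items():
        want = n_common if name == "common" else per_rare_bin
        rng.shuffle(ids)
        chosen.extend(ids[:want])
    return chosen
```

With something like this inside step 1, the user could hand over an arbitrarily large plink file and only the selected subset would ever be read into memory.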