weizhouUMICH / SAIGE

GNU Lesser General Public License v3.0

step1_fitNULLGLMM: plink files from TOPMED imputation || Error for bad alloc #351

Closed complexgenome closed 2 years ago

complexgenome commented 3 years ago

Hi @weizhouUMICH

I have a PLINK file set for the autosomal chromosomes with 42,788,581 SNPs. These are high-quality SNPs from TOPMED imputation output. When I try to fit the null model using these data, I get a bad_alloc error:

Call:  glm(formula = formula.new, family = binomial, data = data.new)

Coefficients:
  minus1       SEX       PC1       PC2       PC3       AGE
 2.86446  -0.12464  -0.15576   0.12895   0.02129   0.84851

Degrees of Freedom: 7127 Total (i.e. Null);  7121 Residual
Null Deviance:      9880
Residual Deviance: 3134         AIC: 3146
Start fitting the NULL GLMM
   user  system elapsed
  4.904   3.684   6.593
   user  system elapsed
  6.216   5.480   6.849
[1] "Start reading genotype plink file here"
nbyte: 1809
nbyte: 1782
reserve: 76334825472

M: 42788581, N: 7234
size of genoVecofPointers: 39
bad_alloc caught1: std::bad_alloc

I would like to run a gene-based analysis, where SAIGE needs to fit the model using different MAC categories. The PLINK .bim and .bed files total ~75G.

I ran the analysis on a node with 50G of memory requested.
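As a rough check of why the allocation fails, here is a minimal sketch that assumes the genotypes are held in packed 2-bit form (four samples per byte, as in a PLINK .bed file). Under that assumption the first `nbyte` value in the log (1809) matches ceil(7234 / 4). The function names are hypothetical and this is not SAIGE's own code. The `reserve` value in the log (76334825472) is slightly larger than the estimate below, so treat the result as an approximation.

```python
import math


def packed_bytes_per_marker(n_samples: int) -> int:
    """Bytes needed for one marker when 4 genotypes are packed per byte."""
    return math.ceil(n_samples / 4)


def packed_genotype_bytes(n_markers: int, bytes_per_marker: int) -> int:
    """Total bytes for all markers at the given per-marker size."""
    return n_markers * bytes_per_marker


# Values taken from the log above.
M = 42_788_581          # "M: 42788581"
N = 7_234               # "N: 7234"
NBYTE_IN_MODEL = 1_782  # second "nbyte" line in the log

per_marker_all = packed_bytes_per_marker(N)  # compare with the first "nbyte" line
estimate = packed_genotype_bytes(M, NBYTE_IN_MODEL)

print(f"bytes per marker for all {N} samples: {per_marker_all}")
print(f"estimated genotype storage: {estimate / 1e9:.1f} GB "
      f"({estimate / 2**30:.1f} GiB)")
```

The estimate is well above the 50G requested for the job, which is consistent with the `std::bad_alloc`. Reducing the number of markers M is the most direct way to shrink it, since the per-marker size is fixed by the sample count.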


step1_fitNULLGLMM.R --plinkFile=CHR_all_MHAS_RSQ80 \
 --phenoFile=phenotype.txt --phenoCol=PHENO_FINAL --sexCol=SEX --covarColList=SEX,PC1,PC2,PC3,AGE --sampleIDColinphenoFile=TOPMED_IID \
 --traitType=binary --invNormalize=FALSE --outputPrefix=OUTPUT_GENEBASED/MHAS_null --outputPrefix_varRatio=OUTPUT_GENEBASED/variance_MHAS \
 --sparseGRMFile=sparse_matrix/sparseGRM_MHAS_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx --sparseGRMSampleIDFile=sparse_matrix/sparseGRM_MHAS_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt --nThreads=4 --LOCO=FALSE --skipModelFitting=FALSE --IsSparseKin=TRUE \
 --IsOverwriteVarianceRatioFile=TRUE --isCateVarianceRatio=TRUE --MaleCode=0 --FemaleCode=1

Is there a way to resolve this? Can I limit the SNPs to, say, a 5% MAF threshold for this step only?

best,

complexgenome commented 3 years ago

Hi @weizhouUMICH

Sorry to be a nag on this. Can you please help here?

weizhouUMICH commented 3 years ago

Hi @Sanjeev,

For step 1, you don't need all imputed markers. The markers in Step 1 are used to construct the GRM, and the number of markers only needs to be larger than the number of independent samples in the data set. Please check question 4 of the FAQ on this page: https://github.com/weizhouUMICH/SAIGE/wiki/Genetic-association-tests-using-SAIGE

Thanks, Wei
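One way to act on this advice is to draw a random subset of variant IDs from the .bim file and pass that list to a marker filter when building a smaller Step 1 file set. The sketch below uses reservoir sampling so the full .bim never has to be held in memory. It assumes the standard six-column .bim layout with the variant ID in the second column. The function name, the demo data, and the choice of `k` are illustrative assumptions, not part of SAIGE.

```python
import random
from typing import Iterable, List


def sample_variant_ids(bim_lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Return up to k variant IDs chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    seen = 0  # number of valid .bim records read so far
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        variant_id = fields[1]
        if seen < k:
            reservoir.append(variant_id)
        else:
            # Keep the new record with probability k / (seen + 1).
            j = rng.randint(0, seen)
            if j < k:
                reservoir[j] = variant_id
        seen += 1
    return reservoir


# Small in-memory demo; with real data, pass an open .bim file handle instead.
demo_lines = [f"1\trs{i}\t0\t{1000 + i}\tA\tG" for i in range(100)]
picked = sample_variant_ids(demo_lines, k=10)
print("\n".join(picked))
```

The resulting list can be written one ID per line and used with PLINK's `--extract` option to produce a much smaller file set. Purely random selection is a simplification; filtering by MAF or pruning for LD before sampling may be preferable.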

complexgenome commented 3 years ago

@weizhouUMICH

Thank you. I understand that a small number of markers is fine for the GRM. However, I am fitting the null model for gene-based tests, which needs SNPs (30 randomly selected) for different MACs (1-20).

--isCateVarianceRatio=TRUE

I got this from your reply at:

https://github.com/weizhouUMICH/SAIGE/issues/226#issuecomment-662220726
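To keep rare variants available for each MAC category without loading tens of millions of markers, one option is to thin the input to a fixed number of randomly chosen variants per MAC value. The sketch below is an illustration only, not SAIGE's internal selection procedure. It assumes you already have (variant ID, MAC) pairs, for example from an allele-count report. The function name and the defaults of 30 variants per category and MAC 1-20 are taken from the description above.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sample_per_mac(
    variant_macs: Iterable[Tuple[str, int]],
    per_bin: int = 30,
    max_mac: int = 20,
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Pick up to per_bin random variant IDs for each MAC value in 1..max_mac."""
    rng = random.Random(seed)
    bins: Dict[int, List[str]] = defaultdict(list)
    for variant_id, mac in variant_macs:
        if 1 <= mac <= max_mac:
            bins[mac].append(variant_id)
    # Sample without replacement; small bins are returned whole.
    return {
        mac: rng.sample(ids, min(per_bin, len(ids)))
        for mac, ids in sorted(bins.items())
    }


# Hypothetical demo input: 2000 variants with MAC cycling through 1..25.
demo = [(f"var{i}", (i % 25) + 1) for i in range(2000)]
selected = sample_per_mac(demo)
for mac, ids in selected.items():
    print(mac, len(ids))
```

This version keeps every rare variant ID in memory before sampling, which is fine for a sketch but could be replaced by per-bin reservoir sampling for very large inputs. The selected IDs, combined with a modest set of common markers for the GRM, would give a far smaller Step 1 file set than the full imputed data.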

complexgenome commented 3 years ago

Hi @weizhouUMICH

PLINK data don't have reliable SNPs with lower allele frequency.

I extracted the TOPMED-imputed SNPs (80% RSQ, MAF <= 1%). The number of SNPs is ~35 million. When I try to construct the kinship matrix, I get the bad_alloc error. I am using a node with 65G of memory.

nbyte: 1809
nbyte: 1782
reserve: 77072023552

M: 43201809, N: 7234
size of genoVecofPointers: 40
bad_alloc caught1: std::bad_alloc

Can you please guide me on how to select rare markers for gene-based association test model fitting?

weizhouUMICH commented 2 years ago

Sorry for the late reply! We have just released a new version, 1.0.0. It has substantial computational efficiency improvements for both Step 1 and Step 2, for single-variant and set-based tests, and clearer log output. We have created a new GitHub page for the program (https://github.com/saigegit/SAIGE) with documentation (https://saigegit.github.io/SAIGE-doc/). The program will be maintained there by multiple SAIGE developers. The docker image has been updated. Please feel free to try version 1.0.0 and report any issues.

Thanks! Wei