xinhe-lab / mapgen

R package to perform gene mapping using functionally-informed genetic fine mapping
https://xinhe-lab.github.io/mapgen/

Input XtX matrix contains NAs #8

Closed 1667857557 closed 5 months ago

1667857557 commented 8 months ago

Hi Dr. Luo,

When we use a reference panel generated from UK10K with our GWAS summary statistics, run_finemapping() fails with the error shown below. The same GWAS dataset runs without problems against a reference panel from 1000G. Could you advise on how to resolve this issue? Thank you in advance!

Huang

> susie.res <- run_finemapping(sumstats = gwas.sumstats, 
+                              bigSNP = bigSNP, 
+                              priortype = 'uniform', 
+                              n = 24009,
+                              L = 1)
Finemapping locus 1...
Run susie_rss...
WARNING: XtX is not symmetric; forcing XtX to be symmetric by replacing XtX with (XtX + t(XtX))/2
Error in susie_suff_stat(XtX = XtX, Xty = Xty, n = n, yty = (n - 1) *  : 
  Input XtX matrix contains NAs

kevinlkx commented 8 months ago

I suspect that there are mismatches between your GWAS summary statistics and the UK10K reference. Did you run the data preparation steps (process_gwas_sumstats() in particular) using the bigSNP object generated from the UK10K reference? Can you check whether there are NAs in sumstats$bigSNP_index? And do you get NAs in X if you run the line below?

X <- bigSNP$genotypes[, sumstats$bigSNP_index]
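
The two checks above can be sketched in base R on a toy genotype matrix. This is only an illustration: `G`, the index vector, and the helper `summarise_missing()` are hypothetical stand-ins and not part of mapgen or bigsnpr. With real data, `G` would correspond to bigSNP$genotypes and the index to sumstats$bigSNP_index.

```r
# Toy stand-in for bigSNP$genotypes (hypothetical values):
# rows are individuals, columns are reference-panel variants.
G <- matrix(c(0, 1, 2, NA,
              1, 1, 0, 2,
              NA, NA, 2, 1),
            nrow = 4,
            dimnames = list(NULL, c("rs1", "rs2", "rs3")))

# Hypothetical helper: report NAs in the index and in the selected genotypes.
summarise_missing <- function(G, idx) {
  n_na_index <- sum(is.na(idx))          # unmatched variants in the index
  idx <- idx[!is.na(idx)]                # keep only usable column indices
  X <- G[, idx, drop = FALSE]            # genotypes of the matched variants
  na_per_variant <- colSums(is.na(X))    # missing calls per variant
  list(
    n_na_index       = n_na_index,
    n_na_genotypes   = sum(na_per_variant),
    variants_with_na = names(na_per_variant)[na_per_variant > 0],
    na_rate          = na_per_variant / nrow(X)
  )
}

# Toy stand-in for sumstats$bigSNP_index, with one unmatched entry.
res <- summarise_missing(G, c(1L, 3L, NA))
print(res)
```

A per-variant breakdown like this shows whether the NAs come from the index (a matching problem) or from the genotypes themselves (a reference panel problem). For a panel too large to load into memory, bigstatsr's big_counts() may be a more practical way to get per-variant NA counts.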

1667857557 commented 8 months ago

Hi Dr. Luo,

Thanks for your prompt response! The output from our run is below. There are many missing values (NAs) in our bigSNP file. This is surprising, because the UK10K panel we used was generated by merging the European (EUR) population panel from the 1000G Phase 3 dataset with UK10K (ALSPAC and TWINSUK), comprising 4,285 individuals. The bed file is also large (50 GB): we filtered each panel on genotype missingness (--geno 0.05) before merging, but kept the low minor allele frequency (MAF) SNPs. Do you think this processing will affect the result?

Huang

> bigSNP <- snp_attach(rdsfile = 'D:/1kg.v3/UK10K_1KG.rds')
> gwas.sumstats <- process_gwas_sumstats(A, 
+                                        chr='CHR', 
+                                        pos='POS', 
+                                        beta='BETA', 
+                                        se='SE',
+                                        a0='other_allele', 
+                                        a1='effect_allele', 
+                                        snp='SNP', 
+                                        pval='P',
+                                        LD_Blocks=LD_blocks,
+                                        bigSNP=bigSNP)
Cleaning summary statistics...
Assigning GWAS SNPs to LD blocks...
Matching GWAS with bigSNP reference panel...
6,782,052 variants to be matched.
1,020,766 ambiguous SNPs have been removed.
Some duplicates were removed.
5,750,494 variants have been matched; 0 were flipped and 4,553,380 were reversed.
> library(bigsnpr)
> susie.res <- run_finemapping(sumstats = gwas.sumstats, 
+                              bigSNP = bigSNP, 
+                              priortype = 'uniform', 
+                              n = 24009,
+                              L = 1)
Finemapping locus 1...
Run susie_rss...
WARNING: XtX is not symmetric; forcing XtX to be symmetric by replacing XtX with (XtX + t(XtX))/2
Error in susie_suff_stat(XtX = XtX, Xty = Xty, n = n, yty = (n - 1) *  : 
  Input XtX matrix contains NAs

> X <- bigSNP$genotypes[, gwas.sumstats$bigSNP_index]
> na_count <- sum(is.na(X))
> na_count
[1] 328949726
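
As a rough sanity check on that count, the overall missing rate can be estimated from numbers already reported in this thread. The sketch assumes that X has one row per individual (4,285) and one column per matched variant (5,750,494); the actual length of bigSNP_index may differ slightly, so treat the result as approximate.

```r
# Numbers taken from the output above; the dimensions of X are an assumption.
n_individuals <- 4285
n_variants    <- 5750494      # "variants have been matched"
n_missing     <- 328949726    # sum(is.na(X))

# Fraction of genotype calls that are missing across the whole matrix.
missing_rate <- n_missing / (n_individuals * n_variants)
print(round(missing_rate, 4))
```

Under these assumptions the overall missing rate is roughly 1.3%. That is compatible with a --geno 0.05 filter, which drops variants with more than 5% missing calls but leaves the remaining missing calls in place.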

kevinlkx commented 8 months ago

Thanks for the information. I think the missing values (NAs) in your bigSNP file caused the problem. When given a bigSNP object, run_finemapping() uses the genotype data in bigSNP to compute the LD (R) matrices, and the current version does not allow missing values in the genotype data. The bigsnpr package that we use also does not work with missing values, but it provides functions for imputing missing values of genotyped variants (see https://github.com/privefl/bigsnpr). Alternatively, you could try other tools for genotype imputation.
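
To illustrate why a few missing genotypes are enough to trigger the error, here is a minimal base R sketch on toy data. It is not mapgen's internal code: `X` is a hypothetical genotype matrix, `impute_col_mean()` is a hypothetical helper, and column-mean imputation is only the simplest possible stand-in for a proper imputation method.

```r
# Toy genotype matrix (hypothetical): 20 individuals x 3 variants.
X <- cbind(rep(c(0, 1, 2, 1, 0), 4),
           rep(c(1, 0, 2, 2), 5),
           rep(c(0, 0, 1, 2, 1), 4))
X[c(3, 17), 2] <- NA   # two missing calls in the second variant

# With the default use = "everything", any NA in a column makes every
# correlation involving that column NA, so the LD matrix contains NAs.
R_raw <- cor(X)
print(anyNA(R_raw))    # TRUE

# Hypothetical helper: replace missing calls by the variant's mean genotype.
impute_col_mean <- function(X) {
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    X[miss, j] <- mean(X[, j], na.rm = TRUE)
  }
  X
}

X_imp <- impute_col_mean(X)
R_imp <- cor(X_imp)
print(anyNA(R_imp))    # FALSE
```

In bigsnpr, the analogous quick option is snp_fastImputeSimple(), which fills in missing genotypes per variant (for example by mode or mean). Whether a simple per-variant imputation is adequate for fine-mapping with this panel is a judgment call; it has not been tested on the data in this thread.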