zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
43 stars 12 forks source link

PCs are all NA when number of samples >50K #69

Closed jjfarrell closed 3 years ago

jjfarrell commented 3 years ago

We are getting an error when our sample size goes about 50,000. The PC-AIR step results in just NAs for the PCS. At 45k samples, it runs fine. Any suggestions on this issue?

PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10    sample.id
-0.0120138984629118     -0.00155873339950702    -0.00110230516591038    -9.77996252854696e-05   0.000395412453209808    -0.00152135811291193    -1.45423707942342e-05   -0.00201312188579735    -0.00394751800943769    -0.00282981157052519    08AD09144_NACC022453
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD8245_NACC657970
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD9080_NACC594499
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD9797
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD10457
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD10987
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11073
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11218
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11219

Here is the code...

kingMat <- readRDS("adgc.kingMat.rds")

sampleset <- pcairPartition(kinobj =kingMat,divobj =kingMat)
length(sampleset$rels)
length(sampleset$unrels)

#pca on unrels
pca.unrel <- snpgdsPCA(gds_obj, sample.id=sampleset$unrels,num.thread = coreNum)
saveRDS(pca.unrel, "pca.unrel.rds")
# project values for relatives
snp.load  <- snpgdsPCASNPLoading(pca.unrel, gdsobj=gds_obj,num.thread = coreNum)
samp.load <- snpgdsPCASampLoading(snp.load, gdsobj=gds_obj, sample.id=sampleset$rels,num.thread = coreNum)

Here is the session info:

R```
 version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /share/pkg.7/r/4.0.0/install/lib/libopenblasp-r0.3.7.so
LAPACK: /share/pkg.7/r/4.0.0/install/lib/liblapack.so.3.9.0

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base    

other attached packages:
[1] GENESIS_2.21.4      GWASTools_1.36.0    Biobase_2.47.3    
[4] BiocGenerics_0.33.3 SNPRelate_1.24.0    gdsfmt_1.23.13    

loaded via a namespace (and not attached):
 [1] zoo_1.8-8              tidyselect_1.1.0       purrr_0.3.4          
 [4] DNAcopy_1.61.0         splines_4.0.0          lattice_0.20-41      
 [7] vctrs_0.3.0            generics_0.0.2         stats4_4.0.0          
[10] GWASExactHW_1.01       mgcv_1.8-31            blob_1.2.1            
[13] survival_3.1-12        rlang_0.4.6            pillar_1.4.4          
[16] glue_1.4.1             DBI_1.1.0              bit64_0.9-7          
[19] GenomeInfoDbData_1.2.3 foreach_1.5.0          lifecycle_0.2.0      
[22] zlibbioc_1.33.1        Biostrings_2.55.7      MatrixModels_0.4-1    
[25] codetools_0.2-16       memoise_1.1.0          SeqArray_1.27.18      
[28] IRanges_2.21.8         SparseM_1.78           GenomeInfoDb_1.23.17  
[31] lmtest_0.9-37          quantreg_5.55          broom_0.5.6          
[34] Rcpp_1.0.4.6           backports_1.1.7        quantsmooth_1.53.0    
[37] S4Vectors_0.25.15      XVector_0.27.2         bit_1.1-15.2          
[40] digest_0.6.25          dplyr_1.0.0            GenomicRanges_1.39.3  
[43] grid_4.0.0             bitops_1.0-6           sandwich_2.5-1        
[46] magrittr_1.5           RCurl_1.98-1.2         tibble_3.0.1          
[49] RSQLite_2.2.0          mice_3.9.0             crayon_1.3.4          
[52] tidyr_1.1.0            pkgconfig_2.0.3        ellipsis_0.3.1        
[55] Matrix_1.2-18          data.table_1.14.0      logistf_1.23          
[58] iterators_1.0.12       SeqVarTools_1.25.2     R6_2.4.1              
[61] nlme_3.1-148           compiler_4.0.0   

Here is the log, this time the script run through but the PCs results are NaNs: /restricted/projectnb/adgc/zhucc/ADGC_data/pheno/PCs/script_53k_twoStepsPCS/adgc.pc-air.pcs.txt

Working with 52877 samples
Identifying relatives for each sample using kinship threshold 0.0220970869120796
Identifying pairs of divergent samples using divergence threshold -0.0220970869120796
Partitioning samples into unrelated and related sets...
    ...1000 samples added to related.set...
    ...2000 samples added to related.set...
    ...3000 samples added to related.set...
[1] 3721
[1] 49156
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP on non-autosomes
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 1857235516
CPU capabilities: Double-Precision SSE2
Fri Apr 16 22:45:20 2021    (internal increment: 64)
[==================================================] 100%, completed, 43.8m
Fri Apr 16 23:29:14 2021    Begin (eigenvalues and eigenvectors)
Sat Apr 17 10:09:42 2021    Done.
SNP Loading:
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    using the top 32 eigenvectors
SNP Loading:    the sum of all selected genotypes (0,1,2) = 1857235516
Sat Apr 17 10:09:50 2021    (internal increment: 444)
[==================================================] 100%, completed, 7s  
Sat Apr 17 10:09:57 2021    Done.
Sample Loading:
    # of samples: 3,721
    # of SNPs: 87,468
    using 28 threads
    using the top 32 eigenvectors
Sample Loading:    the sum of all selected genotypes (0,1,2) = 140973579
Sat Apr 17 10:10:01 2021    (internal increment: 5908)
[==================================================] 100%, completed, 4s  
Sat Apr 17 10:10:05 2021    Done.
jjfarrell commented 3 years ago

Will resubmit on SNPRelate.