zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate
98 stars 25 forks source link

PC-AIR issue when over 50k samples. #86

Open jjfarrell opened 3 years ago

jjfarrell commented 3 years ago

We are getting an error when our sample size goes above 50,000. The PC-AIR step results in just NAs for all the PCS. At 45k samples, it runs fine. Originally reported this issue to GENESIS (https://github.com/UW-GAC/GENESIS/issues/64) Any suggestions on this issue?

PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10    sample.id
-0.0120138984629118     -0.00155873339950702    -0.00110230516591038    -9.77996252854696e-05   0.000395412453209808    -0.00152135811291193    -1.45423707942342e-05   -0.00201312188579735    -0.00394751800943769    -0.00282981157052519    08AD09144_NACC022453
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD8245_NACC657970
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD9080_NACC594499
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD9797
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD10457
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD10987
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11073
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11218
NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      08AD11219

Here is the code...

kingMat <- readRDS("adgc.kingMat.rds")

sampleset <- pcairPartition(kinobj =kingMat,divobj =kingMat)
length(sampleset$rels)
length(sampleset$unrels)

#pca on unrels
pca.unrel <- snpgdsPCA(gds_obj, sample.id=sampleset$unrels,num.thread = coreNum)
saveRDS(pca.unrel, "pca.unrel.rds")
# project values for relatives
snp.load  <- snpgdsPCASNPLoading(pca.unrel, gdsobj=gds_obj,num.thread = coreNum)
samp.load <- snpgdsPCASampLoading(snp.load, gdsobj=gds_obj, sample.id=sampleset$rels,num.thread = coreNum)

Here is the session info:

R```
 version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /share/pkg.7/r/4.0.0/install/lib/libopenblasp-r0.3.7.so
LAPACK: /share/pkg.7/r/4.0.0/install/lib/liblapack.so.3.9.0

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base    

other attached packages:
[1] GENESIS_2.21.4      GWASTools_1.36.0    Biobase_2.47.3    
[4] BiocGenerics_0.33.3 SNPRelate_1.24.0    gdsfmt_1.23.13    

loaded via a namespace (and not attached):
 [1] zoo_1.8-8              tidyselect_1.1.0       purrr_0.3.4          
 [4] DNAcopy_1.61.0         splines_4.0.0          lattice_0.20-41      
 [7] vctrs_0.3.0            generics_0.0.2         stats4_4.0.0          
[10] GWASExactHW_1.01       mgcv_1.8-31            blob_1.2.1            
[13] survival_3.1-12        rlang_0.4.6            pillar_1.4.4          
[16] glue_1.4.1             DBI_1.1.0              bit64_0.9-7          
[19] GenomeInfoDbData_1.2.3 foreach_1.5.0          lifecycle_0.2.0      
[22] zlibbioc_1.33.1        Biostrings_2.55.7      MatrixModels_0.4-1    
[25] codetools_0.2-16       memoise_1.1.0          SeqArray_1.27.18      
[28] IRanges_2.21.8         SparseM_1.78           GenomeInfoDb_1.23.17  
[31] lmtest_0.9-37          quantreg_5.55          broom_0.5.6          
[34] Rcpp_1.0.4.6           backports_1.1.7        quantsmooth_1.53.0    
[37] S4Vectors_0.25.15      XVector_0.27.2         bit_1.1-15.2          
[40] digest_0.6.25          dplyr_1.0.0            GenomicRanges_1.39.3  
[43] grid_4.0.0             bitops_1.0-6           sandwich_2.5-1        
[46] magrittr_1.5           RCurl_1.98-1.2         tibble_3.0.1          
[49] RSQLite_2.2.0          mice_3.9.0             crayon_1.3.4          
[52] tidyr_1.1.0            pkgconfig_2.0.3        ellipsis_0.3.1        
[55] Matrix_1.2-18          data.table_1.14.0      logistf_1.23          
[58] iterators_1.0.12       SeqVarTools_1.25.2     R6_2.4.1              
[61] nlme_3.1-148           compiler_4.0.0   

Here is the log, this time the script run through but the PCs results are NaNs: /restricted/projectnb/adgc/zhucc/ADGC_data/pheno/PCs/script_53k_twoStepsPCS/adgc.pc-air.pcs.txt

Working with 52877 samples
Identifying relatives for each sample using kinship threshold 0.0220970869120796
Identifying pairs of divergent samples using divergence threshold -0.0220970869120796
Partitioning samples into unrelated and related sets...
    ...1000 samples added to related.set...
    ...2000 samples added to related.set...
    ...3000 samples added to related.set...
[1] 3721
[1] 49156
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP on non-autosomes
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 1857235516
CPU capabilities: Double-Precision SSE2
Fri Apr 16 22:45:20 2021    (internal increment: 64)
[==================================================] 100%, completed, 43.8m
Fri Apr 16 23:29:14 2021    Begin (eigenvalues and eigenvectors)
Sat Apr 17 10:09:42 2021    Done.
SNP Loading:
    # of samples: 49,156
    # of SNPs: 87,468
    using 28 threads
    using the top 32 eigenvectors
SNP Loading:    the sum of all selected genotypes (0,1,2) = 1857235516
Sat Apr 17 10:09:50 2021    (internal increment: 444)
[==================================================] 100%, completed, 7s  
Sat Apr 17 10:09:57 2021    Done.
Sample Loading:
    # of samples: 3,721
    # of SNPs: 87,468
    using 28 threads
    using the top 32 eigenvectors
Sample Loading:    the sum of all selected genotypes (0,1,2) = 140973579
Sat Apr 17 10:10:01 2021    (internal increment: 5908)
[==================================================] 100%, completed, 4s  
Sat Apr 17 10:10:05 2021    Done.
jjfarrell commented 3 years ago

On this run, we got a segfault memory error.

Working with 52877 samples Identifying relatives for each sample using kinship threshold 0.0220970869120796 Identifying pairs of divergent samples using divergence threshold -0.0220970869120796 Partitioning samples into unrelated and related sets... ...1000 samples added to related.set... ...2000 samples added to related.set... ...3000 samples added to related.set... [1] 3721 [1] 49156 Principal Component Analysis (PCA) on genotypes: Calculating allele counts/frequencies ... [==================================================] 100%, completed, 2s (process 1)

of selected variants: 87,468

# of samples: 49,156
# of SNVs: 87,468
using 28 threads
# of principal components: 3

CPU capabilities: Double-Precision SSE2 Thu Apr 29 13:05:20 2021 (internal increment: 64) [==================================================] 100%, completed, 42.5m Thu Apr 29 13:47:57 2021 Begin (eigenvalues and eigenvectors)

caught segfault address 0x18d8cb0, cause 'memory not mapped'

Traceback: 1: snpgdsPCA(gds_obj, sample.id =sampleset$unrels, num.thread = coreNum, eigen.cnt = pcNum) An irrecoverable exception occurred. R is aborting now ... /var/spool/sge/scc-gh1/job_scripts/6266830: line 7: 36415 Segmentation fault Rscript --vanilla $1

zhengxwen commented 3 years ago

Did you check the memory usage? Ran out of memory?

jjfarrell commented 3 years ago

We are running with 196GB of memory with 28  cores.  What is a recommended amount of memory per core be?  

jjfarrell commented 3 years ago

We were able to run the PCs on 45k unrelated and then project it on the remaining 8k samples successfully.

GraceSheng commented 3 years ago

I encountered similar issues as jjfarrell mentioned but haven’t found a solution yet. I was running 76K samples for 33,765 SNPs, using PLINK bed/bim/fam files. I was able to get the PCs for 45K samples before by using R 3.6.3/GENESIS_2.16.1. When the sample size goes up to 76K, I got “segfault” error. After reading this post, I upgraded both R and GENESIS to the latest versions. I did two runs:

  1. using R 4.1.0/GENESIS_2.22.1, 40 cores, 1T mem, the PCair gave NAs for principal components except for the 1st sample row which has values. The eigenvalue file looks fine.
  2. using R 4.1.0/GENESIS_2.22.1, 12 cores, 1T mem, added “export OMP_NUM_THREADS=1” in the shell script before running the PCair/PCrelate R script, the PCair gave “segfault”

session info:

R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /spack/apps/linux-centos7-x86_64/gcc-8.3.0/openblas-0.3.8-2no6mfziiclwxb7lstxoos335gnhjpes/lib/libopenblasp-r0.3.8.so

locale: [1] C

attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base

other attached packages: [1] GWASTools_1.38.0 Biobase_2.52.0 BiocGenerics_0.38.0 [4] SNPRelate_1.26.0 gdsfmt_1.28.0 GENESIS_2.22.1 [7] dplyr_1.0.6 data.table_1.14.0

loaded via a namespace (and not attached): [1] quantsmooth_1.58.0 Rcpp_1.0.6 lattice_0.20-44 [4] tidyr_1.1.3 Biostrings_2.60.1 formula.tools_1.7.1 [7] zoo_1.8-9 assertthat_0.2.1 lmtest_0.9-38 [10] foreach_1.5.1 utf8_1.2.1 R6_2.5.0 [13] GenomeInfoDb_1.28.0 backports_1.2.1 MatrixModels_0.5-0 [16] stats4_4.1.0 RSQLite_2.2.7 pillar_1.6.1 [19] zlibbioc_1.38.0 rlang_0.4.11 rstudioapi_0.13 [22] SparseM_1.81 blob_1.2.1 S4Vectors_0.30.0 [25] Matrix_1.3-4 splines_4.1.0 GWASExactHW_1.01 [28] RCurl_1.98-1.3 bit_4.0.4 broom_0.7.6 [31] compiler_4.1.0 pkgconfig_2.0.3 mgcv_1.8-36 [34] tidyselect_1.1.1 tibble_3.1.2 GenomeInfoDbData_1.2.6 [37] DNAcopy_1.66.0 IRanges_2.26.0 codetools_0.2-18 [40] matrixStats_0.59.0 fansi_0.5.0 crayon_1.4.1 [43] conquer_1.0.2 bitops_1.0-7 grid_4.1.0 [46] nlme_3.1-152 lifecycle_1.0.0 DBI_1.1.1 [49] magrittr_2.0.1 SeqVarTools_1.30.0 cachem_1.0.5 [52] XVector_0.32.0 mice_3.13.0 ellipsis_0.3.2 [55] generics_0.1.0 vctrs_0.3.8 sandwich_3.0-1 [58] iterators_1.0.13 tools_4.1.0 bit64_4.0.5 [61] glue_1.4.2 purrr_0.3.4 fastmap_1.1.0 [64] survival_3.2-11 SeqArray_1.32.0 GenomicRanges_1.44.0 [67] operator.tools_1.6.3 memoise_2.0.0 logistf_1.24 [70] quantreg_5.86

PCair script is something like the following:

geno = GdsGenotypeReader(filename = paste(PLINKFilePrefix,".gds",sep="")) genoData = GenotypeData(geno) print("start PCair 1") pcair1 = pcair(genoData, num.cores=40, kinobj = KINGmat, divobj = KINGmat) print("end PCair 1")

################################## Run 1: using R 4.1.0/GENESIS_2.22.1, 40 cores, 1T mem, the PCair gave NAs for principal components except for the 1st sample ################################## ====== start run_PCair_PCrelate_1st_iter_V1_largemem.R ======

Attaching package: 'dplyr' The following objects are masked from 'package:data.table': between, first, last The following objects are masked from 'package:stats': filter, lag The following objects are masked from 'package:base': intersect, setdiff, setequal, union

Loading required package: gdsfmt SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2) Loading required package: Biobase Loading required package: BiocGenerics Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:dplyr':

combine, intersect, setdiff, union

The following objects are masked from 'package:stats':

IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, basename, cbind, colnames, dirname, do.call,
duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
tapply, union, unique, unsplit, which.max, which.min

Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

[1] "workDir = ./" [1] "PLINKFilePrefix = test_76K_num" [1] "kinMatrixFileName = test_76K_num.kc.matrix.txt" [1] "iteration = 1" [1] "KINGmat: 76000 x 76000" Start file conversion from PLINK BED to SNP GDS ... BED file: 'test_76K_num.bed' SNP-major mode (Sample X SNP), 611.8M FAM file: 'test_76K_num.fam' BIM file: 'test_76K_num.bim' Tue Jun 8 19:28:07 2021 (store sample id, snp id, position, and chromosome) start writing: 76000 samples, 33765 SNPs ...

[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 36s Tue Jun 8 19:28:43 2021 Done. Optimize the access efficiency ... Clean up the fragments of GDS file: open the file 'test_76K_num.gds' (612.1M)

of fragments: 39

save to 'test_76K_num.gds.tmp'
rename 'test_76K_num.gds.tmp' (612.1M, reduced: 252B)
# of fragments: 18

[1] "start PCair 1" Using kinobj and divobj to partition samples into unrelated and related sets Working with 76000 samples Identifying relatives for each sample using kinship threshold 0.0220970869120796 Identifying pairs of divergent samples using divergence threshold -0.0220970869120796 Partitioning samples into unrelated and related sets... ...1000 samples added to related.set... ...2000 samples added to related.set... ...3000 samples added to related.set... ...4000 samples added to related.set... ...5000 samples added to related.set... ...6000 samples added to related.set... ...7000 samples added to related.set... ...8000 samples added to related.set... ...9000 samples added to related.set... ...10000 samples added to related.set... Unrelated Set: 65135 Samples Related Set: 10865 Samples Performing PCA on the Unrelated Set... Principal Component Analysis (PCA) on genotypes: Excluding 0 SNP on non-autosomes Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)

of samples: 65,135

# of SNPs: 33,765
using 40 threads
# of principal components: 32

PCA: the sum of all selected genotypes (0,1,2) = 1275566990 CPU capabilities: Double-Precision SSE2 Tue Jun 8 20:03:23 2021 (internal increment: 64) [..................................................] 0%, ETC: --- [=>................................................] 1%, ETC: 1.7h [=>................................................] 2%, ETC: 1.5h [==>...............................................] 3%, ETC: 1.4h [==>...............................................] 4%, ETC: 1.3h [===>..............................................] 5%, ETC: 1.2h [===>..............................................] 6%, ETC: 1.2h [====>.............................................] 7%, ETC: 1.1h [====>.............................................] 8%, ETC: 1.1h [=====>............................................] 9%, ETC: 1.1h [=====>............................................] 10%, ETC: 1.0h [======>...........................................] 11%, ETC: 1.0h [======>...........................................] 12%, ETC: 59.9m [=======>..........................................] 13%, ETC: 58.7m [=======>..........................................] 14%, ETC: 57.8m [========>.........................................] 15%, ETC: 56.7m [========>.........................................] 16%, ETC: 55.5m [=========>........................................] 17%, ETC: 54.5m [=========>........................................] 18%, ETC: 53.8m [==========>.......................................] 19%, ETC: 52.9m [==========>.......................................] 20%, ETC: 52.2m [===========>......................................] 21%, ETC: 49.7m [===========>......................................] 22%, ETC: 48.2m [============>.....................................] 23%, ETC: 47.1m [============>.....................................] 24%, ETC: 46.4m [=============>....................................] 25%, ETC: 46.1m [=============>....................................] 26%, ETC: 45.5m [==============>...................................] 27%, ETC: 44.7m [==============>...................................] 28%, ETC: 44.2m [===============>..................................] 29%, ETC: 43.3m [===============>..................................] 30%, ETC: 42.8m [================>.................................] 31%, ETC: 42.1m [================>.................................] 32%, ETC: 41.3m [=================>................................] 33%, ETC: 40.9m [=================>................................] 34%, ETC: 40.4m [==================>...............................] 35%, ETC: 39.6m [==================>...............................] 36%, ETC: 39.4m [===================>..............................] 37%, ETC: 38.7m [===================>..............................] 38%, ETC: 37.9m [====================>.............................] 39%, ETC: 37.3m [====================>.............................] 40%, ETC: 36.5m [=====================>............................] 41%, ETC: 36.0m [=====================>............................] 42%, ETC: 35.6m [======================>...........................] 43%, ETC: 35.1m [======================>...........................] 44%, ETC: 34.2m [=======================>..........................] 45%, ETC: 33.4m [=======================>..........................] 46%, ETC: 32.8m [========================>.........................] 47%, ETC: 32.4m [========================>.........................] 48%, ETC: 31.8m [=========================>........................] 49%, ETC: 31.3m [=========================>........................] 50%, ETC: 30.8m [==========================>.......................] 51%, ETC: 30.2m [==========================>.......................] 52%, ETC: 29.6m [===========================>......................] 53%, ETC: 28.7m [===========================>......................] 54%, ETC: 28.2m [============================>.....................] 55%, ETC: 27.6m [============================>.....................] 56%, ETC: 26.9m [=============================>....................] 57%, ETC: 26.4m [=============================>....................] 58%, ETC: 26.0m [==============================>...................] 59%, ETC: 25.2m [==============================>...................] 60%, ETC: 24.6m [===============================>..................] 61%, ETC: 24.0m [===============================>..................] 62%, ETC: 23.3m [================================>.................] 63%, ETC: 22.7m [================================>.................] 64%, ETC: 22.1m [=================================>................] 65%, ETC: 21.5m [=================================>................] 66%, ETC: 20.8m [==================================>...............] 67%, ETC: 20.2m [==================================>...............] 68%, ETC: 19.6m [===================================>..............] 69%, ETC: 18.9m [===================================>..............] 70%, ETC: 18.4m [====================================>.............] 71%, ETC: 17.8m [====================================>.............] 72%, ETC: 17.2m [=====================================>............] 73%, ETC: 16.5m [=====================================>............] 74%, ETC: 15.9m [======================================>...........] 75%, ETC: 15.3m [======================================>...........] 76%, ETC: 14.7m [=======================================>..........] 77%, ETC: 14.0m [=======================================>..........] 78%, ETC: 13.3m [========================================>.........] 79%, ETC: 12.7m [========================================>.........] 80%, ETC: 12.1m [=========================================>........] 81%, ETC: 11.5m [=========================================>........] 82%, ETC: 11.0m [==========================================>.......] 83%, ETC: 10.4m [==========================================>.......] 84%, ETC: 9.7m [===========================================>......] 85%, ETC: 9.2m [===========================================>......] 86%, ETC: 8.6m [============================================>.....] 87%, ETC: 8.0m [============================================>.....] 88%, ETC: 7.3m [=============================================>....] 89%, ETC: 6.7m [=============================================>....] 90%, ETC: 6.1m [==============================================>...] 91%, ETC: 5.4m [==============================================>...] 92%, ETC: 4.8m [===============================================>..] 93%, ETC: 4.3m [===============================================>..] 94%, ETC: 3.7m [================================================>.] 95%, ETC: 3.0m [================================================>.] 96%, ETC: 2.4m [=================================================>] 97%, ETC: 1.8m [=================================================>] 98%, ETC: 1.1m [==================================================] 99%, ETC: 32s [==================================================] 100%, completed, 1.0h Tue Jun 8 21:06:07 2021 Begin (eigenvalues and eigenvectors) Fri Jun 11 01:05:21 2021 Done. Predicting PC Values for the Related Set... SNP Loading:

of samples: 65,135

# of SNPs: 33,765
using 40 threads
using the top 32 eigenvectors

SNP Loading: the sum of all selected genotypes (0,1,2) = 1275566990 Fri Jun 11 01:05:32 2021 (internal increment: 380)

[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 10s Fri Jun 11 01:05:42 2021 Done. Sample Loading:

of samples: 10,865

# of SNPs: 33,765
using 40 threads
using the top 32 eigenvectors

Sample Loading: the sum of all selected genotypes (0,1,2) = 212141863 Fri Jun 11 01:05:49 2021 (internal increment: 2288)

[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 8s Fri Jun 11 01:05:57 2021 Done. [1] "end PCair 1"

############################################################ run 2: using R 4.1.0/GENESIS_2.22.1, 12 cores, 1T mem, added "export OMP_NUM_THREADS=1", the PCair gave "segfault" ############################################################

====== start run_PCair_PCrelate_1st_iter_V1_largemem.R ======

Attaching package: 'dplyr' The following objects are masked from 'package:data.table': between, first, last The following objects are masked from 'package:stats': filter, lag The following objects are masked from 'package:base': intersect, setdiff, setequal, union

Loading required package: gdsfmt SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2) Loading required package: Biobase Loading required package: BiocGenerics Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:dplyr':

combine, intersect, setdiff, union

The following objects are masked from 'package:stats':

IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, basename, cbind, colnames, dirname, do.call,
duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
tapply, union, unique, unsplit, which.max, which.min

Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

[1] "workDir = ./" [1] "PLINKFilePrefix = test_76K_num" [1] "kinMatrixFileName = test_76K_num.kc.matrix.txt" [1] "iteration = 1" [1] "KINGmat: 76000 x 76000" Start file conversion from PLINK BED to SNP GDS ... BED file: 'test_76K_num.bed' SNP-major mode (Sample X SNP), 611.8M FAM file: 'test_76K_num.fam' BIM file: 'test_76K_num.bim' Thu Jun 17 16:50:24 2021 (store sample id, snp id, position, and chromosome) start writing: 76000 samples, 33765 SNPs ...

[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 9s Thu Jun 17 16:50:33 2021 Done. Optimize the access efficiency ... Clean up the fragments of GDS file: open the file 'test_76K_num.gds' (612.1M)

of fragments: 39

save to 'test_76K_num.gds.tmp'
rename 'test_76K_num.gds.tmp' (612.1M, reduced: 252B)
# of fragments: 18

[1] "start PCair 1" Using kinobj and divobj to partition samples into unrelated and related sets Working with 76000 samples Identifying relatives for each sample using kinship threshold 0.0220970869120796 Identifying pairs of divergent samples using divergence threshold -0.0220970869120796 Partitioning samples into unrelated and related sets... ...1000 samples added to related.set... ...2000 samples added to related.set... ...3000 samples added to related.set... ...4000 samples added to related.set... ...5000 samples added to related.set... ...6000 samples added to related.set... ...7000 samples added to related.set... ...8000 samples added to related.set... ...9000 samples added to related.set... ...10000 samples added to related.set... Unrelated Set: 65135 Samples Related Set: 10865 Samples Performing PCA on the Unrelated Set... Principal Component Analysis (PCA) on genotypes: Excluding 0 SNP on non-autosomes Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)

of samples: 65,135

# of SNPs: 33,765
using 12 threads
# of principal components: 32

PCA: the sum of all selected genotypes (0,1,2) = 1275566990 CPU capabilities: Double-Precision SSE2 Thu Jun 17 17:24:23 2021 (internal increment: 64) [..................................................] 0%, ETC: --- [=>................................................] 1%, ETC: 3.5h [=>................................................] 2%, ETC: 3.4h [==>...............................................] 3%, ETC: 3.3h [==>...............................................] 4%, ETC: 3.3h [===>..............................................] 5%, ETC: 3.2h [===>..............................................] 6%, ETC: 3.2h [====>.............................................] 7%, ETC: 3.1h [====>.............................................] 8%, ETC: 3.1h [=====>............................................] 9%, ETC: 3.1h [=====>............................................] 10%, ETC: 3.0h [======>...........................................] 11%, ETC: 3.0h [======>...........................................] 12%, ETC: 2.9h [=======>..........................................] 13%, ETC: 2.9h [=======>..........................................] 14%, ETC: 2.9h [========>.........................................] 15%, ETC: 2.8h [========>.........................................] 16%, ETC: 2.8h [=========>........................................] 17%, ETC: 2.8h [=========>........................................] 18%, ETC: 2.7h [==========>.......................................] 19%, ETC: 2.7h [==========>.......................................] 20%, ETC: 2.6h [===========>......................................] 21%, ETC: 2.6h [===========>......................................] 22%, ETC: 2.6h [============>.....................................] 23%, ETC: 2.5h [============>.....................................] 24%, ETC: 2.5h [=============>....................................] 25%, ETC: 2.5h [=============>....................................] 26%, ETC: 2.4h [==============>...................................] 27%, ETC: 2.4h [==============>...................................] 28%, ETC: 2.4h [===============>..................................] 29%, ETC: 2.4h [===============>..................................] 30%, ETC: 2.3h [================>.................................] 31%, ETC: 2.3h [================>.................................] 32%, ETC: 2.3h [=================>................................] 33%, ETC: 2.2h [=================>................................] 34%, ETC: 2.2h [==================>...............................] 35%, ETC: 2.2h [==================>...............................] 36%, ETC: 2.2h [===================>..............................] 37%, ETC: 2.1h [===================>..............................] 38%, ETC: 2.1h [====================>.............................] 39%, ETC: 2.1h [====================>.............................] 40%, ETC: 2.0h [=====================>............................] 41%, ETC: 2.0h [=====================>............................] 42%, ETC: 2.0h [======================>...........................] 43%, ETC: 1.9h [======================>...........................] 44%, ETC: 1.9h [=======================>..........................] 45%, ETC: 1.9h [=======================>..........................] 46%, ETC: 1.8h [========================>.........................] 47%, ETC: 1.8h [========================>.........................] 48%, ETC: 1.8h [=========================>........................] 49%, ETC: 1.7h [=========================>........................] 50%, ETC: 1.7h [==========================>.......................] 51%, ETC: 1.7h [==========================>.......................] 52%, ETC: 1.6h [===========================>......................] 53%, ETC: 1.6h [===========================>......................] 54%, ETC: 1.6h [============================>.....................] 55%, ETC: 1.5h [============================>.....................] 56%, ETC: 1.5h [=============================>....................] 57%, ETC: 1.5h [=============================>....................] 58%, ETC: 1.4h [==============================>...................] 59%, ETC: 1.4h [==============================>...................] 60%, ETC: 1.4h [===============================>..................] 61%, ETC: 1.4h [===============================>..................] 62%, ETC: 1.3h [================================>.................] 63%, ETC: 1.3h [================================>.................] 64%, ETC: 1.3h [=================================>................] 65%, ETC: 1.2h [=================================>................] 66%, ETC: 1.2h [==================================>...............] 67%, ETC: 1.1h [==================================>...............] 68%, ETC: 1.1h [===================================>..............] 69%, ETC: 1.1h [===================================>..............] 70%, ETC: 1.1h [====================================>.............] 71%, ETC: 1.0h [====================================>.............] 72%, ETC: 59.5m [=====================================>............] 73%, ETC: 57.0m [=====================================>............] 74%, ETC: 54.9m [======================================>...........] 75%, ETC: 53.1m [======================================>...........] 76%, ETC: 51.4m [=======================================>..........] 77%, ETC: 49.1m [=======================================>..........] 78%, ETC: 47.0m [========================================>.........] 79%, ETC: 45.0m [========================================>.........] 80%, ETC: 42.7m [=========================================>........] 81%, ETC: 40.7m [=========================================>........] 82%, ETC: 38.7m [==========================================>.......] 83%, ETC: 36.8m [==========================================>.......] 84%, ETC: 34.4m [===========================================>......] 85%, ETC: 32.2m [===========================================>......] 86%, ETC: 30.3m [============================================>.....] 87%, ETC: 28.4m [============================================>.....] 88%, ETC: 25.9m [=============================================>....] 89%, ETC: 23.8m [=============================================>....] 90%, ETC: 21.7m [==============================================>...] 91%, ETC: 19.2m [==============================================>...] 92%, ETC: 17.1m [===============================================>..] 93%, ETC: 15.0m [===============================================>..] 94%, ETC: 13.0m [================================================>.] 95%, ETC: 10.5m [================================================>.] 96%, ETC: 8.4m [=================================================>] 97%, ETC: 6.4m [=================================================>] 98%, ETC: 3.9m [==================================================] 99%, ETC: 1.9m [==================================================] 100%, completed, 3.5h Thu Jun 17 20:53:09 2021 Begin (eigenvalues and eigenvectors)

caught segfault address 0x229eef0, cause 'memory not mapped'

Traceback: 1: snpgdsPCA(gdsobj, sample.id = part$unrels, snp.id = snp.include, num.thread = num.cores, verbose = verbose, ...) 2: withCallingHandlers(expr, message = function(c) if (inherits(c, classes)) tryInvokeRestart("muffleMessage")) 3: suppressMessages(snpgdsPCA(gdsobj, sample.id = part$unrels, snp.id = snp.include, num.thread = num.cores, verbose = verbose, ...)) 4: .pcair(gdsobj = gdsobj, kinobj = kinobj, divobj = divobj, kin.thresh = kin.thresh, div.thresh = div.thresh, unrel.set = unrel.set, sample.include = sample.include, snp.include = snp.include, num.cores = num.cores, verbose = verbose, ...) 5: .local(gdsobj, ...) 6: pcair(gdsobj@handler, ...) 7: pcair(gdsobj@handler, ...) 8: pcair(gdsobj@data, ...) 9: pcair(gdsobj@data, ...) 10: pcair(genoData, num.cores = 12, kinobj = KINGmat, divobj = KINGmat) 11: pcair(genoData, num.cores = 12, kinobj = KINGmat, divobj = KINGmat) An irrecoverable exception occurred. R is aborting now ... /var/spool/slurm/d/job4806887/slurm_script: line 109: 39833 Segmentation fault Rscript run_PCair_PCrelate_1st_iter_V1_largemem.R $workDir "$plinkPrefix"_num $kcMatrixFileName run_PCair_PCrelate_1st_iter_V1_largemem.R failed

Any insights on what has caused this or how to get around this? The memory utilized for both runs is around 220 G.

Thank you!

zhengxwen commented 3 years ago

I guess you should use the fastPCA option in snpgdsPCA() (i.e., algorithm="randomized"). Please ask Stephanie for this option, I remember they successfully used fastPCA in PCRelate on very large datasets.

smgogarten commented 3 years ago

The fastPCA option is already available in GENESIS, as any additional arguments to the pcair function are passed directly to snpgdsPCA.

GraceSheng commented 3 years ago

Xiuwen and Stephanie, thank you so much for your suggestion! I have added algorithm="randomized" to the PCair function and my 76K sample dataset was able to run through pcair without problem.

GraceSheng commented 3 years ago

Hello Xiuwen and Stephanie, when I took the PCs estimated from PCair and tried to run my 76K sample set through pcrelate, it failed with the following error message:

Error in rbindlist(l, use.names, fill, idcol) : 
  Total rows in the list is 2165981000 which is larger than the maximum number of rows, currently 2147483647
Calls: pcrelate ... pcrelate -> .local -> .pcrelate -> rbind -> rbind -> rbindlist
In addition: Warning message:
In .pcrelate(gdsobj, pcs = pcs, scale = scale, ibd.probs = ibd.probs,  :
  small.samp.correct can only be used when all samples are analyzed in one block and `scale != none`
Execution halted
run_PCair_PCrelate_1st_iter_V1_largemem.R failed

It appears to hit the maximum row size (2^31-1) for a R data.table. Does this mean we cannot run pcrelate for more than ~65K samples? Any advice or insights? Thank you!

smgogarten commented 3 years ago

@GraceSheng the pcrelate issue belongs on the GENESIS page. (Unlike pcair, pcrelate does not use SNPRelate functions.)

GraceSheng commented 3 years ago

Thank you Stephanie! I opened a new issue on the GENESIS page https://github.com/UW-GAC/GENESIS/issues/70