zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray

Sample.id Error with seqVCF2GDS #87

Open alexisregelson opened 9 months ago

alexisregelson commented 9 months ago

Hello, I am trying to use seqVCF2GDS and am getting the following error:

library(SeqArray)
library(data.table)

seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds", parallel=6L)
Mon Nov 6 16:09:06 2023
Variant Call Format (VCF) Import:
    file(s):
        r4_PASS_chr1_updated_varID_dups_drop_updated_IDs_nhw_hwe6_noNHWrelateds_high_mod_impact.vcf (198.8M)
    file format: VCFv4.2
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 14,306
    genotype storage: bit2
    compression method: LZMA_RA
    # of samples: 14306

calculating the total number of variants ...
the total number of variants for import: 3,632
Writing to 6 files:
    r4_chr1_high_mod_tmp01_ad336f56fc72 [1..606]
    r4_chr1_high_mod_tmp02_ad3315e862b7 [607..1,212]
    r4_chr1_high_mod_tmp03_ad33613818b1 [1,213..1,818]
    r4_chr1_high_mod_tmp04_ad33473817c6 [1,819..2,424]
    r4_chr1_high_mod_tmp05_ad334e0fea8c [2,425..3,030]
    r4_chr1_high_mod_tmp06_ad33607634f8 [3,031..3,632]
Done (Mon Nov  6 16:09:10 2023).

Output: r4_chr1_high_mod.gds
Merging:
    opening 'r4_chr1_high_mod_tmp01_ad336f56fc72' ... [done]
    opening 'r4_chr1_high_mod_tmp02_ad3315e862b7' ... [done]
    opening 'r4_chr1_high_mod_tmp03_ad33613818b1' ... [done]
    opening 'r4_chr1_high_mod_tmp04_ad33473817c6' ... [done]
    opening 'r4_chr1_high_mod_tmp05_ad334e0fea8c' ... [done]
    opening 'r4_chr1_high_mod_tmp06_ad33607634f8' ... [done]
Digests:
    sample.id
Error: segfault from C stack overflow

Do the sample IDs need to be in a particular format? I created my VCF with PLINK and used the double-id option. IDs are in the format: A-[Cohort]-[A#####]. A .gds file is output, but I don't know whether it is incorrect because of the segfault.

gds <- seqOpen("r4_chr1_high_mod.gds")
gds
Object of class "SeqVarGDSClass"
File: r4_chr1_high_mod.gds (294.4K)
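One way to judge whether the .gds file written before the segfault is usable is to compare its contents with the source VCF. The following is only a sketch: `compare_sample_ids` is a hypothetical helper (not part of SeqArray), `vcf_fn` is a placeholder path, and the SeqArray calls used (`seqOpen`, `seqVCF_SampID`, `seqGetData`, `seqClose`) are assumed to behave the same way in the installed version. Matching IDs and counts do not prove the file is complete, since the crash happened during the digest step.

```r
# Hypothetical helper (not part of SeqArray): compare the sample IDs read
# from the VCF header with the sample IDs stored in the GDS file.
compare_sample_ids <- function(vcf_ids, gds_ids) {
    vcf_ids <- as.character(vcf_ids)
    gds_ids <- as.character(gds_ids)
    list(
        same_length       = length(vcf_ids) == length(gds_ids),
        identical         = identical(vcf_ids, gds_ids),
        missing_in_gds    = setdiff(vcf_ids, gds_ids),
        duplicated_in_vcf = unique(vcf_ids[duplicated(vcf_ids)])
    )
}

vcf_fn <- "input.vcf"             # placeholder: path to the source VCF
gds_fn <- "r4_chr1_high_mod.gds"  # file name taken from the report above

# Only run the SeqArray part when the package and both files are available.
if (requireNamespace("SeqArray", quietly = TRUE) &&
    file.exists(vcf_fn) && file.exists(gds_fn)) {
    gds <- SeqArray::seqOpen(gds_fn)
    vcf_ids <- SeqArray::seqVCF_SampID(vcf_fn)
    gds_ids <- SeqArray::seqGetData(gds, "sample.id")
    print(compare_sample_ids(vcf_ids, gds_ids))
    # The import log reported 3,632 variants; the GDS should hold as many.
    print(length(SeqArray::seqGetData(gds, "variant.id")))
    SeqArray::seqClose(gds)
}
```

If `missing_in_gds` is non-empty or the variant count differs from the import log, the file should probably be regenerated rather than used.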

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /cvmfs/priv.accre.vanderbilt.edu/mirror/optimized/sandy_bridge/easybuild/software/MPI/intel/2019.1.144/impi/2018.4.274/R/3.6.0/lib64/R/lib/libR.so
LAPACK: /cvmfs/priv.accre.vanderbilt.edu/mirror/optimized/sandy_bridge/easybuild/software/MPI/intel/2019.1.144/impi/2018.4.274/R/3.6.0/lib64/R/modules/lapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8        LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8         LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8     LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8        LC_NAME=C
 [9] LC_ADDRESS=C                LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8  LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.14.8 SeqArray_1.26.2 gdsfmt_1.22.0

loaded via a namespace (and not attached):
 [1] zlibbioc_1.32.0        compiler_3.6.0       IRanges_2.20.2
 [4] XVector_0.26.0         parallel_3.6.0       GenomicRanges_1.38.0
 [7] GenomeInfoDbData_1.2.2 RCurl_1.95-4.12      Biostrings_2.54.0
[10] S4Vectors_0.24.4       BiocGenerics_0.32.0  GenomeInfoDb_1.22.1
[13] bitops_1.0-6           stats4_3.6.0

Thank you, Alexis

zhengxwen commented 9 months ago

See:
    the total number of variants for import: 3,632
This number is too small for parallel=6L to help at all. I guess parallel=6L might trigger a bug when merging the data files if the number of variants is too small. Try:

seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds", parallel=1)

It might solve your problem.
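The idea of not parallelizing small imports can be sketched as a small helper. Everything here is an assumption for illustration: `choose_parallel` is a hypothetical function, the 50,000-variants-per-worker threshold is an arbitrary placeholder rather than a SeqArray recommendation, and `vcf_fn` is a placeholder path.

```r
# Hypothetical helper: pick a worker count from the number of variants.
# min_per_worker = 50000L is an arbitrary illustrative threshold.
choose_parallel <- function(n_variants, max_workers = 6L,
                            min_per_worker = 50000L) {
    workers <- as.integer(n_variants %/% min_per_worker)
    # Never go below one worker or above the requested maximum.
    max(1L, min(as.integer(max_workers), workers))
}

n_variants <- 3632L        # value reported in the import log above
vcf_fn <- "input.vcf"      # placeholder: path to the source VCF

if (requireNamespace("SeqArray", quietly = TRUE) && file.exists(vcf_fn)) {
    SeqArray::seqVCF2GDS(vcf_fn, "r4_chr1_high_mod.gds",
                         parallel = choose_parallel(n_variants))
}
```

With the 3,632 variants from the log this resolves to a single worker, which matches the `parallel=1` call suggested above.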

alexisregelson commented 7 months ago

Hello,

I've now tried this with a VCF with 200k+ variants. I have successfully converted this VCF to a GDS file using SNPRelate. However, I am using other software that specifically needs the GDS file in SeqArray format, not SNPRelate format. But I am still getting the same error: sample.idError: segfault from C stack overflow.

Alexis
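Since a SNPRelate GDS file already exists, one possible route that is not suggested anywhere in this thread is SeqArray's `seqSNP2GDS`, which converts a SNPRelate GDS file to SeqArray format. The sketch below is untested against this data and may hit the same segfault, because the conversion also computes digests by default. The wrapper `convert_snprelate_gds` and both file names are hypothetical.

```r
# Hypothetical wrapper around SeqArray::seqSNP2GDS.
# Returns NULL when SeqArray or the input file is unavailable,
# otherwise the path of the SeqArray-format output file.
convert_snprelate_gds <- function(snp_fn, out_fn) {
    if (!requireNamespace("SeqArray", quietly = TRUE) ||
        !file.exists(snp_fn)) {
        return(NULL)
    }
    # Convert a SNPRelate GDS file into a SeqArray GDS file.
    SeqArray::seqSNP2GDS(snp_fn, out_fn)
    out_fn
}

# Placeholder file names; replace with the real paths.
result <- convert_snprelate_gds("snprelate_format.gds", "seqarray_format.gds")
if (is.null(result)) {
    message("Conversion skipped: SeqArray or the input file is missing.")
}
```

A SNPRelate GDS file stores only genotypes and basic SNP annotation, so any INFO or FORMAT fields from the original VCF would not be present in the converted file.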

zhengxwen commented 7 months ago

Your R and gdsfmt versions are old. The recent updates were made with a focus on R (>= v4.0). I suggest using SeqArray GDS format instead of SNPRelate GDS.
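A quick way to see how far an installation is from the R (>= v4.0) target is to compare version strings. This is a minimal sketch: `meets_minimum` is a hypothetical helper, and only the R 4.0 threshold comes from the comment above; no minimum gdsfmt or SeqArray version is stated in this thread, so those are only printed.

```r
# Hypothetical helper: TRUE when `current` is at least `minimum`.
meets_minimum <- function(current, minimum) {
    package_version(current) >= package_version(minimum)
}

# R itself: the thread states that recent updates target R >= 4.0.
r_ok <- meets_minimum(as.character(getRversion()), "4.0.0")
message("R ", as.character(getRversion()),
        if (r_ok) " meets" else " is below", " the 4.0 target")

# Report installed versions of the relevant packages, if present.
for (pkg in c("gdsfmt", "SeqArray")) {
    if (requireNamespace(pkg, quietly = TRUE)) {
        message(pkg, " ", as.character(utils::packageVersion(pkg)))
    } else {
        message(pkg, " is not installed")
    }
}

# After upgrading R, the Bioconductor packages are typically reinstalled with:
# BiocManager::install(c("gdsfmt", "SeqArray"))
```

For the session reported in this issue (R 3.6.0), `meets_minimum("3.6.0", "4.0.0")` evaluates to `FALSE`.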