alexisregelson commented 1 year ago

Hello, I am trying to use seqVCF2GDS and am getting the following error:

library(SeqArray)
library(data.table)

seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds", parallel=6L)
Mon Nov 6 16:09:06 2023
Variant Call Format (VCF) Import:
    file(s):
        r4_PASS_chr1_updated_varID_dups_drop_updated_IDs_nhw_hwe6_noNHWrelateds_high_mod_impact.vcf (198.8M)
    file format: VCFv4.2
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 14,306
    genotype storage: bit2
    compression method: LZMA_RA

# of samples: 14306

calculating the total number of variants ...
the total number of variants for import: 3,632
Writing to 6 files:
    r4_chr1_high_mod_tmp01_ad336f56fc72 [1..606]
    r4_chr1_high_mod_tmp02_ad3315e862b7 [607..1,212]
    r4_chr1_high_mod_tmp03_ad33613818b1 [1,213..1,818]
    r4_chr1_high_mod_tmp04_ad33473817c6 [1,819..2,424]
    r4_chr1_high_mod_tmp05_ad334e0fea8c [2,425..3,030]
    r4_chr1_high_mod_tmp06_ad33607634f8 [3,031..3,632]
Done (Mon Nov  6 16:09:10 2023).

Output: r4_chr1_high_mod.gds
Merging:
    opening 'r4_chr1_high_mod_tmp01_ad336f56fc72' ... [done]
    opening 'r4_chr1_high_mod_tmp02_ad3315e862b7' ... [done]
    opening 'r4_chr1_high_mod_tmp03_ad33613818b1' ... [done]
    opening 'r4_chr1_high_mod_tmp04_ad33473817c6' ... [done]
    opening 'r4_chr1_high_mod_tmp05_ad334e0fea8c' ... [done]
    opening 'r4_chr1_high_mod_tmp06_ad33607634f8' ... [done]
Digests:
    sample.id
Error: segfault from C stack overflow

Do the sample IDs need to be in a particular format? I created my VCF with plink using the double-id option. The IDs are in the format A-[Cohort]-[A#####]. A .gds file is output, but I don't know whether it is incorrect because of the segfault.

gds <- seqOpen("r4_chr1_high_mod.gds")
gds
Object of class "SeqVarGDSClass"
File: r4_chr1_high_mod.gds (294.4K)
+    [  ]
|--+ description   [  ]
|--+ sample.id   { Str8 14306 LZMA_ra(2.94%), 12.6K }
|--+ variant.id   { Int32 3632 LZMA_ra(12.7%), 1.8K }
|--+ position   { Int32 3632 LZMA_ra(62.3%), 8.8K }
|--+ chromosome   { Str8 3632 LZMA_ra(1.62%), 125B }
|--+ allele   { Str8 3632 LZMA_ra(24.4%), 4.0K }
|--+ genotype   [  ]
|  |--+ data   { Bit2 2x14306x3632 LZMA_ra(0.95%), 242.2K }
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B }
|  \--+ extra   { Int16 0 LZMA_ra, 18B }
|--+ phase   [  ]
|  |--+ data   { Bit1 14306x3632 LZMA_ra(0.02%), 1.3K }
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B }
|  \--+ extra   { Bit1 0 LZMA_ra, 18B }
|--+ annotation   [  ]
|  |--+ id   { Str8 3632 LZMA_ra(28.1%), 16.0K }
|  |--+ qual   { Float32 3632 LZMA_ra(0.92%), 141B }
|  |--+ filter   { Int32 3632 LZMA_ra(0.92%), 141B }
|  |--+ info   [  ]
|  |  \--+ PR   { Bit1 3632 LZMA_ra(18.9%), 93B }
|  \--+ format   [  ]
\--+ sample.annotation   [  ]

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /cvmfs/priv.accre.vanderbilt.edu/mirror/optimized/sandy_bridge/easybuild/software/MPI/intel/2019.1.144/impi/2018.4.274/R/3.6.0/lib64/R/lib/libR.so
LAPACK: /cvmfs/priv.accre.vanderbilt.edu/mirror/optimized/sandy_bridge/easybuild/software/MPI/intel/2019.1.144/impi/2018.4.274/R/3.6.0/lib64/R/modules/lapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.14.8 SeqArray_1.26.2 gdsfmt_1.22.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.32.0 compiler_3.6.0 IRanges_2.20.2
[4] XVector_0.26.0 parallel_3.6.0 GenomicRanges_1.38.0
[7] GenomeInfoDbData_1.2.2 RCurl_1.95-4.12 Biostrings_2.54.0
[10] S4Vectors_0.24.4 BiocGenerics_0.32.0 GenomeInfoDb_1.22.1
[13] bitops_1.0-6 stats4_3.6.0

Thank you, Alexis

zhengxwen commented 1 year ago

See the line "the total number of variants for import: 3,632". This number is too small for parallel=6L to help at all. I suspect parallel=6L triggers a bug when merging the data files if the number of variants is too small. Try a single process instead:

seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds", parallel=1)

It might solve your problem.
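To address the question of whether the resulting file can be trusted, here is a minimal sketch of a single-process conversion followed by a basic sanity check. It assumes SeqArray is installed and uses the example VCF shipped with the package (via `seqExampleFileName("vcf")`) in place of the real file; the variable names and the temporary output path are illustrative only.

```r
# Sketch: single-process conversion followed by a basic sanity check.
library(SeqArray)

# Assumption: the example VCF shipped with SeqArray stands in for the real file.
vcf_fn <- seqExampleFileName("vcf")
gds_fn <- tempfile(fileext = ".gds")

# parallel = 1 uses a single process, so no temporary files need to be merged
seqVCF2GDS(vcf_fn, gds_fn, parallel = 1, verbose = FALSE)

# Read back the sample IDs and the number of variants from the GDS file
gds <- seqOpen(gds_fn)
gds_samples <- seqGetData(gds, "sample.id")
n_variants <- length(seqGetData(gds, "variant.id"))
seqClose(gds)

# Compare against the sample IDs listed in the VCF header
vcf_samples <- seqVCF_SampID(vcf_fn)
stopifnot(identical(as.character(gds_samples), as.character(vcf_samples)))
cat("samples:", length(gds_samples), " variants:", n_variants, "\n")
```

If the sample IDs match the VCF header and the variant count agrees with the "total number of variants for import" line in the log, the basic structure of the file is probably intact, although this does not verify the genotype data itself.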

alexisregelson commented 10 months ago

Hello,

I've now tried this with a VCF containing 200k+ variants. I have successfully converted this VCF to a GDS file using SNPRelate. However, I am using other software that specifically needs the GDS file in SeqArray format, not SNPRelate format. I am still getting the same error from seqVCF2GDS: sample.idError: segfault from C stack overflow.

Alexis

zhengxwen commented 10 months ago

Your R and gdsfmt versions are old. The recent updates were made with a focus on R (>= v4.0). I suggest using the SeqArray GDS format instead of the SNPRelate GDS format.
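A quick way to see where an installation stands is sketched below. Only the R >= 4.0 threshold comes from this thread; the variable names are illustrative, and no minimum package versions are assumed, so the installed versions are simply reported.

```r
# Sketch: check the running R version and report installed package versions.
# Assumption: only the R >= 4.0 threshold is taken from the thread.
min_r <- "4.0.0"
r_ok <- getRversion() >= min_r
cat("R", as.character(getRversion()),
    if (r_ok) "meets" else "is below", "the", min_r, "threshold\n")

# Report installed versions without attaching the packages (NA if missing)
pkgs <- c("gdsfmt", "SeqArray")
installed <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))
print(installed)

# After upgrading R, the Bioconductor packages can be reinstalled (not run here):
# BiocManager::install(c("gdsfmt", "SeqArray"))
```

On a shared cluster such as the one in the sessionInfo() output above, a newer R may be available as a separate module rather than through an in-place upgrade.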

zhengxwen / SeqArray

Sample.idError with seqVCF2GDS #87
