zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray

Segfault with seqVCF2GDS #18

Closed cheesemania closed 7 years ago

cheesemania commented 7 years ago

I'm running R 3.3.1 on Ubuntu, and previously installed SeqArray on a Mac (R 3.3.1 again) with no issues.

I'm getting a segfault with the seqVCF2GDS function. I've reinstalled R and all dependencies, with no change in the outcome (I have tried both versions 1.14 and 1.15 of SeqArray, with 1.10 and 1.11 of gdsfmt). Output pasted below; any ideas what may be going wrong here?

seqVCF2GDS("Snps.hapcall_recal_SNPs_core_sel.vcf.gz","out.gds")
Wed Oct 26 17:08:36 2016
Variant Call Format (VCF) Import:
    file(s):
        Snps.hapcall_recal_SNPs_core_sel.vcf.gz (2.1M)
    file format: unknown
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 138
    genotype storage: bit2
    compression method: ZIP_RA
    variable id in the FORMAT field should be defined ahead, and the undefined id is/are ignored during the conversion.
Output: out.gds
Error in (function (node, name, val = NULL, storage = storage.mode(val), : Stream read error

*** caught segfault ***
address 0x80, cause 'memory not mapped'

zhengxwen commented 7 years ago

Note that in your file:

file format: unknown

It seems that the header of your VCF file does not declare a file format. SeqArray works with VCFv4.0 or later.

A possible solution is to edit the VCF file and add a standard header as defined in VCFv4.0.
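A quick way to confirm what the importer will see is to read the first line of the (possibly gzipped) VCF and look for the `##fileformat=` declaration. The following is a minimal base-R sketch; `vcf_file_format` is a hypothetical helper name, not part of SeqArray.

```r
# Hypothetical helper (not part of SeqArray): return the value of the
# "##fileformat=" line at the top of a VCF file, or NA if it is absent.
vcf_file_format <- function(path) {
  con <- gzfile(path, open = "rt")   # gzfile() also reads uncompressed text
  on.exit(close(con))
  first <- readLines(con, n = 1L, warn = FALSE)
  if (length(first) == 1L && grepl("^##fileformat=", first)) {
    sub("^##fileformat=", "", first)
  } else {
    NA_character_
  }
}
```

Calling `vcf_file_format("Snps.hapcall_recal_SNPs_core_sel.vcf.gz")` on the file from the report would be expected to return `NA` if the header line is missing, which would match the "file format: unknown" message above.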

cheesemania commented 7 years ago

I tried a new file (v4.2 vcf) and have the same problem.

library(SeqArray)
Loading required package: gdsfmt
seqVCF2GDS("Snps.hapcall_recal_SNPs.vcf.gz","out.gds")
Thu Oct 27 09:21:06 2016
Variant Call Format (VCF) Import:
    file(s):
        Snps.hapcall_recal_SNPs.vcf.gz (3.5M)
    file format: VCFv4.2
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 222
    genotype storage: bit2
    compression method: ZIP_RA
Output: out.gds
Error in (function (node, name, val = NULL, storage = storage.mode(val), : Stream read error

*** caught segfault ***
address 0x80, cause 'memory not mapped'

Traceback:
1: closefn.gds(gfile)
2: seqVCF2GDS("Snps.hapcall_recal_SNPs.vcf.gz", "out.gds")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

zhengxwen commented 7 years ago

Show me sessionInfo() please.

cheesemania commented 7 years ago

library(SeqArray)
Loading required package: gdsfmt
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] SeqArray_1.11.18 gdsfmt_1.7.17

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.36.0 XVector_0.14.0
[3] GenomicAlignments_1.10.0 GenomicRanges_1.26.1
[5] BiocGenerics_0.20.0 zlibbioc_1.20.0
[7] IRanges_2.8.0 BiocParallel_1.8.0
[9] BSgenome_1.42.0 lattice_0.20-33
[11] GenomeInfoDb_1.10.0 tools_3.3.0
[13] SummarizedExperiment_1.4.0 parallel_3.3.0
[15] grid_3.3.0 Biobase_2.34.0
[17] DBI_0.5-1 Matrix_1.2-6
[19] rtracklayer_1.34.0 S4Vectors_0.12.0
[21] bitops_1.0-6 RCurl_1.95-4.8
[23] biomaRt_2.30.0 RSQLite_1.0.0
[25] GenomicFeatures_1.26.0 Biostrings_2.42.0
[27] Rsamtools_1.26.1 stats4_3.3.0
[29] XML_3.98-1.4 VariantAnnotation_1.20.0

vcf.fn <- seqExampleFileName("vcf")
# conversion
seqVCF2GDS(vcf.fn, "tmp.gds")
Fri Oct 28 09:23:38 2016
The Variant Call Format (VCF) header:
    file format: VCFv4.0
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 90
    GDS genotype storage: bit2
Error in (function (node, name, val = NULL, storage = storage.mode(val), : Stream read error

*** caught segfault ***
address 0x80, cause 'memory not mapped'

Traceback:
1: closefn.gds(gfile)
2: seqVCF2GDS(vcf.fn, "tmp.gds")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

zhengxwen commented 7 years ago

See the session info:

R version 3.3.0 (2016-05-03)

other attached packages:
[1] SeqArray_1.11.18 gdsfmt_1.7.17

If you have difficulty installing the latest versions of gdsfmt and SeqArray in R 3.3.0 via biocLite, please install the packages from GitHub:

library("devtools")
install_github("zhengxwen/gdsfmt")
install_github("zhengxwen/SeqArray")

Alternatively, you could send your VCF file to me at zhengxwen@gmail.com
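Before reinstalling, it can help to check whether an installed package is older than the version you expect. This is a minimal base-R sketch; `is_outdated` is a hypothetical helper, and the threshold shown is only a placeholder based on the gdsfmt 1.10 series mentioned in the original report.

```r
# Hypothetical helper: TRUE if `pkg` is missing or older than `minimum`.
# system.file() is used so the package is located without being loaded.
is_outdated <- function(pkg, minimum) {
  if (!nzchar(system.file(package = pkg))) return(TRUE)
  packageVersion(pkg) < package_version(minimum)
}

# Placeholder threshold; adjust to the release you intend to run.
is_outdated("gdsfmt", "1.10.0")
```

With the session shown above (gdsfmt_1.7.17), this check would be expected to flag gdsfmt as outdated relative to 1.10.0.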

cheesemania commented 7 years ago

I'm happy to send a vcf if needed. However, I updated the packages and ran the test scripts with the same issues arising. Any ideas what may be up here?

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] SeqArray_1.13.6 gdsfmt_1.8.3 devtools_1.12.0

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.36.0 XVector_0.14.0
[3] GenomicAlignments_1.10.0 GenomicRanges_1.26.1
[5] BiocGenerics_0.20.0 zlibbioc_1.20.0
[7] IRanges_2.8.0 BiocParallel_1.8.0
[9] BSgenome_1.42.0 lattice_0.20-33
[11] R6_2.2.0 httr_1.2.1
[13] GenomeInfoDb_1.10.0 tools_3.3.0
[15] SummarizedExperiment_1.4.0 parallel_3.3.0
[17] grid_3.3.0 Biobase_2.34.0
[19] DBI_0.5-1 git2r_0.15.0
[21] withr_1.0.2 digest_0.6.10
[23] Matrix_1.2-6 rtracklayer_1.34.0
[25] S4Vectors_0.12.0 bitops_1.0-6
[27] biomaRt_2.30.0 RCurl_1.95-4.8
[29] curl_2.2 RSQLite_1.0.0
[31] memoise_1.0.0 BiocInstaller_1.24.0
[33] GenomicFeatures_1.26.0 Biostrings_2.42.0
[35] Rsamtools_1.26.1 XML_3.98-1.4
[37] stats4_3.3.0 VariantAnnotation_1.20.0

vcf.fn <- seqExampleFileName("vcf")
seqVCF2GDS(vcf.fn, "tmp.gds")
Wed Nov 2 09:34:05 2016
Variant Call Format (VCF) Import:
    file(s):
        CEU_Exon.vcf.gz (226.0K)
    file format: VCFv4.0
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 90
    genotype storage: bit2
    compression method: ZIP_RA
Output: tmp.gds
Error in (function (node, name, val = NULL, storage = storage.mode(val), : Stream read error

*** caught segfault ***
address 0x80, cause 'memory not mapped'

Traceback:
1: closefn.gds(gfile)
2: seqVCF2GDS(vcf.fn, "tmp.gds")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3

zhengxwen commented 7 years ago

I cannot reproduce the error using a virtual machine with Ubuntu 16.04.1 LTS and R 3.3.0.

Please show me which C/C++ compiler you are using (gcc/g++):
g++ -v
Are you able to run the following?
R CMD check gdsfmt_1.10.0.tar.gz

gdsfmt_1.10.0.tar.gz can be downloaded from: http://www.bioconductor.org/packages/release/bioc/src/contrib/gdsfmt_1.10.0.tar.gz
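Because a segfault takes down the whole R session, one option when retesting is to run the conversion in a separate `Rscript` process and inspect its exit status. This is a sketch under that assumption; `run_isolated` is a hypothetical helper, and the commented example assumes SeqArray is installed.

```r
# Hypothetical helper: evaluate R code in a fresh Rscript process and
# return its exit status (0 means the child process finished normally).
run_isolated <- function(code) {
  rscript <- file.path(R.home("bin"), "Rscript")
  system2(rscript, c("-e", shQuote(code)))
}

# Example (assumes SeqArray is installed):
# run_isolated('library(SeqArray); seqVCF2GDS(seqExampleFileName("vcf"), tempfile(fileext = ".gds"))')
```

A non-zero return value indicates that the child process did not exit cleanly, while the calling session stays alive either way.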

cheesemania commented 7 years ago

Success!!!!!

I ran the check command, installed RUnit and knitr and we are up and running!

Thanks so much for persevering with this problem; it's very much appreciated.

R CMD check gdsfmt_1.11.0.tar.gz

VignetteBuilder package required for checking but not installed: ‘knitr’

The suggested packages are required for a complete check. Checking can be attempted without them by setting the environment variable _R_CHECK_FORCESUGGESTS_ to a false value.

See section ‘The DESCRIPTION file’ in the ‘Writing R Extensions’ manual.

Status: 1 ERROR
See ‘/home/ian/Downloads/gdsfmt.Rcheck/00check.log’ for details.
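The check output above stops because packages needed for a complete check were absent. A small base-R sketch for listing which of a set of packages are not installed follows; `missing_packages` is a hypothetical helper, and `knitr` and `RUnit` are the two packages named in this comment.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

to_install <- missing_packages(c("knitr", "RUnit"))
# install.packages(to_install)   # uncomment to install from CRAN
```

If `to_install` is empty, both packages are already available and the check can be rerun as is.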