Closed dereneaton closed 9 years ago
If you want an immediate solution, please try the function snpgdsVCF2GDS in the older version of SNPRelate (v0.9.19, available on R-Forge):
install.packages("SNPRelate", repos="http://R-Forge.R-project.org")
We have observed that snpgdsVCF2GDS might randomly exit during the import of large files. We are trying to resolve the problem before the BioC 3.0 release.
I experience the same issue when trying to convert a PLINK binary PED file to GDS. The BED file contains ~14 000 samples and 5 433 184 markers; its size is ~18.8 GB. The output of snpgdsBED2GDS is:
Start snpgdsBED2GDS ...
BED file: "GerMIFS.bed" in the SNP-major mode (Sample X SNP)
FAM file: "GerMIFS.fam", DONE.
BIM file: "GerMIFS.bim", DONE.
Error in add.gdsn(gfile, "snp.id", snp.id, compress = compress.annotation, :
Invalid Zip Deflate Stream operation 'Seek'!
Currently, I'm using SNPRelate 0.99.1. Please let me know if downgrading will help here, too.
Many thanks!
Please also downgrade the package gdsfmt from v1.1.0 to v1.0.4 by running:
install.packages("gdsfmt", repos="http://bioconductor.org/packages/release/extra")
The error "Invalid Zip Deflate Stream operation 'Seek'!" should come from gdsfmt.
Thanks for your quick reply. Your suggested solution worked for me.
Nevertheless, setting compress.annotation = "" in snpgdsBED2GDS in version 0.99.1 works, too.
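For anyone landing here with the same error, the workaround above might look roughly like the sketch below. The wrapper name and the output file name are made up for illustration; the input file names are the ones from the log, and the only setting taken from this thread is compress.annotation = "".

```r
# Sketch: convert PLINK binary files to GDS with annotation compression disabled.
# `convert_bed_uncompressed` is a hypothetical helper, not part of SNPRelate.
convert_bed_uncompressed <- function(bed.fn, fam.fn, bim.fn, out.gdsfn) {
  SNPRelate::snpgdsBED2GDS(
    bed.fn, fam.fn, bim.fn, out.gdsfn,
    compress.annotation = ""  # store annotation nodes uncompressed
  )
}

# Example call (needs SNPRelate and the data files; output name is an assumption):
# convert_bed_uncompressed("GerMIFS.bed", "GerMIFS.fam", "GerMIFS.bim", "GerMIFS.gds")
```

The trade-off is presumably a somewhat larger GDS file, since the annotation nodes are written without compression.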
Please update gdsfmt to v1.1.1.1
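To confirm which versions are actually installed before and after updating, a small check along these lines should be enough. The helper name is hypothetical; it only relies on utils::packageVersion from base R.

```r
# Report the installed version of a package, or NA if it is not installed.
# `installed_version` is an illustrative helper, not part of gdsfmt or SNPRelate.
installed_version <- function(pkg) {
  tryCatch(
    as.character(utils::packageVersion(pkg)),
    error = function(e) NA_character_  # packageVersion() errors on missing packages
  )
}

# Check both packages discussed in this thread.
sapply(c("gdsfmt", "SNPRelate"), installed_version)
```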
I have tried to read in three different VCF files using the function snpgdsVCF2GDS(). Each crashes with the error below, and says the error is on approximately line 27K of the file. Each file is from a completely different data set, so I see no reason there should be a similar error around line 27K in each file, nor do I see any aberration on the indicated line. Moreover, if I read in only the first 26K lines from any of the files, they import fine.
These data files represent a collection of RADseq loci where each locus is labeled as a separate chromosome. Perhaps the large number of chromosomes (>10K) is what crashes the function? Again, the formatting doesn't seem to be an issue for the first 26K lines, so I assume the problem is arising in R or SNPRelate only as the data gets larger.
Below is an example from one of the data sets:
FILE: /home/deren/Documents/Oaks/Virentes/analysis_pyrad/outfiles/virentes_c85d6m20p5noutg.vcf
LINE: 26733, COLUMN: 7, PASS
Invalid Zip Deflate Stream operation 'Seek'!
Please let me know if perhaps I should change the format of the files, or if there is an easy fix. Thanks.
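The truncation test described above (importing only the first 26K lines) can be reproduced with a sketch like the following. The helper name, the default line count, and the output file names are assumptions for illustration; the VCF header lines count toward the limit.

```r
# Copy the first `n` lines of a VCF (header included) to a new file.
# `truncate_vcf` is a hypothetical helper for narrowing down where the import fails.
truncate_vcf <- function(vcf_in, vcf_out, n = 26000L) {
  lines <- readLines(vcf_in, n = n)  # reads at most n lines
  writeLines(lines, vcf_out)
  invisible(length(lines))           # number of lines actually written
}

# Example (needs SNPRelate and a real VCF; file names are placeholders):
# truncate_vcf("input.vcf", "subset.vcf")
# SNPRelate::snpgdsVCF2GDS("subset.vcf", "subset.gds")
```

Raising n step by step shows whether the failure depends on the amount of data imported rather than on the content of a particular line.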