zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate

snpgdsVCF2GDS crashes on approx. line 27K #5

Closed dereneaton closed 9 years ago

dereneaton commented 10 years ago

I have tried to read in three different VCF files using the function snpgdsVCF2GDS(). Each crashes with the error below, and says the error is on approximately line 27K of the file. Each file is from a completely different data set, so I see no reason there should be a similar error around line 27K in each file, nor do I see any aberration on the indicated line. Moreover, if I read in only the first 26K lines from any of the files they work fine.

These data files represent a collection of RADseq loci where each locus is labeled as a separate chromosome. Perhaps the large number of chromosomes (>10K) is what crashes the function? Again, the formatting doesn't seem to be an issue for the first 26K lines, so I assume the problem is arising in R or SNPRelate only as the data gets larger.
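The observation that the first 26K lines import cleanly can be reproduced by writing a truncated copy of the VCF and converting that instead. The following is a minimal sketch in base R; the helper name `write_head` is hypothetical and not part of SNPRelate, and the demonstration uses a temporary placeholder file rather than a real VCF.

```r
# Hypothetical helper (not part of SNPRelate): copy the first `n` lines
# of a text file to a new file and return the number of lines written.
write_head <- function(infile, outfile, n) {
  lines <- readLines(infile, n = n)
  writeLines(lines, outfile)
  invisible(length(lines))
}

# Self-contained demonstration on a temporary 100-line placeholder file.
src <- tempfile(fileext = ".vcf")
writeLines(sprintf("line%d", 1:100), src)
dst <- tempfile(fileext = ".vcf")
n_written <- write_head(src, dst, 26)
```

With a real VCF, the truncated file could then be passed to snpgdsVCF2GDS() in place of the full file. Since the VCF header sits at the top, it is kept as long as `n` is larger than the number of header lines.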

Below is an example from one of the data sets:

## Read in VCF
vcffile <- "/home/deren/Documents/Oaks/Virentes/analysis_pyrad/outfiles/virentes_c85d6m20p5noutg.vcf"

## Reformat and write to GDS format 
snpgdsVCF2GDS(vcffile, "test.gds", method="biallelic.only")

FILE: /home/deren/Documents/Oaks/Virentes/analysis_pyrad/outfiles/virentes_c85d6m20p5noutg.vcf LINE: 26733, COLUMN: 7, PASS Invalid Zip Deflate Stream operation 'Seek'!

Please let me know if perhaps I should change the format of the files, or if there is an easy fix. Thanks.

zhengxwen commented 10 years ago

If you want an immediate solution, please try the function snpgdsVCF2GDS in an earlier version of SNPRelate (v0.9.19, available on R-Forge):

install.packages("SNPRelate", repos="http://R-Forge.R-project.org")

We have observed that snpgdsVCF2GDS might randomly exit during the import of large files. We are trying to resolve the problem before the BioC 3.0 release.

dagola commented 10 years ago

I experience the same issue when trying to convert a PLINK binary PED file to GDS. The BED consists of ~14 000 samples and 5 433 184 markers; its size is ~18.8 GB. The output of snpgdsBED2GDS is:

Start snpgdsBED2GDS ...
        BED file: "GerMIFS.bed" in the SNP-major mode (Sample X SNP)
        FAM file: "GerMIFS.fam", DONE.
        BIM file: "GerMIFS.bim", DONE.
Error in add.gdsn(gfile, "snp.id", snp.id, compress = compress.annotation,  : 
  Invalid Zip Deflate Stream operation 'Seek'!

Currently, I'm using SNPRelate 0.99.1. Please let me know if downgrading will help here, too.

Many thanks!

zhengxwen commented 10 years ago

Please also downgrade the package gdsfmt from v1.1.0 to v1.0.4, run:

install.packages("gdsfmt", repos="http://bioconductor.org/packages/release/extra")

The error Invalid Zip Deflate Stream operation 'Seek'! should come from gdsfmt.
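Since the fix depends on which versions of the two packages are installed, it may help to check them before and after downgrading. This is a small sketch using only base R and utils; the helper name `installed_versions` is hypothetical, and it returns NA for a package that is not installed.

```r
# Hypothetical helper: report the installed version of each package,
# or NA if the package is not installed.
installed_versions <- function(pkgs = c("SNPRelate", "gdsfmt")) {
  vapply(pkgs, function(pkg) {
    if (requireNamespace(pkg, quietly = TRUE)) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))
}

versions <- installed_versions()
print(versions)
```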

dagola commented 10 years ago

Thanks for your quick reply. Your suggested solution worked for me. Alternatively, setting compress.annotation = "" in snpgdsBED2GDS in version 0.99.1 works, too.
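For reference, the compress.annotation = "" workaround could look roughly like the sketch below. The wrapper name `convert_bed_uncompressed` and the output file name "GerMIFS.gds" are assumptions made for illustration. The argument names follow the snpgdsBED2GDS signature as documented in current SNPRelate releases and are assumed to match version 0.99.1.

```r
# Hypothetical wrapper: convert PLINK BED/FAM/BIM files to GDS with
# annotation compression disabled, so the annotation nodes are written
# without going through the zip deflate stream.
convert_bed_uncompressed <- function(bed, fam, bim, out) {
  SNPRelate::snpgdsBED2GDS(
    bed.fn = bed, fam.fn = fam, bim.fn = bim,
    out.gdsfn = out,
    compress.annotation = ""  # empty string: no compression
  )
}

# Run only if SNPRelate is installed and the input files are present.
inputs <- c("GerMIFS.bed", "GerMIFS.fam", "GerMIFS.bim")
if (requireNamespace("SNPRelate", quietly = TRUE) && all(file.exists(inputs))) {
  # The output file name is an assumption.
  convert_bed_uncompressed(inputs[1], inputs[2], inputs[3], "GerMIFS.gds")
}
```

The trade-off is a larger GDS file, because the SNP and sample annotation is stored uncompressed.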

zhengxwen commented 9 years ago

Please update gdsfmt to v1.1.1.1