zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate
101 stars 25 forks source link

Fails on large vcf #10

Closed Gibbsdavidl closed 9 years ago

Gibbsdavidl commented 10 years ago

Version (v0.9.19) was able to convert this file.

But with 1.1.1 I'm getting the error:

snpgdsVCF2GDS(vcf.fn, out2.fn, method="biallelic.only") VCF format ---> GDS SNP format: Parsing "MAF01_merged_sorted_filteredVCF_v5.vcf.gz" ... Error: FILE: MAF01_merged_sorted_filteredVCFv5.vcf.gz LINE: 3257158, COLUMN: 1986, 0/� Invalid integer conversion "�_".

Gibbsdavidl commented 10 years ago

I'm going to run it again, to see if the error occurs on the same line. If so, I'll parse out that line and see what that genotype is.

zhengxwen commented 10 years ago

Please try:

snpgdsVCF2GDS(vcf.fn, out2.fn, method="copy.num.of.ref")

When method="copy.num.of.ref", I have not seen any error.

Gibbsdavidl commented 10 years ago

OK,

I’ll give that a try. In the meantime, I tried to run it again … and it completed!

Here’s a question. In my VCF, the chromosomes are labeled “chr1”, “chr2”, and so on.

So, after searching around on the internet, I found this: snpgdsOption(chr1=1, chr2=2, chr3=3, chr4=4, chr5=5, chr6=6, chr7=7, … , chr22=22, chrX=23, chrY=24, chrM=25)

Then after converting the file,

File: MAF01_merged_sorted_filteredVCF_v5_ccm_2.gds

read.gdsn(index.gdsn(genofile, "snp.chromosome"))[1:10] [1] "chrM" "chrM" "chrM" "chrM" "chrM" "chrM" "chrM" "chrM" "chrM" "chrM"

snpset2 <- snpgdsLDpruning(genofile, num.thread=4, ld.threshold=0.2, autosome.only=F) SNP pruning based on LD: Removing 0 SNPs (monomorphic, < MAF, or > missing rate) Working space: 1977 samples, 6971608 SNPs Using 4 (CPU) cores Sliding window: 500000 basepairs, Inf SNPs |LD| threshold: 0.2 Chromosome chrM: 18.32%, 35/191 Chromosome chrX: 1.00%, 2555/256086 Chromosome chr1: 4.74%, 25253/533059 ….

Looks like doing that chromosome mapping is working. Is that the “right” way to do it? I was thinking that in the documentation for VCF to gds conversion, you might want to mention the chromosome mapping method.

THANKS SO MUCH!!!! :) -dave

On Nov 12, 2014, at 1:55 PM, Xiuwen Zheng notifications@github.com wrote:

Please try:

snpgdsVCF2GDS(vcf.fn, out2.fn, method="copy.num.of.ref")

When method="copy.num.of.ref", I have not seen any error.

— Reply to this email directly or view it on GitHub.

zhengxwen commented 9 years ago

Sorry for the delay.

The option snpgdsOption is an argument in the function snpgdsVCF2GDS_R, which is designed for the compatibility with old version of SNPRelate (<= 0.9.19). In the old version of SNPRelate, snp.chromosome should be an integer, so users have to define chromosome mapping, like snpgdsOption(chr1=1, chr2=2, chr3=3, chr4=4, chr5=5, chr6=6, chr7=7, … , chr22=22, chrX=23, chrY=24, chrM=25) for humans.

In the new version of SNPRelate (>=1.0.0), snp.chromosome could be character and chromosome mapping is not required. The function snpgdsVCF2GDS always uses characters to represent chromosomes, so that it is easy to import VCF files of other species (like dog, whale).

In your case, the chromosome coding has a prefix "chr". SNPRelate can not automatically convert "chr1" to "1", please specify ignore.chr.prefix="chr" in snpgdsVCF2GDS. So you do not have to use autosome.only=FALSE in the function snpgdsLDpruning or others, like snpgdsPCA.

Gibbsdavidl commented 9 years ago

Thanks! I'll update my scripts.

Seems to have worked well anyway. It was used on three groups of ~850 genomes!

-dave

On Dec 14, 2014, at 1:42 AM, Xiuwen Zheng notifications@github.com wrote:

Sorry for the delay.

The option snpgdsOption is an argument in the function snpgdsVCF2GDS_R, which is designed for the compatibility with old version of SNPRelate (<= 0.9.19). In the old version of SNPRelate, snp.chromosome should be an integer, so users have to define chromosome mapping, like snpgdsOption(chr1=1, chr2=2, chr3=3, chr4=4, chr5=5, chr6=6, chr7=7, … , chr22=22, chrX=23, chrY=24, chrM=25) for humans.

In the new version of SNPRelate (>=1.0.0), snp.chromosome could be character and chromosome mapping is not required. The function snpgdsVCF2GDS always uses characters to represent chromosomes, so that it is easy to import VCF files of other species (like dog, whale).

In your case, the chromosome coding has a prefix "chr". SNPRelate can not automatically convert "chr1" to "1", please specify ignore.chr.prefix="chr" in snpgdsVCF2GDS. So you do not have to use autosome.only=FALSE in the function snpgdsLDpruning or others, like snpgdsPCA.

— Reply to this email directly or view it on GitHub.