zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate

all SNPs falsely labelled as monomorphic #37

Open webbchen opened 6 years ago

webbchen commented 6 years ago

Dear Xiuwen Zheng,

I've imported a vcf file, using snpgdsVCF2GDS, with SNPs for 10 samples, which looks like this:

(lots of comment lines)

(...)

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample08 sample13 sample36 sample50 sample65 sampleSTR sampleIPO323 sample02 sample56 sample72
1 125 fungus_MG2_Chr_1_125_C_A C A 143.4 PASS . GT . . . . . . 1 . . .
1 5392 fungus_MG2_Chr_1_5392_T_C T C 284.8 PASS . GT . . . 1 . . . . . .
1 5619 fungus_MG2_Chr_1_5619_G_T G T 438.7 PASS . GT 1 . . 1 . . . . . .
1 5868 fungus_MG2_Chr_1_5868_T_C T C 843.1 PASS . GT 1 . . 1 . . . 1 . .

I imported it with snpgdsVCF2GDS, opened it with snpgdsOpen and ran snpgdsLDpruning and snpgdsPCA on the dataset. When importing, I tried both methods, biallelic.only and copy.num.of.ref. The import seems to work fine, but both the LD pruning and the PCA exclude all SNPs for being monomorphic, which they are not. I assume something in the imported VCF file is wrongly formatted, causing that behaviour. Do you know what it could be?

Kind regards,

Anne Webb
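One thing that can be checked without SNPRelate: in the four records quoted above, every called genotype is 1 and every other entry is `.`, which the VCF specification defines as a missing call. Assuming the importer treats `.` as missing rather than as the reference allele, each site would have only one observed allele and could therefore be counted as monomorphic. The base R sketch below tallies the distinct called alleles per record. The helper name count_called_alleles is made up for illustration, and this is a diagnostic idea, not a confirmed explanation of the issue.

```r
# The four records quoted in the post, as whitespace-separated text.
vcf_lines <- c(
  "1 125 fungus_MG2_Chr_1_125_C_A C A 143.4 PASS . GT . . . . . . 1 . . .",
  "1 5392 fungus_MG2_Chr_1_5392_T_C T C 284.8 PASS . GT . . . 1 . . . . . .",
  "1 5619 fungus_MG2_Chr_1_5619_G_T G T 438.7 PASS . GT 1 . . 1 . . . . . .",
  "1 5868 fungus_MG2_Chr_1_5868_T_C T C 843.1 PASS . GT 1 . . 1 . . . 1 . ."
)

# Hypothetical helper (not part of SNPRelate or SeqArray):
# tabulate the called alleles of a single VCF record.
count_called_alleles <- function(line) {
  fields <- strsplit(line, "[ \t]+")[[1]]
  gt <- fields[10:length(fields)]           # sample columns follow the 9 fixed VCF columns
  gt <- sub(":.*$", "", gt)                 # keep only the GT subfield
  alleles <- unlist(strsplit(gt, "[/|]"))   # split diploid calls such as 0/1 or 0|1
  alleles <- alleles[alleles != "."]        # "." is a missing call in VCF
  table(alleles)
}

# Number of distinct alleles actually observed at each site.
n_distinct <- vapply(
  vcf_lines,
  function(l) length(count_called_alleles(l)),
  integer(1),
  USE.NAMES = FALSE
)
print(n_distinct)
```

A value of 1 for a record means that site shows no variation among its called genotypes. Whether `.` was meant to stand for a reference call depends on how the VCF was produced, which the thread does not say.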

zhengxwen commented 6 years ago

Not sure where the problem is. Could you please try SeqArray, which is the extended version of SNP GDS? You could send me your VCF file, if it is not confidential.

timedreamer commented 6 years ago

Just found I have the same problem #45. I attached some lines for demonstration. Let me know if you need a test file. Thanks.

timedreamer commented 6 years ago

I just tested with SeqArray and still have the same problem.

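# Convert the VCF to a SeqArray GDS file, open it, and compute the dissimilarity matrix
library(SeqArray)
library(SNPRelate)
seqVCF2GDS("merge_11.vcf.gz", "tmp.gds", storage.option="ZIP_RA")
genofile <- seqOpen("tmp.gds")
dissMatrix <- snpgdsDiss(genofile, sample.id=NULL, snp.id=NULL, autosome.only=FALSE, remove.monosnp=FALSE, maf=NaN, missing.rate=NaN, num.thread=2, verbose=TRUE)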

The output still reports no SNPs:

Individual dissimilarity analysis on genotypes:
Calculating allele counts/frequencies ...
Excluding 2,068,267 SNVs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 11 samples, 0 SNV
    using 2 (CPU) cores
Error in .InitFile2(cmd = "Individual dissimilarity analysis on genotypes:",  : 
  There is no SNP!