xiaolei-lab / rMVP

:postbox: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool For Genome-Wide Association Study
Apache License 2.0
268 stars 69 forks source link

Missing genotype, PC, kinship data in MVP.Data output from .vcf #21

Closed naglemi closed 6 years ago

naglemi commented 6 years ago

I'm having trouble with my data in MVP that leads to outputs lacking all genotype information per individuals in the GWAS population.

My SNP data was in .tped/.tfam, so I converted it to .vcf with PLINK before loading this data into MVP.Data. Here's the first 10 items from the last line of my .vcf input: scaffold_724 20159 724_20159 2 1 . . PR GT 0/1

I used this command:

MVP.Data(fileVCF="my_prefix.vcf",
         # used grep in bash and found 1529 header lines with ##, confirmed with head (also bash)
         vcf.jump=1529,
         sep.vcf="\t",
         filePhe="my_pheno_data.txt",
         fileKin=TRUE,
         filePC=TRUE,
         out="my_out")

Everything appeared to be working fine, until I realized when the processing was nearly finished that the output files were only about 200MB combined (and my .vcf input is 28GB). Then, this happened:

... [1] "Number of Markers Written into File: 1: 8252000" [1] "Number of Markers Written into File: 1: 8252993" [1] "Preparation for numeric data is done!" [1] "Preparation for PHENOTYPE data is done!" [1] "Output mvp genotype..." [1] "Calculate KINSHIP using Vanraden method..." [1] "Preparation for Kinship matrix is done!" [1] "Principal Component Analysis Start ..." means for first 10 snps: [1] 0 0 0 0 0 0 0 0 0 0 [1] "Preparation for PC matrix is done!" [1] "MVP data prepration accomplished successfully!" Warning message: In big.PCA(bigMat = big.geno, pcs.to.keep = pcs.keep) : selected too many PCs to keep [5], changing to 2

> Covariates <- attach.big.matrix("my_prefix.pc.desc")
> head(Covariates)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> head(Kinship)
     [,1] [,2]
[1,]    0    0
[2,]    0    0
--

The genotype file (.map) contains only 3 columns and doesn't show the genotypes for each individual, just the Chr, SNP position and SNP ID.

Thanks for making this tool and for the help!

YinLiLin commented 6 years ago

Thanks for your using of MVP!

I suggest that you could transform your .tped/.tfam files to binary files instead of .vcf, then try MVP.Data again. If there is still something wrong, please let know!

Cheers!

XiaoleiLiuBio commented 6 years ago

Hi naglemi,

Thank you for using MVP. Please use dim() function in R to show me the dimension of the data generated by MVP.Data.

Please attention that .map is SNP information file, and the genotype file is suffixed with ".desc" and ".bin".

Best, Xiaolei

naglemi commented 6 years ago

Thanks for taking a look.

genotype <- attach.big.matrix("mvp.917geno_a4_fromvcf.geno.desc") dim(genotype) [1] 8252993 2 phenotype <- read.table("mvp.917geno_a4_fromvcf.phe",head=TRUE) dim(phenotype) [1] 917 15 map <- read.table("mvp.917geno_a4_fromvcf.map" , head = TRUE) dim(map) [1] 8252993 3 Kinship <- attach.big.matrix("mvp.917geno_a4_fromvcf.kin.desc") dim(Kinship) [1] 2 2 Covariates <- attach.big.matrix("mvp.917geno_a4_fromvcf.pc.desc") dim(Covariates) [1] 2 2

I prepared PLINK binaries and am also working on trying to load those an alternative to the .vcf. This is in progress.

Edit: It seems to be working with the PLINK binaries (although still in the early stages of running)! The issue with my .vcf input is not resolved, but I might be able to avoid the issue completely for now.

XiaoleiLiuBio commented 6 years ago

Please double check "vcf.jump" and "sep.vcf", and also check if the vcf file you used have the same format as displayed in the user manual

naglemi commented 6 years ago

I made sure there are 1529 header lines, so vcf.jump has the correct setting, right?

> grep -i "##" all_emmax_format_12_no_dup_real.vcf | wc -l
1529

The file is tab-seperated. Below are the first 10 columns of lines 1530 and 1531, which appear to match the format in the user manual.

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT my_genotype_1

1 52 01_52 2 1 . . PR GT 0/0

Perhaps the issue is a results of my file size (28GB) and limited memory available (64GB)? This seems to be causing an issue when working with my .bed file, which I'm about to make a separate issue post for.

Thanks for the help

XiaoleiLiuBio commented 6 years ago

It is not a memory issue. The big vcf file would be read into memory simultaneously. Could you please send me a small vcf file for the test. I am curious that why the top 1529 lines are all annotations.

naglemi commented 6 years ago

I sent the top 1600 lines of the vcf. As I mentioned in my message, there are many unassembled contigs, and some of these are as characters instead of numeric (as was necessary to prepare the vcf with PLINK due to limit of numeric contigs allowed)... Maybe the first thing I should do is make them all numeric?

XiaoleiLiuBio commented 6 years ago

Hi Michael Nagle,

Thank you for sharing the demo data, and you are right that there are 1529 header lines. I run following codes and it works:

MVP.Data(fileVCF="nagle_poplar_SNPs_head1600.vcf",

filePhe="Phenotype.txt",

     vcf.jump=1529,
     sep.vcf="\t",
     #sep.phe="\t",
     fileKin=FALSE,
     filePC=FALSE,
     out="mvp.vcf",
     #maxLine=10000
     )

But I find a problem in the data, the last header line, which starts with "#", there are 918 sample IDs while there are only 917 samples on the genotypic data lines (0/0,0/1,1/1 line).

Best, Xiaolei

hyacz commented 6 years ago

Closed because there is no activity for too long. if you have any related new messages, feel free to open it.