Closed naglemi closed 6 years ago
Thanks for your using of MVP!
I suggest that you could transform your .tped/.tfam files to binary files instead of .vcf, then try MVP.Data
again. If there is still something wrong, please let know!
Cheers!
Hi naglemi,
Thank you for using MVP. Please use dim() function in R to show me the dimension of the data generated by MVP.Data.
Please attention that .map is SNP information file, and the genotype file is suffixed with ".desc" and ".bin".
Best, Xiaolei
Thanks for taking a look.
genotype <- attach.big.matrix("mvp.917geno_a4_fromvcf.geno.desc") dim(genotype) [1] 8252993 2 phenotype <- read.table("mvp.917geno_a4_fromvcf.phe",head=TRUE) dim(phenotype) [1] 917 15 map <- read.table("mvp.917geno_a4_fromvcf.map" , head = TRUE) dim(map) [1] 8252993 3 Kinship <- attach.big.matrix("mvp.917geno_a4_fromvcf.kin.desc") dim(Kinship) [1] 2 2 Covariates <- attach.big.matrix("mvp.917geno_a4_fromvcf.pc.desc") dim(Covariates) [1] 2 2
I prepared PLINK binaries and am also working on trying to load those an alternative to the .vcf. This is in progress.
Edit: It seems to be working with the PLINK binaries (although still in the early stages of running)! The issue with my .vcf input is not resolved, but I might be able to avoid the issue completely for now.
Please double check "vcf.jump" and "sep.vcf", and also check if the vcf file you used have the same format as displayed in the user manual
I made sure there are 1529 header lines, so vcf.jump has the correct setting, right?
> grep -i "##" all_emmax_format_12_no_dup_real.vcf | wc -l
1529
The file is tab-seperated. Below are the first 10 columns of lines 1530 and 1531, which appear to match the format in the user manual.
1 52 01_52 2 1 . . PR GT 0/0
Perhaps the issue is a results of my file size (28GB) and limited memory available (64GB)? This seems to be causing an issue when working with my .bed file, which I'm about to make a separate issue post for.
Thanks for the help
It is not a memory issue. The big vcf file would be read into memory simultaneously. Could you please send me a small vcf file for the test. I am curious that why the top 1529 lines are all annotations.
I sent the top 1600 lines of the vcf. As I mentioned in my message, there are many unassembled contigs, and some of these are as characters instead of numeric (as was necessary to prepare the vcf with PLINK due to limit of numeric contigs allowed)... Maybe the first thing I should do is make them all numeric?
Hi Michael Nagle,
Thank you for sharing the demo data, and you are right that there are 1529 header lines. I run following codes and it works:
MVP.Data(fileVCF="nagle_poplar_SNPs_head1600.vcf",
vcf.jump=1529,
sep.vcf="\t",
#sep.phe="\t",
fileKin=FALSE,
filePC=FALSE,
out="mvp.vcf",
#maxLine=10000
)
But I find a problem in the data, the last header line, which starts with "#", there are 918 sample IDs while there are only 917 samples on the genotypic data lines (0/0,0/1,1/1 line).
Best, Xiaolei
Closed because there is no activity for too long. if you have any related new messages, feel free to open it.
I'm having trouble with my data in MVP that leads to outputs lacking all genotype information per individuals in the GWAS population.
My SNP data was in .tped/.tfam, so I converted it to .vcf with PLINK before loading this data into MVP.Data. Here's the first 10 items from the last line of my .vcf input:
scaffold_724 20159 724_20159 2 1 . . PR GT 0/1
I used this command:
Everything appeared to be working fine, until I realized when the processing was nearly finished that the output files were only about 200MB combined (and my .vcf input is 28GB). Then, this happened:
... [1] "Number of Markers Written into File: 1: 8252000" [1] "Number of Markers Written into File: 1: 8252993" [1] "Preparation for numeric data is done!" [1] "Preparation for PHENOTYPE data is done!" [1] "Output mvp genotype..." [1] "Calculate KINSHIP using Vanraden method..." [1] "Preparation for Kinship matrix is done!" [1] "Principal Component Analysis Start ..." means for first 10 snps: [1] 0 0 0 0 0 0 0 0 0 0 [1] "Preparation for PC matrix is done!" [1] "MVP data prepration accomplished successfully!" Warning message: In big.PCA(bigMat = big.geno, pcs.to.keep = pcs.keep) : selected too many PCs to keep [5], changing to 2
The genotype file (.map) contains only 3 columns and doesn't show the genotypes for each individual, just the Chr, SNP position and SNP ID.
Thanks for making this tool and for the help!