xiaolei-lab / rMVP

:postbox: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool For Genome-Wide Association Study
Apache License 2.0

Can combine multiple covariates and load as single covariates file #70

Open shameem356 opened 3 years ago

shameem356 commented 3 years ago

Hello Team rMVP,

First of all, thank you so much for your wonderful software. I would like to clarify a doubt regarding multiple covariates. I have five PCs in a PC.txt file (pc1, pc2, pc3, pc4, pc5) and three scaling factor values in a scf.txt file (scf1, scf2, scf3). Can I combine these two files into a single file (pc_scf.txt) and load it as a covariates file using the command below? If not, how can I use the PC.txt and scf.txt files as covariates?

Note: the pc_scf.txt file has 8 columns (pc1, pc2, pc3, pc4, pc5, scf1, scf2, scf3).

    MVP.Data.PC("pc_scf.txt", out="mvp.pc_scf", sep='\t')
    Covariates_PC <- bigmemory::as.matrix(attach.big.matrix("mvp.pc_scf"))

hyacz commented 3 years ago

Hello. You can simply read your pc_scf.txt with read.table and then encode the covariates with the model.matrix function. You can refer to the following code:

cv <- model.matrix(~as.numeric(pc1)+as.numeric(pc2)+as.numeric(pc3)+as.numeric(pc4)+as.numeric(pc5)+as.factor(scf1)+as.factor(scf2)+as.factor(scf3), data=pc_scf)

MVP(..., CV.GLM=cv, CV.MLM=cv, CV.FarmCPU=cv, nPC.GLM=0, nPC.MLM=0, nPC.FarmCPU=0, ...)

Once you have calculated the PCs yourself and passed them through the CV.<model> parameter of a model, please set nPC.<model> to 0 so that MVP does not add PCs automatically. MVP.Data.PC is used for principal component analysis: its role is to obtain PCs from genotypes, not to load an existing covariates file.
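For the second part of the question (keeping PC.txt and scf.txt as separate files), a minimal sketch of combining them in R is shown below. The file contents are made-up stand-ins written to temporary files, and the sketch assumes both files list the same individuals in the same row order and are tab-separated with a header line.

```r
# Toy stand-ins for PC.txt and scf.txt (hypothetical values, 4 individuals)
pc_file  <- tempfile(fileext = ".txt")
scf_file <- tempfile(fileext = ".txt")
write.table(data.frame(pc1 = c(0.1, -0.2, 0.3, 0.0),
                       pc2 = c(1.0, 0.5, -0.5, -1.0)),
            pc_file, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(scf1 = c(1.2, 0.9, 1.1, 1.0)),
            scf_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read each file as an ordinary data frame
pc  <- read.table(pc_file,  header = TRUE, sep = "\t")
scf <- read.table(scf_file, header = TRUE, sep = "\t")

# Assumption: row i of both files refers to the same individual
stopifnot(nrow(pc) == nrow(scf))

# Column-bind into one table, equivalent to a combined pc_scf.txt
pc_scf <- cbind(pc, scf)
colnames(pc_scf)
```

The resulting pc_scf data frame can then be passed as the data argument of model.matrix, as in the reply above.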

shameem356 commented 3 years ago

Hello @hyacz, thank you so much for your quick reply and the code. I have updated my code as below based on your suggestion. I look forward to your feedback.

Converting PLINK files to rMVP format:

    library(rMVP)
    MVP.Data(fileBed="199sample_HF",
             filePhe=NULL,
             fileKin=TRUE,
             filePC=FALSE,
             priority="speed",
             maxLine=10000,
             out="mvp.199sample_HF")

Running FarmCPU GWAS:

    genotype <- attach.big.matrix("mvp.199sample_HF.geno.desc")
    phenotype <- read.table("179s_pheno.csv", head=TRUE)
    map <- read.table("mvp.199sample_HF.geno.map", head=TRUE)
    Kinship <- attach.big.matrix("mvp.199sample_HF.kin.desc")
    pc_scf <- read.table("179s_PC5_scf_for_mvp.csv", head=TRUE)
    cv <- model.matrix(~as.numeric(PC1)+as.numeric(PC2)+as.numeric(PC3)+as.numeric(PC4)+as.numeric(PC5)+as.factor(SCF_Red)+as.factor(SCF_Green)+as.factor(SCF_Blue), data=pc_scf)

    for(i in 2:ncol(phenotype)){
        imMVP <- MVP(
            phe=phenotype[, c(1, i)],
            geno=genotype,
            map=map,
            K=Kinship,
            CV.FarmCPU=cv,
            nPC.FarmCPU=0,
            priority="speed",
            ncpus=16,
            vc.method="BRENT",
            maxLoop=10,
            method.bin="FaST-LMM",
            permutation.threshold=TRUE,
            #permutation.rep=100,
            threshold=0.05,
            method=c("FarmCPU")
        )
        gc()
    }

shameem356 commented 3 years ago

@hyacz ,

After running the above code, the log file shows 'Number of provided covariates of FarmCPU: 540'. 179s_PC5_scf_for_mvp.csv has 179 samples (5 PCs + 3 scaling factor values, 179 × 8 = 1432 values). I would like to know why 'Number of provided covariates of FarmCPU' shows 540.

hyacz commented 3 years ago

The number of covariates reported in the log is the number of columns of the variable cv. There are 3 factors (SCF_Red, SCF_Green, SCF_Blue). Because each of them has many levels, model.matrix expands every factor into one indicator column per level beyond the first, which is how cv ends up with 540 columns.

I'm not sure whether I understand your data correctly. If SCF is a categorical variable, this is fine. If SCF is a quantitative variable, then as.numeric(SCF) should be used instead of as.factor(SCF) in model.matrix.
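The difference between the two encodings can be seen on a small toy table. The values below are invented for illustration; only model.matrix and its default treatment contrasts are assumed.

```r
# Toy data: 6 individuals, one PC and one covariate with 4 distinct values
toy <- data.frame(
  PC1     = c(0.1, -0.3, 0.2, 0.0, 0.4, -0.1),
  SCF_Red = c(1.01, 1.02, 1.03, 1.01, 1.04, 1.02)
)

# As a factor: intercept + PC1 + (number of levels - 1) indicator columns
cv_factor <- model.matrix(~ as.numeric(PC1) + as.factor(SCF_Red), data = toy)

# As a number: intercept + PC1 + a single SCF_Red column
cv_numeric <- model.matrix(~ as.numeric(PC1) + as.numeric(SCF_Red), data = toy)

# The factor column count can be predicted from the number of levels
n_levels <- nlevels(as.factor(toy$SCF_Red))
predicted_factor_cols <- 2 + (n_levels - 1)

ncol(cv_factor)   # 5
ncol(cv_numeric)  # 3
```

With continuous scaling factors, nearly every individual has its own value, so each factor contributes close to one column per individual. Counting the levels of each SCF column in the real table in the same way would show where the 540 columns come from.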

In addition, note that the order of individuals in cv must be consistent with the order in the phenotype and genotype.
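One way to enforce that ordering is to reorder the covariate table by sample ID before calling model.matrix. The sketch below assumes that both tables carry an ID column, here given the hypothetical name Taxa, and uses invented values. The same kind of check would be needed against the genotype sample order.

```r
# Hypothetical phenotype table: first column is the sample ID
phenotype <- data.frame(Taxa  = c("s1", "s2", "s3", "s4"),
                        trait = c(10.2, 11.5, 9.8, 10.9))

# Hypothetical covariate table stored in a different row order
pc_scf <- data.frame(Taxa    = c("s3", "s1", "s4", "s2"),
                     PC1     = c(0.3, 0.1, 0.4, 0.2),
                     SCF_Red = c(1.03, 1.01, 1.04, 1.02))

# Position of each phenotyped sample inside the covariate table
idx <- match(as.character(phenotype$Taxa), as.character(pc_scf$Taxa))

# Every phenotyped sample must have a covariate row
stopifnot(!anyNA(idx))

# Reorder covariates so that row i matches phenotype row i
pc_scf <- pc_scf[idx, ]
stopifnot(identical(as.character(pc_scf$Taxa), as.character(phenotype$Taxa)))

# Build the covariate matrix only after the rows are aligned
cv <- model.matrix(~ as.numeric(PC1) + as.numeric(SCF_Red), data = pc_scf)
```

Using match rather than sorting both tables keeps the phenotype order untouched, and the anyNA check stops the run if a phenotyped sample has no covariate row.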