zhanxw / rvtests

Rare variant test software for next generation sequencing data

wald single test bug? #30

Closed ppjeep closed 7 years ago

ppjeep commented 7 years ago

Dear Dr. Zhan, Thanks for providing us with such an excellent program, rvtests. 1) I used 6 covariates (age,sex,mds1,mds2,mds3,mds4) in the wald single test in rvtests. It seems rvtest reports the covariates in a weird order in the output file. For example, age is missing and mds4 was reported twice.

command

rvtest --inVcf ${VCF} --out ${OUT}.single --single wald --numThread 8 --pheno ${phenF} --pheno-name zud5 --covar ${covF} --covar-name age,sex,mds1,mds2,mds3,mds4

output

CHROM  POS     REF  ALT  N_INFORMATIVE  Test      Beta         SE          Pvalue
1      762485  C    A    3018           1:762485  0.000704894  0.0564456   0.990036
1      762485  C    A    3018           sex       -0.0130407   0.00244367  9.47384e-08
1      762485  C    A    3018           mds1      -0.809614    0.0776035   1.75835e-25
1      762485  C    A    3018           mds2      -8.40048     2.10399     6.53407e-05
1      762485  C    A    3018           mds4      4.49323      11.3983     0.693432
1      762485  C    A    3018           mds3      2.85347      13.3588     0.830857
1      762485  C    A    3018           mds4      9.50253      13.3434     0.47637

2) Could you please clarify the missing value in phenotype/covariate files in rvtest? In the document of "Single variant tests", it states that "for binary trait, the recommended way of coding is to code controls as 1, cases as 2, missing phenotypes as -9 or 0". In the description of phenotype file, it states that "In phenotype file, missing values can be denoted by NA or any non-numeric values". In covariate file part, it states that "Missing data in the covariate file can be labeled by any non-numeric value (e.g. NA)". Does "NA" always indicate missing in any phen/cov file? Is there a different definition of missing in binary/quantitative traits?
3) It would be very helpful if rvtests could generate Manhattan plots and QQ plots by default. Thank you very much!

zhanxw commented 7 years ago

Thanks for this very helpful feedback.

  1. I plan to enforce the order of outputted covariates.
  2. Yes, "NA" always indicates missing. I intend to use "-9" and "0" in the phenotype file for binary traits to be compatible with PLINK.
  3. That's doable. I plan to provide R scripts to generate Manhattan plots and QQ plots.
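As a rough illustration of what a QQ plot of association results needs (this is not the R scripts mentioned above, and `qq_points` is a hypothetical helper name), the sketch below turns a list of p-values, such as the Pvalue column of an rvtests output file, into expected and observed -log10(p) pairs. It assumes the usual uniform-null expectation (rank - 0.5) / n; a Manhattan plot would instead place the observed -log10(p) against CHROM and POS.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, largest first.

    Values outside (0, 1] (including NaN) are dropped, since their
    -log10 is undefined or meaningless for a QQ plot.
    """
    valid = sorted(p for p in pvalues if 0.0 < p <= 1.0)
    n = len(valid)
    points = []
    for rank, p in enumerate(valid, start=1):
        expected = -math.log10((rank - 0.5) / n)  # uniform-null quantile
        observed = -math.log10(p)
        points.append((expected, observed))
    return points


# Hypothetical p-values, for illustration only.
points = qq_points([0.5, 0.05, 0.005, float("nan")])
```

The resulting pairs can be handed to any plotting tool; points lying well above the diagonal indicate more small p-values than expected under the null.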

ppjeep commented 7 years ago

mds4 (one of the covariates) was reported twice with DIFFERENT results while one of the variables (age) is missing. I am wondering if the model has bugs (not only the order of outputs). Thanks!
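A check like this can be scripted. The sketch below is a hypothetical helper, not part of rvtests; it assumes the whitespace-separated layout shown in the output above, with a `Test` column holding the covariate label and the first four columns identifying the variant. For each variant it reports which expected covariates are absent and which labels occur more than once.

```python
from collections import Counter, defaultdict


def check_covariate_rows(lines, expected):
    """Map each problem variant to its missing and duplicated labels."""
    header = lines[0].split()
    test_col = header.index("Test")
    labels = defaultdict(list)
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumption: CHROM, POS, REF, ALT together identify a variant.
        labels[tuple(fields[:4])].append(fields[test_col])

    problems = {}
    for variant, seen in labels.items():
        counts = Counter(seen)
        missing = [name for name in expected if counts[name] == 0]
        duplicated = sorted(name for name, n in counts.items() if n > 1)
        if missing or duplicated:
            problems[variant] = {"missing": missing, "duplicated": duplicated}
    return problems


# Label rows copied from the output reported in this issue.
sample_lines = [
    "CHROM POS REF ALT N_INFORMATIVE Test Beta SE Pvalue",
    "1 762485 C A 3018 1:762485 0.000704894 0.0564456 0.990036",
    "1 762485 C A 3018 sex -0.0130407 0.00244367 9.47384e-08",
    "1 762485 C A 3018 mds1 -0.809614 0.0776035 1.75835e-25",
    "1 762485 C A 3018 mds2 -8.40048 2.10399 6.53407e-05",
    "1 762485 C A 3018 mds4 4.49323 11.3983 0.693432",
    "1 762485 C A 3018 mds3 2.85347 13.3588 0.830857",
    "1 762485 C A 3018 mds4 9.50253 13.3434 0.47637",
]
report = check_covariate_rows(
    sample_lines, ["age", "sex", "mds1", "mds2", "mds3", "mds4"]
)
```

On these rows the report flags age as missing and mds4 as duplicated for variant 1:762485, which matches the description above.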

zhanxw commented 7 years ago

I see the problem now. But I cannot replicate it: I tried to create an example with 6 covariates as you did, and the result file looks fine (no duplicated covariates). Could you provide a minimal example? Thanks.

zhanxw commented 7 years ago

Please also let me know the version you used, if this problem persists. Thanks.

ppjeep commented 7 years ago

I am using the latest version (20170418). I think you can replicate this problem when you use a binary trait with a lot of (e.g., >50%) missing values (-9). I can also send you an example if you send me your email address. Here is my email: zhanght99@yahoo.com

ppjeep commented 7 years ago

BTW, it would be very helpful if rvtests had a --missing option to specify different types of missing values. Thanks!

zhanxw commented 7 years ago

I agree that --missing-phenotype can be helpful. RVTESTS currently treats -9 and 0 as missing in the binary trait mode. Allowing users to specify other missing values can be an added feature.
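To make the rules stated in this thread concrete, here is a minimal sketch of how a phenotype token could be classified: any non-numeric value (e.g. NA) is missing, and -9 or 0 are additionally missing for binary traits. The function name `parse_phenotype` and its `binary` parameter are hypothetical, this is not the parser rvtests uses, and treating a literal "nan" as missing is an assumption of the sketch.

```python
import math


def parse_phenotype(token, binary=False):
    """Return the value as a float, or None when it denotes missing."""
    try:
        value = float(token)
    except ValueError:
        return None  # non-numeric token such as "NA"
    if math.isnan(value):
        return None  # assumption: a literal "nan" also counts as missing
    if binary and value in (-9.0, 0.0):
        return None  # PLINK-compatible missing codes for binary traits
    return value


# Binary trait: 1 = control, 2 = case, -9 / 0 / NA = missing.
binary_values = [parse_phenotype(t, binary=True) for t in ["1", "2", "-9", "0", "NA"]]
# Quantitative trait: -9 and 0 are ordinary values, NA is missing.
quantitative_values = [parse_phenotype(t) for t in ["-9", "0", "3.5", "NA"]]
```

Under these rules the same file can yield different sample counts depending on the trait mode, so it is worth checking N_INFORMATIVE against the number of non-missing phenotypes.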

zhanxw commented 7 years ago

This latest version should solve the problem in $prefix.SingleWald.assoc files: http://zhanxw.com/rvtests/experimental/rvtests-20170613-01e018-linux64-static.tar.gz