zhanxw / rvtests

Rare variant test software for next generation sequencing data
129 stars 41 forks source link

vcf2kinship: Incorrect IBS calculation for missing data #92

Closed the-x-at closed 4 years ago

the-x-at commented 4 years ago

If a line in the VCF file contains missing data for an individual, the kinship calculation becomes incorrect. This can be reproduced with the following example input VCF file rvtest-oneline.vcf consisting of only a single variant, rs1, and three individuals, X, Y, and Z, where X has not been genotyped:

#CHROM  POS  ID  REF ALT QUAL FILTER INFO FORMAT  X    Y   Z
     1    1 rs1    C   A    .      .   PR     GT  .  0/1 0/0

Running vcf2kinship with a small tweak to accept a high missing rate

./rvtests/executable/vcf2kinship --inVcf rvtest-oneline.vcf --ibs --out rvtest-online --maxMiss 0.5

generates output file rvtest-online.kinship, which contains negative kinship values:

FID     IID     X       Y       Z
X       X       0       -8      -7
Y       Y       -8      2       1
Z       Z       -7      1       2

The problem is located in the method IBSKinship::addGenotype in file vcfUtils/vcf2kinship.cpp, where in the case of exactly one (1) missing genotype the kinship is still calculated for this pair of individuals, such that the value of MISSING_GENOTYPE, a constant set to -9 in libVcf/VCFConstant.h, is being used for kinship calculation in matrix k:

if (g[i] >= 0 || g[j] >= 0) {
  k[i][j] += 2.0 - abs((int)g[i] - (int)g[j]);
  ++count[i][j];
}

This behavior can be reproduced with the latest binaries (v2.1.0, Linux 64bit).

zhanxw commented 4 years ago

You're right. Thanks @the-x-at .