zhou-lab / biscuit

BISulfite-seq CUI Toolkit

SNP calling and known genotypes (question + feature suggestion) #16

Open jdidion opened 7 years ago

jdidion commented 7 years ago

I have genotype array, WGS, and WGBS data for my samples, and I am using this information to detect sample swaps. I find that Biscuit genotype calls are highly concordant with WGS genotype calls, except where the reference is 'C' and the true genotype is 'TT'. I understand that it is not possible to genotype accurately in this case, but I am curious about Biscuit's behavior. For example:

From the pileup, there is no evidence of a C allele:

chr1 852875 C 60 TTTTTTTTTTTTtttttttTTTTTTTtttTTTTttTTTttTTttTTTTtttTTTTTtttt

However, in the VCF, the allele support (the SP field) for this position shows 33 Cs and 26 Ts:

chr1 852875 . C T,A,G 34 PASS . DP:GT:GP:GQ:SP:CV:BT 60:0/1:84,4,115:99:C33,T26,A0:.:

Question: In this case, does Biscuit just generate the 'C' count from an expected distribution?

Suggestion: a nice feature would be detecting sample swaps when genotype information is known. Basically, just a script that compares a VCF of known genotypes to the Biscuit-generated VCF, ignores sites where it is difficult or impossible to genotype correctly from WGBS, and outputs a likelihood score that the two VCFs were generated from the same individual.
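To make the idea concrete, here is a minimal sketch of such a comparison in Python. It is not part of Biscuit. It assumes single-sample, tab-delimited VCF lines with a GT field, and it treats any site whose reference and called alleles include both C and T (or both G and A) as too ambiguous to compare. That rule is my own simplification of "difficult to genotype from WGBS". It also reports a plain concordance count rather than a true likelihood, and all function names are hypothetical.

```python
def parse_genotypes(vcf_lines):
    """Map (chrom, pos) -> (ref, sorted tuple of called bases) for the first sample."""
    calls = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample column
        chrom, pos = fields[0], int(fields[1])
        ref, alt = fields[3].upper(), fields[4].upper()
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt = fields[9].split(":")[keys.index("GT")].replace("|", "/")
        alleles = [ref] + alt.split(",")
        try:
            # Translate allele indices to bases so ALT ordering does not matter.
            called = tuple(sorted(alleles[int(i)] for i in gt.split("/")))
        except (ValueError, IndexError):
            continue  # missing ('.') or malformed genotype
        calls[(chrom, pos)] = (ref, called)
    return calls


def is_ambiguous(ref, *genotypes):
    """Assumed rule: C/T and G/A differences can be mimicked by bisulfite conversion."""
    bases = {ref}.union(*genotypes)
    return {"C", "T"} <= bases or {"G", "A"} <= bases


def concordance(known_lines, wgbs_lines):
    """Return (matching sites, compared sites) over shared, unambiguous positions."""
    known = parse_genotypes(known_lines)
    wgbs = parse_genotypes(wgbs_lines)
    matched = compared = 0
    for site in known.keys() & wgbs.keys():
        ref, gt_known = known[site]
        _, gt_wgbs = wgbs[site]
        if is_ambiguous(ref, gt_known, gt_wgbs):
            continue
        compared += 1
        matched += gt_known == gt_wgbs
    return matched, compared
```

Calling `concordance(open("known.vcf"), open("wgbs.vcf"))` on two uncompressed VCFs (file names are placeholders) gives the two counts. A matched sample should score far higher than a swapped one, but where to put the threshold is something I have not worked out.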

ttriche commented 7 years ago

bcftools gtcheck will do this if the VCFs are valid v4.1. I'm going to take a whack at that.

ttriche commented 7 years ago

6fc5d23682bc155c7834f3455a92886fb59dea74 fixes the VCF issue (don't look at how trivial the fix was, you will feel bad, I did). bcftools csq now works on the generated VCF files; bcftools gtcheck should too.