Does bgc-hm handle missing data?

YingChen94 commented 4 months ago

Hello, can I use SNPs that have missing data in some individuals? Or should I only use SNPs with no missing data across all samples? Thank you in advance!

zgompert commented 4 months ago

BGC-HM handles missing data via the genotype likelihood model, which can be used to specify any level of (un)certainty in genotypes. Often with NGS data, missingness is not binary; rather, there are different degrees of uncertainty depending on the number and quality of reads.

If you want to treat missingness as binary (either perfect knowledge or no information), you can still use the genotype likelihood model. For a known genotype, give a relative likelihood (basically a probability) of 1 to that genotype at the locus and 0 to the other two possible genotypes. For a missing genotype, assign a relative likelihood of 1/3 to each of the three possible genotypes. An alternative would be to specify relative likelihoods for missing genotypes based on prior expectations from population allele frequencies (i.e., higher likelihoods for homozygotes harboring a common allele than for homozygotes for the rarer allele, etc.).

If there is sufficient interest, I can add an option in the future to actually drop/skip some loci for some individuals.
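As a rough illustration of the encoding described above, here is a minimal Python sketch. The function name `genotype_likelihoods`, the coding of genotypes as 0, 1, or 2 copies of one allele, and the use of Hardy-Weinberg proportions for the allele-frequency alternative are all assumptions made for this example; none of it is taken from bgc-hm, and it does not show the input format bgc-hm expects.

```python
def genotype_likelihoods(genotype=None, allele_freq=None):
    """Relative likelihoods for the three genotypes (0, 1, or 2 copies of an allele).

    genotype:    0, 1, or 2 if the genotype is known; None if it is missing.
    allele_freq: optional frequency of the counted allele, used only for a
                 missing genotype (hypothetical helper, not part of bgc-hm).
    """
    if genotype is not None:
        if genotype not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1, 2, or None")
        # Known genotype: likelihood 1 for it, 0 for the other two.
        return [1.0 if g == genotype else 0.0 for g in (0, 1, 2)]

    if allele_freq is None:
        # Missing genotype, no prior information: 1/3 for each genotype.
        return [1.0 / 3.0] * 3

    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    # Missing genotype with a prior expectation from allele frequencies.
    # Assumption: Hardy-Weinberg proportions, which may be a poor fit in hybrids.
    p = allele_freq
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


# One locus for three individuals: known, missing (flat), missing (frequency-based).
rows = [
    genotype_likelihoods(2),
    genotype_likelihoods(None),
    genotype_likelihoods(None, allele_freq=0.9),
]
for row in rows:
    print(" ".join(f"{x:.4f}" for x in row))
```

With a frequency of 0.9 for the counted allele, the frequency-based row puts most of its weight on the homozygote for the common allele, matching the "higher likelihoods for homozygotes harboring a common allele" idea.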

YingChen94 commented 4 months ago

OK, good to know. Thank you, Dr. Gompert!

zgompert / bgc-hm