whatshap / whatshap

Read-based phasing of genomic variants, also called haplotype assembly
https://whatshap.readthedocs.io
MIT License
343 stars 40 forks source link

Phasing human pseudoautosomal regions (PAR1 and PAR2) #424

Open ekirving opened 1 year ago

ekirving commented 1 year ago

I have a callset containing human X chromosomes, in which the males have been called with a ploidy of 2 inside the pseudoautosomal regions (PAR1 and PAR2) and ploidy of 1 for the rest of chrX.

Calling whatshap phase on these samples results in the following error:

whatshap.vcf.PloidyError: Inconsistent ploidy (2 and 1)

Is there a way to turn off this error in whatshap, as the region specific ploidy is both correct and expected in chrX for males?

My planned workaround is to extract the PAR regions into separate files, phase with whatshap, then merge the VCFs back together. But it would be much simpler and easier if I could set a flag to tell whatshap not to throw an error when encountering haploid genotypes.

ekirving commented 1 year ago

An even better solution would be to support a --ploidy-file parameter, like bcftools does, to specify sex-dependent ploidy levels, both across and within chromosomes.

ekirving commented 1 year ago

Having now tried phasing the PAR and non-PAR regions of chrX separately, I can confirm that this approach does not work. I had assumed that the whatshap.vcf.PloidyError was because the ploidy changed along the chromosome for male samples, however, whatshap will throw this error if any of the samples in the input VCF have a different ploidy level. As such, there does not appear to be a way to trio phase chrX, as the father is haploid and the mother is diploid in the non-PAR regions. Is this expected behaviour, or am I missing something?

marcelm commented 1 year ago

Hi, thanks for reporting. I haven’t thought this through, but would just forcing the genotype to 0/0 for the haploid regions be a workaround?

WhatsHap is indeed currently not capable of working with multiple ploidies in the same file. I haven’t worked on that part of the code in a long time, but I believe that changing the assumption that everything is diploid (for whatshap phase) requires quite some work. Perhaps not so many code changes are required, but understanding the existing code and finding all the places where the assumption is made would be the biggest part. (I wish we had some funding for projects like this.)

--ploidy-file sounds like a great feature, thanks for pointing that out.

ekirving commented 1 year ago

Hi Marcel, thanks for your suggestion. That sounds like it might work for trio phasing a female child, but it would not work for a male child (because they don't inherit their chrX from their father). In practice, this workaround would be quite fiddly to implement, as my pedigree also contains some multigenerational families, and there would be loss of phase information for the mother if all male children were excluded.

tobiasmarschall commented 1 year ago

Thanks for reporting and sorry I had missed this. I can add that we just encountered a use case where we'd also like whatshap haplotag to handle the PARs properly. So I think working in this would be good, but I agree it's probably a bit more involved to do it right. CC @samarendra-pani