whatshap / whatshap

Read-based phasing of genomic variants, also called haplotype assembly
https://whatshap.readthedocs.io
MIT License

Connecting different phase sets with each other #243

Open erikvdp opened 4 years ago

erikvdp commented 4 years ago

First of all, thank you for this amazing piece of open source software! I have been using WhatsHap for my own research, and I am investigating whether I can leverage it to achieve better phasing results with some genomic ONT sequencing data. If a GitHub issue is not the right place to ask for help, I am happy to remove it. Otherwise, it would be great if someone could help me with the following problem:

Due to the uniqueness of the sequencing protocol, I have one 'read' (let's call it a superread) that consists of multiple long reads (let's call these subreads) spaced quite far from each other but belonging to the same chromosome. I think the technology is quite similar to that of 10x Genomics. I also use the 'BX' tag to group the subreads of a superread together.

We have phased our mapped subreads using WhatsHap and have verified that, by making use of the 'BX' tag and some slight modifications, WhatsHap achieves superior results. Now I would like to do some postprocessing and connect (phase) different phase sets with each other, ideally arriving at whole-chromosome haplotyping. In theory this should be partly possible, because for one superread we have multiple subreads that span different phase sets.

My initial plan was to start with the 'whatshap haplotag' subcommand, because that code seems to contain a check for whether a read belongs to multiple phase sets.
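To make the idea concrete, here is a minimal Python sketch of the postprocessing step, not something WhatsHap provides. It assumes each subread has already been haplotagged, so that it carries a haplotype (the HP tag), a phase set (the PS tag) and the superread barcode (the BX tag). The function names, the `min_margin` threshold and the tuple input format are hypothetical; reading the tags from the BAM file is left out.

```python
from collections import defaultdict
from itertools import combinations


def link_votes(records):
    """Count orientation votes between phase sets.

    records: iterable of (barcode, phase_set, haplotype), haplotype in {1, 2}.
    Returns {(ps_a, ps_b): [same_votes, flip_votes]} with ps_a < ps_b.
    """
    # barcode -> phase set -> set of haplotypes seen among its subreads
    per_barcode = defaultdict(lambda: defaultdict(set))
    for barcode, phase_set, haplotype in records:
        per_barcode[barcode][phase_set].add(haplotype)

    votes = defaultdict(lambda: [0, 0])
    for phase_sets in per_barcode.values():
        # Ignore phase sets where the subreads of one superread disagree.
        consistent = {ps: next(iter(hps))
                      for ps, hps in phase_sets.items() if len(hps) == 1}
        for a, b in combinations(sorted(consistent), 2):
            flip = consistent[a] != consistent[b]
            votes[(a, b)][1 if flip else 0] += 1
    return dict(votes)


def chain_phase_sets(votes, min_margin=2):
    """Merge phase sets with a parity union-find.

    Returns {phase_set: (representative, flipped)}, where flipped says
    whether haplotypes 1 and 2 must be swapped relative to the representative.
    Links whose vote margin is below min_margin (an assumed cutoff) are skipped.
    """
    parent, parity = {}, {}

    def find(x):
        parent.setdefault(x, x)
        parity.setdefault(x, 0)
        if parent[x] != x:
            root, p = find(parent[x])
            parent[x] = root
            parity[x] ^= p  # parity is now relative to the root
        return parent[x], parity[x]

    # Apply the most strongly supported links first.
    ranked = sorted(votes.items(), key=lambda kv: -abs(kv[1][0] - kv[1][1]))
    for (a, b), (same, flip) in ranked:
        if abs(same - flip) < min_margin:
            continue
        root_a, par_a = find(a)
        root_b, par_b = find(b)
        if root_a != root_b:  # already linked pairs are left untouched
            parent[root_b] = root_a
            parity[root_b] = par_a ^ par_b ^ int(flip > same)

    result = {}
    for ps in list(parent):
        root, flipped = find(ps)
        result[ps] = (root, bool(flipped))
    return result
```

Each superread with consistent subreads in two phase sets casts one vote for "same orientation" or "flipped"; phase sets are then chained only where the votes clearly agree, so conflicting or sparse evidence leaves them unconnected.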

My question is whether somebody has a better idea of how to approach this.

Thanks in advance,

tobiasmarschall commented 4 years ago

Dear Erik,

thanks for the praise, great that WhatsHap is useful to you! Your scenario sounds somewhat similar to what we've done previously in these papers: http://dx.doi.org/10.1038/s41467-017-01389-4 and http://dx.doi.org/10.1038/s41467-018-08148-z

There, we represented long-range phase information (e.g. from Strand-seq or Hi-C) as a VCF where only a small subset of variants is phased, but over very long genomic distances. Such a VCF can then be combined with other sources of phase information, such as long reads or other such VCFs.

I'm not sure I fully understand your process, but you might try to use the phased VCF you obtain as PHASEINPUT in a second iteration (i.e. give that VCF in the place on the command line where you typically include a BAM file), while providing additional phase information.

Does that help?

Best,
Tobias
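For illustration, a second iteration along these lines could be assembled as follows. This is only a sketch: the helper name and all file names are hypothetical placeholders, and it assumes the usual `whatshap phase -o OUTPUT --reference FASTA VCF [PHASEINPUT ...]` form, with the phased VCF from the first round given where a BAM file would normally go.

```python
def build_second_pass_command(variants_vcf, first_pass_vcf, extra_inputs,
                              reference, output_vcf):
    """Build the argument list for a second 'whatshap phase' run.

    first_pass_vcf is passed as a PHASEINPUT next to any additional
    sources of phase information (BAM files or other phased VCFs).
    """
    return [
        "whatshap", "phase",
        "-o", output_vcf,
        "--reference", reference,
        variants_vcf,        # the variants to be phased
        first_pass_vcf,      # phased VCF used in place of a BAM file
        *extra_inputs,       # additional phase information
    ]


# Hypothetical file names, for illustration only.
cmd = build_second_pass_command(
    "variants.vcf", "first_pass.phased.vcf", ["long_range.phased.vcf"],
    "reference.fasta", "second_pass.phased.vcf",
)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the files exist.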