Return only the variants interrogated by the --chromosome argument

marcelm commented 5 years ago

Original report by Fong Chun Chan (Bitbucket: [Fong Chun Chan](https://bitbucket.org/Fong Chun Chan), ).

When using the --chromosome argument, whatshap phase will still return the other variants on other chromosomes. Is there a way to have whatshap only return the positions it interrogated as specified by the --chromosome. For instance, if I had the following input.vcf file:

##fileformat=VCFv4.1
chr1  877831  . T C 40  PASS  mut_type=germline GT  1/1
chr1  877868  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881019  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881025  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881602  . C T 40  PASS  mut_type=somatic  GT  0/1
chr2  45812 . C A 40  PASS  mut_type=somatic  GT  0/1
chr2  45862 . G T 40  PASS  mut_type=somatic  GT  0/1
chr2  45875 . C T 40  PASS  mut_type=somatic  GT  0/1
chr2  45895 . A G 40  PASS  mut_type=germline GT  1/1
chr2  45946 . C T 40  PASS  mut_type=somatic  GT  0/1

Running:

whatshap phase \
        --reference ucsc.hg19-GATKBundle/fasta/uncompressed/ucsc.hg19-GATKBundle.fa \
        --ignore-read-groups \
        --indels \
        --chromosome chr1 \
        input.vcf \
        input.bam \
    > phased.vcf

Produces this phased VCF

##fileformat=VCFv4.1
##commandline="(whatshap 0.18) phase --reference /data/atx-pipeline-data/paris/genome/ucsc.hg19-GATKBundle/fasta/uncompressed/ucsc.hg19-GATKBundle.fa --ignore-read-groups --indels --chromosome chr1 test.vcf data/bams/amsim_pcawg_phaseable.sorted.cleaned.dups_remov
ed.bam"
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#hr1  877831  . T C 40  PASS  mut_type=germline GT  1/1
chr1  877868  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881019  . G T 40  PASS  mut_type=somatic  GT:PS 0|1:881019
chr1  881025  . G T 40  PASS  mut_type=somatic  GT:PS 0|1:881019
chr1  881602  . C T 40  PASS  mut_type=somatic  GT  0/1
chr2  45812 . C A 40  PASS  mut_type=somatic  GT  0/1
chr2  45862 . G T 40  PASS  mut_type=somatic  GT  0/1
chr2  45875 . C T 40  PASS  mut_type=somatic  GT  0/1
chr2  45895 . A G 40  PASS  mut_type=germline GT  1/1
chr2  45946 . C T 40  PASS  mut_type=somatic  GT  0/1

It only phases chromosome 1, but still returns the variants from chromosome 2. The reason I am interested in this is so that I can parallelize across chromosomes. Concatenating the results afterwards will be a lot easier if I knew that each run didn't contain all the original results.

Thanks in advance for your help.

marcelm commented 5 years ago

Original comment by Tobias Marschall (Bitbucket: tobiasmarschall, GitHub: tobiasmarschall).

We also usually parallelize by chromosome. This is something that we definitely should teach WhatsHap to do natively. In fact there's Issue #31 on this, sitting there since 2014. It would be easy to add an option to only output the requested chromosome. On the other hand, one can just pipe the output of WhatsHap directly into bcftools view to subset the variants to a chromosome, so I would prioritize working on a parallelized WhatsHap.

marcelm commented 5 years ago

The basic principle at the moment is that WhatsHap will take the input VCF, augment it with phasing information where it can and output it otherwise unchanged. I don’t think we should change this, and I also don’t think we should add an option to add the requested feature.

I agree we should prioritize issue #31 (parallelization) instead.

marcelm commented 5 years ago

Original comment by Fong Chun Chan (Bitbucket: [Fong Chun Chan](https://bitbucket.org/Fong Chun Chan), ).

Thanks both. This makes sense. Looking forward to the resolution of the parallelization issue.

whatshap / whatshap

Return only the variants interrogated by the --chromosome argument #185