whatshap / whatshap

Read-based phasing of genomic variants, also called haplotype assembly
https://whatshap.readthedocs.io
MIT License
331 stars 38 forks source link

No phasing info in vcf #363

Open Lcornet opened 2 years ago

Lcornet commented 2 years ago

I have a question concerning the recommended workflow.

I used short reads to create the input VCF file and the ont long reads in whatshap. I used the pipeline below but my haplotype a nearly the same, and i expect that something is strange.

My phased VCF don’t have aplotype information:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown

Could you explain me what is wrong with my pipeline ? Shoudl I run freebayes with another option ?

Reference geneome

bwa index reference.fasta

Mapping Nanopore on reference genome

bwa mem -x ont2d -t 20 reference.fasta FAQ55769_all.fastq > ont.sam samtools sort ont.sam -@ 20 -o ont-sort.bam samtools index ont-sort.bam

Mapping short reads on reference

bwa mem -t 20 reference.fasta F21-s3_NGS21-T685_AHJVHJDSX2_S148_L001_R1_001.fastq F21-s3_NGS21-T685_AHJVHJDSX2_S148_L001_R2_001.fastq > short_read_mapping.sam samtools sort short_read_mapping.sam -@ 20 -o short_read_mapping.bam samtools index short_read_mapping.bam

VCF caller

freebayes -f reference.fasta short_read_mapping.bam > var.vcf

phase WF

whatshap phase -o phased.vcf --reference=reference.fasta var.vcf ont-sort.bam

haplotype

bgzip phased.vcf tabix phased.vcf.gz bcftools consensus -H 1 -f reference.fasta phased.vcf.gz > haplotype1.fasta

schrins commented 2 years ago

Does the output VCF of whatshap contain no actual lines, just the header? Can you maybe copy the command line output of whatshap and post it here? My suspicion is that the sample names in the BAM file and the VCF file do not match and whatshap then thinks that there are no reads for the sample to be phased. You could maybe add the option --ignore-read-groups and see if this makes a difference.

Lcornet commented 2 years ago

No, the VCF of whatshap contains more the the header:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown

contig_131_pilon 80 . G A 4.19008e-05 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=6;DPB=6;DPRA=0;EPP=7.35324;EPPR=5 .18177;GTI=0;LEN=1;MEANALT=1;MQM=3;MQMR=18.25;NS=1;NUMALT=1;ODDS=11.7976;PAIRED=0;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=136;RO=4;RPL=2;RPP=7.35324;RPPR =5.18177;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=4;SRP=11.6962;SRR=0;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0/0:6:4,2:4:136:2:74:0,-1.23634,-5.72567

This is the log: whasthap-log.txt

ANI: Average Nucleotide Indetity: haplotype1.fasta haplotype2.fasta 99.701 9042 9156

Using the --ignore-reads (VFC and log are also at the end of the message):

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown

contig_131_pilon 80 . G A 4.19008e-05 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=6;DPB=6;DPRA=0;EPP=7.35324;EPPR=5.1 8177;GTI=0;LEN=1;MEANALT=1;MQM=3;MQMR=18.25;NS=1;NUMALT=1;ODDS=11.7976;PAIRED=0;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=136;RO=4;RPL=2;RPP=7.35324;RPPR=5.1 8177;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=4;SRP=11.6962;SRR=0;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0/0:6:4,2:4:136:2:74:0,-1.23634,-5.72567

ANI: Average Nucleotide Indetity: haplotype1.fasta haplotype2.fasta 99.7006 9048 9156

This is the log: whasthap-log-ignore-groups.txt

I don't see any error in WhatsHap log (both), is it normal that the two haployype are so close in terms of ANI ? The genome was assemble by flye using the long reads, some of the contigs should thus already be phased by flye.

schrins commented 2 years ago

The run without the --ignore-read-groups flag indeed cannot match the sample names in VCF ("unknown") with any sample in the BAM file. It says

WARNING: Sample 'unknown' not found in any BAM/CRAM file. Found 0 reads covering 0 variants Kept 0 reads that cover at least two variants each

So, whatshap did not phase anything. Enabling the flag now actually starts the phasing. As for the ANI value: Do you count homozyguous positions as well? Because 99% sequence identity does not sound weird to me, if you compare the two haplotypes of a human sample, for instance. Does 9156 stand for all contigs or just for the first one (contig_131_pilon)?

Lcornet commented 2 years ago

I count all position including homozyguous. We don't know level of heterozygosity, so i'ts difficult to estimate the efficiency of the phasing.

The ANI are for all contigs.

So, as no warning appears in the log when --ignore-read-groups are used, i guess that whatshap works as expected and phase the genome.