Open Lcornet opened 2 years ago
Does the output VCF of whatshap contain no actual lines, just the header? Can you maybe copy the command line output of whatshap and post it here? My suspicion is that the sample names in the BAM file and the VCF file do not match and whatshap then thinks that there are no reads for the sample to be phased. You could maybe add the option --ignore-read-groups
and see if this makes a difference.
No, the VCF of whatshap contains more the the header:
contig_131_pilon 80 . G A 4.19008e-05 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=6;DPB=6;DPRA=0;EPP=7.35324;EPPR=5 .18177;GTI=0;LEN=1;MEANALT=1;MQM=3;MQMR=18.25;NS=1;NUMALT=1;ODDS=11.7976;PAIRED=0;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=136;RO=4;RPL=2;RPP=7.35324;RPPR =5.18177;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=4;SRP=11.6962;SRR=0;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0/0:6:4,2:4:136:2:74:0,-1.23634,-5.72567
This is the log: whasthap-log.txt
ANI: Average Nucleotide Indetity: haplotype1.fasta haplotype2.fasta 99.701 9042 9156
Using the --ignore-reads (VFC and log are also at the end of the message):
contig_131_pilon 80 . G A 4.19008e-05 . AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=6;DPB=6;DPRA=0;EPP=7.35324;EPPR=5.1 8177;GTI=0;LEN=1;MEANALT=1;MQM=3;MQMR=18.25;NS=1;NUMALT=1;ODDS=11.7976;PAIRED=0;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=74;QR=136;RO=4;RPL=2;RPP=7.35324;RPPR=5.1 8177;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=4;SRP=11.6962;SRR=0;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0/0:6:4,2:4:136:2:74:0,-1.23634,-5.72567
ANI: Average Nucleotide Indetity: haplotype1.fasta haplotype2.fasta 99.7006 9048 9156
This is the log: whasthap-log-ignore-groups.txt
I don't see any error in WhatsHap log (both), is it normal that the two haployype are so close in terms of ANI ? The genome was assemble by flye using the long reads, some of the contigs should thus already be phased by flye.
The run without the --ignore-read-groups
flag indeed cannot match the sample names in VCF ("unknown") with any sample in the BAM file. It says
WARNING: Sample 'unknown' not found in any BAM/CRAM file. Found 0 reads covering 0 variants Kept 0 reads that cover at least two variants each
So, whatshap did not phase anything. Enabling the flag now actually starts the phasing. As for the ANI value: Do you count homozyguous positions as well? Because 99% sequence identity does not sound weird to me, if you compare the two haplotypes of a human sample, for instance. Does 9156 stand for all contigs or just for the first one (contig_131_pilon)?
I count all position including homozyguous. We don't know level of heterozygosity, so i'ts difficult to estimate the efficiency of the phasing.
The ANI are for all contigs.
So, as no warning appears in the log when --ignore-read-groups are used, i guess that whatshap works as expected and phase the genome.
I have a question concerning the recommended workflow.
I used short reads to create the input VCF file and the ont long reads in whatshap. I used the pipeline below but my haplotype a nearly the same, and i expect that something is strange.
My phased VCF don’t have aplotype information:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown
Could you explain me what is wrong with my pipeline ? Shoudl I run freebayes with another option ?
Reference geneome
bwa index reference.fasta
Mapping Nanopore on reference genome
bwa mem -x ont2d -t 20 reference.fasta FAQ55769_all.fastq > ont.sam samtools sort ont.sam -@ 20 -o ont-sort.bam samtools index ont-sort.bam
Mapping short reads on reference
bwa mem -t 20 reference.fasta F21-s3_NGS21-T685_AHJVHJDSX2_S148_L001_R1_001.fastq F21-s3_NGS21-T685_AHJVHJDSX2_S148_L001_R2_001.fastq > short_read_mapping.sam samtools sort short_read_mapping.sam -@ 20 -o short_read_mapping.bam samtools index short_read_mapping.bam
VCF caller
freebayes -f reference.fasta short_read_mapping.bam > var.vcf
phase WF
whatshap phase -o phased.vcf --reference=reference.fasta var.vcf ont-sort.bam
haplotype
bgzip phased.vcf tabix phased.vcf.gz bcftools consensus -H 1 -f reference.fasta phased.vcf.gz > haplotype1.fasta