whatshap / whatshap

Read-based phasing of genomic variants, also called haplotype assembly
https://whatshap.readthedocs.io
MIT License
343 stars 40 forks source link

Error reading bgzipped reference #151

Open marcelm opened 6 years ago

marcelm commented 6 years ago

Original report by Tobias Marschall (Bitbucket: tobiasmarschall, GitHub: tobiasmarschall).


I just tried to use a bgzipped reference (see Issue #149), but failed:

$ whatshap phase --chromosome chr22 --reference hg38-capitalized.fasta --ignore-read-groups --output NA12878.chr22.phased-with-allcaps-ref.vcf.gz NA12878.chr22.unphased.vcf.gz /MMCI/TM/nanopore/work/jebler/paper_experiments/data/np2.fixed/NA12878.hg38.np2.chr22.bam

[WORKS AS EXPECTED]

$ bgzip -c hg38-capitalized.fasta > hg38-capitalized.fasta.gz

$ samtools faidx hg38-capitalized.fasta.gz

$ whatshap phase --chromosome chr22 --reference hg38-capitalized.fasta.gz --ignore-read-groups --output test.vcf.gz NA12878.chr22.unphased.vcf.gz /MMCI/TM/nanopore/work/jebler/paper_experiments/data/np2.fixed/NA12878.hg38.np2.chr22.bamThis is WhatsHap 0.14.1 running under Python 3.6.2
Working on 1 samples from 1 family
======== Working on chromosome 'chr22'
---- Processing individual NA12878
Using maximum coverage per sample of 15X
Number of variants skipped due to missing genotypes: 0
Number of remaining heterozygous variants: 31534
Reading alignments and detecting alleles ...
Traceback (most recent call last):
  File "/MMCI/TM/structvar/work/marschal/miniconda3/bin/whatshap", line 11, in <module>
    load_entry_point('whatshap==0.14.1', 'console_scripts', 'whatshap')()
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/whatshap/__main__.py", line 82, in main
    module.main(args)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/whatshap/phase.py", line 768, in main
    run_whatshap(**vars(args))
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/whatshap/phase.py", line 480, in run_whatshap
    readset, vcf_source_ids = read_reads(readset_reader, chromosome, phasable_variant_table.variants, bam_sample, fasta, phase_input_vcfs, numeric_sample_ids, phase_input_bam_filenames)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/whatshap/phase.py", line 120, in read_reads
    readset = readset_reader.read(chromosome, variants, sample, reference)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/whatshap/variants.py", line 65, in read
    reads = self._alignments_to_readdict(alignments, variants, sample, reference)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/whatshap/variants.py", line 119, in _alignments_to_readdict
    reference = reference[:]
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 698, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 887, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 532, in fetch
    seq = self.from_file(name, start, end)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/pyfaidx/__init__.py", line 572, in from_file
    self.file.seek(i.offset)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 615, in seek
    self._load_block(start_offset)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 577, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/MMCI/TM/structvar/work/marschal/miniconda3/lib/python3.6/site-packages/Bio/bgzf.py", line 415, in _load_bgzf_block
    % (_bgzf_magic, magic, handle.tell()))
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'd@\x7f\xde'; handle.tell() now says 25611
marcelm commented 6 years ago

I cannot find it right now, but when going through some of the pyfaidx issues, I read that for bgzipped references, the .fai created by pyfaidx is not compatible with the one created by samtools faidx. They somehow use different offsets.

marcelm commented 6 years ago

There it is: https://github.com/mdshw5/pyfaidx/issues/126