whatshap / whatshap

Read-based phasing of genomic variants, also called haplotype assembly
https://whatshap.readthedocs.io
MIT License
321 stars 37 forks source link

AssertionError with whatshap phase #224

Closed marcelm closed 4 years ago

marcelm commented 4 years ago

Original report by Weisheng Wu (Bitbucket: [Weisheng Wu](https://bitbucket.org/Weisheng Wu), ).


I’ve tried to run whatshap on a vcf with pedigree and bams, but it failed with an “AssertionError“. Could someone take a look at my command line and the log and tell me where the problem is? Thanks.

My command line is:

whatshap phase -o DNA.phased.chr1.vcf --reference Homo_sapiens_assembly38.fasta --indels --chromosome chr1 --use-ped-samples --ped trio_FAM001.ped --genmap plink.chr1.GRCh38.map DNA.unphased.chr1.vcf Sample_2296_recal.bam Sample_2295_recal.bam Sample_4166_recal.bam Sample_0636_recal.bam Sample_0636_recal.bam 2>&1|tee >whashap_phase_chr1.log

DNA.phased.chr1.vcf“ is a vcf that contains chr1 variants in 600 samples. I’m using “--use-ped-samples“ so whatshap should only phase the samples in the pedigree file. “trio_FAM001.ped“ is the pedigree file that contains 5 samples (one father, one mother and three children):

FAM001 sample_2296 sample_0636 sample_0636 1 0
FAM001 sample_2295 sample_0636 sample_0636 2 0
FAM001 sample_4166 sample_0636 sample_0636 1 0
FAM001 sample_0636 0 0 1 0
FAM001 sample_0636 0 0 2 0

Here is part of the log:

Reading alignments for sample 'sample_2296'and detecting alleles ...
[W::hts_idx_load2] The index file is older than the data file: Sample_2296_recal.bai
Found 40265 reads covering 497 variants
Kept 7989 reads that cover at least two variants each
Reducing coverage to at most 3X by selecting most informative reads ...
Selected 337 reads covering 321 variants
Variants covered by at least one phase-informative read in at least one individual after read selection: 338
Variants either covered by phase-informative read or homozygous in at least one individual: 489
Traceback (most recent call last):
File "/home/weishwu/.conda/envs/whatshap/bin/whatshap", line 12, in
sys.exit(main())
File "/home/weishwu/.conda/envs/whatshap/lib/python3.6/site-packages/whatshap/main.py", line 83, in main
module.main(args)
File "/home/weishwu/.conda/envs/whatshap/lib/python3.6/site-packages/whatshap/phase.py", line 1114, in main
run_whatshap(**vars(args))
File "/home/weishwu/.conda/envs/whatshap/lib/python3.6/site-packages/whatshap/phase.py", line 847, in run_whatshap
recombination_costs = recombination_cost_map(load_genetic_map(genmap), accessible_positions)
File "/home/weishwu/.conda/envs/whatshap/lib/python3.6/site-packages/whatshap/pedigree.py", line 106, in load_genetic_map
assert len(line_spl) == 3
AssertionError

marcelm commented 4 years ago

Your genetic map file (--genmap) is not in the format that WhatsHap needs. Does your file follow the format as described in the documentation?

We definitely need to improve the error message here.

marcelm commented 4 years ago

Print proper error messages when genetic map file cannot be parsed

Closes #224

marcelm commented 4 years ago

This has now been fixed in the development version. This is how it looks now:

$ whatshap phase --ped=tests/data/trio.ped --genmap=invalid.map --chromosome=1 -o /dev/null tests/data/trio.vcf trio.pacbio.bam
This is WhatsHap 0.19.dev154+g8359d1d.d20200221 running under Python 3.7.5
Using region-specific recombination rates from genetic map invalid.map.
ERROR: whatshap error: Error at line 2 of genetic map file 'invalid.map': Found 4 fields instead of 3

Your AssertionError could also arise if there were any empty lines in the file (such as at the end). This is now also fixed: Empty lines are simply ignored.

marcelm commented 4 years ago

Original comment by Weisheng Wu (Bitbucket: [Weisheng Wu](https://bitbucket.org/Weisheng Wu), ).


Indeed my genetic map is in wrong format. Sorry I didn’t notice it:

$ head plink.chr1.GRCh38.map
1 . 0 55550
1 . 0.080572 82571
1 . 0.092229 88169
1 . 0.439456 285245
1 . 1.478148 629218
1 . 1.478214 629241
1 . 1.480558 630053
1 . 1.488889 632942

Where can I download the correctly formatted hg38 genetic map? I’ve looked at shapeit and beagle. They either don’t have recombination rate or only have hg19.

marcelm commented 4 years ago

Original comment by Tobias Marschall (Bitbucket: tobiasmarschall, GitHub: tobiasmarschall).


We don’t provide a ready-made genetic map right now. Sorry. I’m afraid you’d have to create this yourself. In my experience however the genetic map for read-based trio phasing is much less important than for statistical phasing. Therefore we rarely use this option ourselves and just run with a uniform recombination rate (i.e. just omit --genmap). In a way, that can even have advantages because it’d detect recombination in an unbiased manner.

marcelm commented 4 years ago

Original comment by Weisheng Wu (Bitbucket: [Weisheng Wu](https://bitbucket.org/Weisheng Wu), ).


Thanks. I’ll try it without a genetic map file.