zjshi / gt-pro

MIT License
23 stars 7 forks source link

GT-Pro genotype fails on reads with ambiguous bases other than "N" #57

Open bsmith89 opened 1 year ago

bsmith89 commented 1 year ago

I get a fairly ambiguous error sometimes when I run gzip -dc input.fq.gz | GT_Pro genotype -t 1 -l 32 -m 36 -d ref/gtpro/20190723_881species > output.gtpro_unparsed.tsv:

gt_pro  ref/gtpro/20190723_881species   1       no_overwrite
1684513662641:  [Info] Starting to load DB: ref/gtpro/20190723_881species
1684513662643:  [Info] MMAPPING ref/gtpro/20190723_881species_optimized_db_snps.bin
1684513662897:  [Info] MMAPPING ref/gtpro/20190723_881species_optimized_db_kmer_index.bin
1684513663999:  [Info] MMAPPING ref/gtpro/20190723_881species_optimized_db_mmer_bloom_36.bin
1684513664822:  [Info] MMAPPING ref/gtpro/20190723_881species_optimized_db_lmer_index_32.bin
1684513668293:  [Info] Done with init for optimized DB with 2856121626 kmers.  That took 5 seconds.
1684513668293:  [Info] Will input reads from stdin and output snps to stdout.
1684513668293:  [Info] Output will appear only after stdin reaches EOF.
1684513668302:  [Info] Waiting for all readers to quiesce
1684513668436:  [Done] searching is completed for the 0 reads input from /dev/stdin
[ERROR] Failed to parse somewhere past position 0 in presumed FASTQ file /dev/stdin
1684513668436:  [ERROR] Failed to parse somewhere past position 0 in presumed FASTQ file /dev/stdin
1684513668437:  0 million reads were scanned after 0 seconds
*** Failed for ALL 1 input files. ***
1684513668439:  Totally done: 0 seconds elapsed processing reads, after DB was loaded.

When I replace all characters in sequences that are not [ACGT] with N, this error goes away, and the output looks like this:

gt_pro  ref/gtpro/20190723_881species   1       no_overwrite
1684513745912:  [Info] Starting to load DB: ref/gtpro/20190723_881species
1684513745912:  [Info] MMAPPING ref/gtpro/20190723_881species_optimized_db_snps.bin
1684513746163:  [Info] MMAPPING ref/gtpro/20190723_881species_optimized_db_kmer_index.bin
1684513747263:  [Info] MMAPPING ref/gtpro/20190723_881species_optimized_db_mmer_bloom_36.bin
1684513748086:  [Info] MMAPPING ref/gtpro/20190723_881species_optimized_db_lmer_index_32.bin
1684513751568:  [Info] Done with init for optimized DB with 2856121626 kmers.  That took 5 seconds.
1684513751568:  [Info] Will input reads from stdin and output snps to stdout.
1684513751568:  [Info] Output will appear only after stdin reaches EOF.
1684513751576:  [Info] Waiting for all readers to quiesce
1684513753544:  [Done] searching is completed for the 10883 reads input from /dev/stdin
1684513753583:  [Stats] 80119 snps, 10883 reads, 1.03 hits/snp, for /dev/stdin
1684513753583:  0.01 million reads were scanned after 2 seconds
1684513753583:  Successfully processed 1 input files containing 10883 reads.
1684513753586:  Totally done: 2 seconds elapsed processing reads, after DB was loaded.

In this case I'm "metagenotyping" tiles from an assembled genome, and I realize that ambiguous bases don't generally occur in real reads, but I figured either failing with a more informative error message or integrating the normalization [^ACGT] -> N might be a fairly straight-forward feature enhancement.