widdowquinn / scripts

Miscellaneous scripts
MIT License
22 stars 19 forks source link

mummer error in calculate_ani.py #2

Closed wwood closed 9 years ago

wwood commented 10 years ago

Hi,

I sometimes use your script calculate_ani (thanks..) for comparing genomes that may or may not have any sequence overlap, and I got this error:

INFO: Processing .delta files
INFO: Processing bins_ani/bin1.contigs_vs_BIN_882.delta
INFO: Query organism: bin1.contigs; Subject organism: BIN_882
ERROR: One or more of the NUCmer output files contains no useable output.
ERROR: One or more NUCmer runs failed. Please investigate.
ERROR: Please retry the NUCmer comparison for bin1.contigs vs BIN_882 manually
ERROR: Traceback (most recent call last):
  File "/srv/sw/widdowquinn_scripts/20140817_git_version_56613/bioinformatics/calculate_ani.py", line 692, in process_delta
    perc_id = 1 - 1. * tot_sim_error/tot_length
ZeroDivisionError: float division by zero

Traceback (most recent call last):
  File "/srv/sw/widdowquinn_scripts/20140817_git_version_56613/bioinformatics/calculate_ani.py", line 1088, in <module>
    methods[options.method]()
  File "/srv/sw/widdowquinn_scripts/20140817_git_version_56613/bioinformatics/calculate_ani.py", line 286, in calculate_anim
    lengths, sim_errors, perc_ids, perc_aln = process_delta(org_lengths)
  File "/srv/sw/widdowquinn_scripts/20140817_git_version_56613/bioinformatics/calculate_ani.py", line 713, in process_delta
    perc_ids[(qname, sname)] = perc_id
UnboundLocalError: local variable 'perc_id' referenced before assignment

After a bunch of time spent digging the problem boiled down to having a genome where the names in the fasta file were too long, and this is known to cause problems - http://sourceforge.net/p/mummer/mailman/mummer-help/thread/4C9CDBBB.1060705@bcgsc.ca/

Would it be possible to provide information on which mummer run failed (why not just raise an error when exitstatus != 0 and give the cmdline?), rather than letting the program continue, it it fails later on anyway?

While I'm here, a third file output which is simply perc_id * perc_aln would be handy - of course this is easy to create myself in a spreadsheet after the fact, though, so I'm not requesting this very ardently...

Thanks, ben

widdowquinn commented 10 years ago

Thanks Ben,

Thanks for using calculate_ani.py, and I'm glad to hear that you (usually) find it useful.

The reason why the program just continues is because all the command line jobs are scheduled using an asynchronous pool in multiprocessing. This makes it harder to get at subprocess exceptions than usual for two main reasons: the processes are not necessarily called in order, and there's extra work in rerouting exceptions through the subprocessing - they're not readily available through the multiprocessing interface. It's not impossible, but I didn't implement it because it was as easy (for me) to inspect the output to identify which job failed, and rerun the MUMmer code to determine the error directly. I'll look into it further, though.

Why not create a third file with perc_id * perc_aln? Mainly because I didn't see the value - it's not clear what the output means. If there's an entry of 0.75 in a cell of that table, what does it mean? 100% identity, and 75% coverage, or the reverse? 80% coverage, and 94% identity, or the reverse? and so on. I don't see that it's a very meaningful thing to calculate, and I don't want to encourage calculation of values that I don't see as useful ;)

Best wishes,

L.

widdowquinn commented 9 years ago

Closed as thread went silent.

wwood commented 9 years ago

Thanks for the response tho.