widdowquinn / scripts

Miscellaneous scripts
MIT License
22 stars 19 forks source link

perc_align values of >100% #5

Closed jvollme closed 8 years ago

jvollme commented 9 years ago

Hi, I'm trying to use your calculate_ani.py script to decide if different single-cell genomes might belong to the same species. However, when using nucmer (ANIm) i get very strange results for perc_align: here one comparison organism seems to have >200% of its fragments aligned.

Here are the results for "perc_aln.tab" using the method "ANIm":

# calculate_ani.py Tue Jul  7 11:52:18 2015
# Minimum % aligned nt
    SCG1_larger500  SCG2_larger500
SCG1_larger500  NA  2.10339806913
SCG2_larger500  0.341926838682  NA

Apparently more than 200% of SCG1_larger500 aligned to SCG2_larger500. How is that possible? in reverse, only 34% of SCG2_larger500 aligned to SCG1_larger500

for comparison, here are the results for "perc_aln.tab" using the method "ANIb":

# calculate_ani.py Tue Jul  7 11:51:57 2015
# Minimum % aligned nt
    SCG1_larger500  SCG2_larger500
SCG1_larger500  NA  0.094639697153
SCG2_larger500  0.0153845593644 NA

Here, only 1-10% of each sequence-fragments aligned to each other (which seems way more realistic)

Do you have any Idea what the problem might be?

widdowquinn commented 9 years ago

Hi John,

The first thing I should say is that this script is about to be deprecated - you would be better off using pyani, which can be found here: https://github.com/widdowquinn/pyani (see also https://github.com/widdowquinn/scripts/issues/4).

I can't really say why your data is giving the result it is without seeing either the sequences, or the MUMmer output directly. The perc_aln results are just \frac{total alignment length}{total length of organism sequence}, so a value greater than unity implies that the total alignment length reported is greater than the length of sequence for the (target) organism being aligned. My first thought is that could be a misparsing of the output, or because the alignment algorithm reports multiple query locations matching to the same target organism location(s). Unfortunately, I can't tell which it is (or if it is something else - e.g. if you have very short sequences and the number of gaps inserted into one or other alignment gives this result) from the output you provide.

It would be worth trying the same comparison in pyani, and seeing if you get the same result. If you can read/view the MUMmer output for that comparison, can you tell whether you have multiple matches to the same location (I don't think this ought to be the case with the settings the script uses), or whether it looks like the script is misparsing the file. If that doesn't help, I would need to see at least verbose logging output from the script, and potentially your MUMMer output and/or input sequences to diagnose properly.

Cheers,

L.