Similarity error count discrepancy with other aligners

cssulliv commented 3 years ago

Hi,

Summary: I had noticed a discrepancy in the quantified basepair differences between 2 metagenome-assembled genomes (MAGs) by pyANI via the Anvi'o software using the ANIm method when compared to two other genome aligners. So this is not a system error per se but I am seeking to better understand what ANIm considers as a error/basepair difference between 2 genomes.

Description: I had used the ANIm method in pyANI v0.2.7, which is implemented in Anvio v6.1, to identify differences in basepairs among 2 highly similar MAGs. I had separately used MAUVE and MAFFT in Geneious to compare basepair difference results. When I inspected the ANIm_similarity_errors.txt I noticed that pyANI had quantified an extra basepair difference (64bp total differences), while in MAUVE and MAFFT there were only 63bp differences between the 2 genomes. I wanted to know if pyANI computes for basepair difference/sequence similarity a certain way that might have caused the discrepancy?

Reproducible steps/ script used: Again, this was run through anvio and I had simply used their default settings, which can be found here :) https://merenlab.org/software/anvio/help/main/programs/anvi-compute-genome-similarity/

anvi-compute-genome-similarity --internal-genomes JdFR_HS_internal_genomes_ver1.txt --program pyANI --method ANIm --output-dir pyANI_ANIm_HS -T 2 --pan-db HIGH-SIMILARITYver1/JdFR_HIGH_SIM_ver1/JdFR_HIGH_SIM_ver1-PAN.db

pyani Version: v0.2.7

Python Version: Python 3.6.10

Operating System: 10.14.6

Thank you in advance for the assistance and clarification :)

baileythegreen commented 3 years ago

Hi @cssulliv

Thank you for your interest in pyani! The person who implemented ANIm in pyani is currently on holidays; they would normally field your question, but I'll do what I can to help.

Can you share your input data so I can try to reproduce the output? I believe you should be able to upload a .txt file here. I'll try a few things first, and then may ask you to run stuff on your end, if needed.

Similarity errors in `pyani`'s ANIm

As for how ANIm implemented in pyani is calculating the similarity errors, this information is largely being pulled from the Nucmer .delta files. From the MUMmer manual, the line following a sequence header is described as:

The four coordinates are the start and end in the reference and the start and end in the query respectively. The three digits following the location coordinates are the number of errors (non-identities + indels), similarity errors (non-positive match scores), and stop codons (does not apply to DNA alignments, will be "0"). An example header might look like:
2631 3401 2464 3234 15 15 2

What pyani's implementation of ANIm does to compute similarity errors is pull the fifth number from each such line, and sum them. (You can see this on line 164 here; keeping in mind that python is 0-indexed.) These are the numbers that according to the MUMmer documentation represent number of errors (non-identities + indels), rather than similarity errors (non-positive match scores). It is possible that this is the source of the difference you see (but I would like to test this further).

A similarity error will always be a non-identity, but a non-identity is not always a similarity error (ambiguous nucleotides can be considered similar, but not identical); the two numbers might be the same, or the number of non-identities could be higher (and the number in the .delta file—and thus, pyani's output—higher again if there are also indels to consider). I would venture to guess that the choice was made to use the number of non-identities in the pyani output as this would be more conservative, but I will try to find out for sure.

baileythegreen commented 3 years ago

If you have a .delta file from the run, that would be useful, too.

widdowquinn commented 5 months ago

Closed due to lack of activity.

widdowquinn / pyani