widdowquinn / pyani

Application and Python module for average nucleotide identity analyses of microbes.
http://widdowquinn.github.io/pyani/
MIT License
197 stars 55 forks source link

High ANI low coverage? #139

Closed elzerac closed 5 years ago

elzerac commented 5 years ago

I would really appreciate some help at interpreting the PYANI output.

METHODS: I am aligning some genomes I sequenced from pure isolates to metagenomic assembled genomes from other studies. I assembled my genomes using SPAdes and the studies I'm looking at are assembled by metaSPAdes. My genomes are still in contigs as are the metagenomic contigs. I am using ANIm method.

QUESTION: I am getting very high identity matches by ANI (>99% and in some cases 100%) but the genome coverage is very low. In other cases, I am getting very high coverage but low identify. Genome coverage and coverage length is usually concordant (i.e. if coverage is low, length is low as well). Could you explain in what cases the discordance between ANI and coverage may occur?

Related to this:

1- is it possible for 100% ANI sequences to be found in two different individuals? ie the same exact strain?

2- Why do I see differences in ANI when I align genome A to genome B versus genome B to genome A?

3- How come sometimes I get coverage >100%?

Thank you so much for your help. Best,

Elze

widdowquinn commented 5 years ago

Hi Elze,

There's quite a bit to unpack, there… these are questions that get right to the heart of the applicability of these approaches, and should make you a little sceptical when you read the literature ;)

Firstly, ANI (all methods) calculates percentage identity for homologous regions shared by pairs of genomes. If the homologous regions are short with respect to the the total length of one or other genome (e.g. a lateral gene transfer event between distantly-related bacteria), then ANI can be high even though there is low coverage. There are other potential causes for this kind of observation, and in general I would advise that ANI %identity must be interpreted in conjunction with %coverage. We describe this in the paper in CITATIONS. I typically use a coverage cutoff of 50% to indicate relatedness worth investigating.

"is it possible for 100% ANI sequences to be found in two different individuals? ie the same exact strain?"

If you sequenced the same isolate twice, I'd expect the ANI of the pairwise comparison to be at or near 100%. If you sequenced two isolates that happened to be clonal, or near-clonal, I'd expect the same. In practice, there's always a bit of "wobble"

I'm not using the word "strain" here because there can be some ambiguity over what is meant by that.

"Why do I see differences in ANI when I align genome A to genome B versus genome B to genome A?"

If you're using ANIb, then this is typically because BLAST alignment of A to B is (mathematically and practically) not the same thing as the alignment of B to A. With ANIb, when a genome is the target sequence, it is a set of 1020nt fragments; when it's the query it's a (near-)contiguous sequence. The whole operation is fundamentally asymmetrical. This is the way the algorithm is defined in the original publications. I prefer ANIm because of this.

ANIm uses MUMmer which - in principle - can produce symmetrical alignments. It does a better job of doing so than ANIb, but in practice it's not quite symmetrical.

"3- How come sometimes I get coverage >100%?"

There can be several reasons. Sometimes it can be accumulated rounding error (small amounts over 100%), sometimes it's because the alignment involves gaps: the coverage is calculated as alignment length / genome length - if there's an insertion in the target genome and the query can align across it (by gap insertions) then the alignment length can be greater than the query genome length.

NOTE: it is also possible that all of these results can be due ti bugs in the application. When I see these events, I check the alignments to make sure that the code is interpreting them correctly. If you see the code giving you the wrong results for any alignments, please do raise a bug report.

Cheers,

L.

elzerac commented 5 years ago

Thank you so much! This helps a lot. You are right: I am finding that some studies are reporting only ANI and not genome coverage. I will be sure to not be one of these people! A comment also on ANI speciation cut-offs (94/95%): perhaps we should also consider a coverage cutoff for speciation.

This may be very naive and I did read your paper, but I am still unsure about this: how can one know what is homologous without first knowing the percent identity?

Thanks again for your quick response, I learned a lot.

widdowquinn commented 5 years ago

Naïve questions are the best. They expose our faulty assumptions.

"A comment also on ANI speciation cut-offs (94/95%): perhaps we should also consider a coverage cutoff for speciation."

I use a cutoff of 50% coverage on the grounds that, if less than half of each genome aligns/is homologous, then how can we say that the two genomes are "the same". I find that this simplifies interpretation.

Now I've begun preparing a paper with a rigorous mathematically-supported genome similarity definition of taxonomic boundaries, it's a deeper topic than I thought when I started…

"how can one know what is homologous without first knowing the percent identity?"

The technical definition of "homologous" is "sharing a common ancestor" but, as you realise, that's not very helpful - you still need a way to identify that they share a common ancestor ;) In practice we call sequences "homologous" when they align. This is on the grounds that if the sequences did not share a recognisable common ancestor, we would not be able to align them. On the face of it, that's sensible, but depending on how you align the sequences you might require different criteria to be satisfied - including a certain level of sequence identity. So, as you'd identified, our definitions might sometimes be circular… ;)

The situation's not as bad as all that, really. We have a "ground truth" for relatedness in that we have good taxonomies based on morphology and phenotype, and we can check our results against those to establish that our sequence-based classifications are reasonable, in general. When we know they're reasonable for organisms for which we have alternative morphological/phenotypic approaches, we can have confidence that our classifications for novel sequences are likely to be reliable, too.

elzerac commented 5 years ago

Thank you so much for these explanations. It makes a lot more sense now! I look forward to reading your next paper.

widdowquinn commented 5 years ago

Glad to be of help.