widdowquinn / pyani

Application and Python module for average nucleotide identity analyses of microbes.
http://widdowquinn.github.io/pyani/
MIT License
197 stars 55 forks source link

Is it possible to add AAI? #185

Closed nascimentofx closed 4 years ago

nascimentofx commented 4 years ago

Dear widdowquinn

is it possible to add AAI to the current program?

Just by changing your genbank_get_genomes_by_taxon i could easily download faa files from assemblies instead of fna, so the AAI script should be straightforward to use.

Thanks

Francisco

widdowquinn commented 4 years ago

Hi Francisco,

Thank you for your interest in pyani.

It's possible, but it still needs to be coded by someone. There are additional steps to AAI over and above genome alignment, such as identifying (putative) orthologues for comparison, and handling paralogy. That is not entirely trivial (see #16). It's on the list of things to do, but it's not my number one priority. I do take pull requests, if you'd like to get involved.

As this is a duplicate of #16, I'm closing this issue.

Cheers,

L.

nascimentofx commented 4 years ago

Thank you very much for your fast response and sorry for the repetition. Couldn't we just perform BLAST (1vs1) of the total proteomes, using an e-value that is strong enough to remove noise (10-50) and then use an average of that value? Then there could be another value such as the number of hits vs non-hits. I known this could be misleading, but there is always an error associated with this kind of analysis, similar to ANI which currently uses a low threshold for calculating identity. But that could be a start.

widdowquinn commented 4 years ago

I don't agree that the proposal is a good way to use BLAST. E-values are dependent not only on the alignment resulting from a match in the database, but also the size of the database (in terms of number of letters) used for the search. It is not appropriate to set a single E-value (or, equivalently, bit score) for all searches - proteomes vary in size between organisms. I wouldn't let that through as a method. A rigorous way to proceed would to be to identify putative orthologues (e.g. with RBBH) and go from there. If you want to remove "noise" in the form of, say, matches of domains rather than full proteins, then you'd need to do something else, like filter RBBHs with %identity and %coverage, rather than E-value. See, for instance, this blog post.

As you're probably aware, there are multiple approaches for calculating ANI - I wouldn't consider any of them to have a low threshold for calculating identity. Homologous regions are identified, and then the average sequence identity of the aligned homologous regions is calculated. The opportunity to parameterise is in the identification of homologous regions; the nature of DNA sequences is that only regions with >70-80% identity are likely to be identified as "homologous." - as you can quickly verify from querying dissimilar genomes against each other with BLAST (or other tool) and looking for the lowest %identity in the tabular output.

Cheers,

L.

nascimentofx commented 4 years ago

Dear Leighton

I believe that this AAI issue is being "overcomplicated"… Maybe what I am thinking is not AAI.

Strains that belong to the same species do share a very high identity in their proteins (see below). So the threshold should be very simple to implement. Overall we want to check the identity between equal proteins of strains (a factual number) and not check if there are putative homologs for each protein sequence in a reference strain that can perform similar functions. That is a different subject. So if a strain that is closely related to other (i.e. same species) has an enzyme, this should be pretty identical, at least >70-80%, as you described for DNA. Similar homolog/paralog enzymes are usually way below this value. Since the blast hit looks for the best hit possible, if you only allow 1 hit, you will get a value that is high (equal protein) or low (similar), the latter being discarded or not using a threshold.

I agree that using e-value alone is limiting, so the thresholds could be better defined by %id and %coverage. See the example below:

I've performed BLAST of two strains proteomes (each CDS vs each CDS). The result is limited to a single hit (the one limited by e-value). 1 query protein =1 hit reference (best hit). For instance, strain 1 (3260 CDS) vs strain 2 (3274 CDS). ANI 1 vs 2= 91.6%

BlastP results= 1 vs 2 = case (1) 2457 hits (evalue -50), case (2) 2437 (e value -50+ 70% cov.), case (3) 2417 (evalue- 50, 70% cov, >40% identity). average protein id/cov%= case 1= 92.89%, 98.95% ; case 2= 93.01%, 99.31; case 3= 96.82%, 99.68%.

BlastP results= 1vs 2= case (4) 2740 hits (evalue -20), case (5) 2663 (e value -20+ >70% cov.), case (6) 2417 (evalue -20, >70% cov, >40% identity). average total protein id/cov%= case 4= 90.64%, 97.96%; case 5= 90.37%, 98.67%; case 6= 92.47%, 99.04%

I would opt for something like a threshold in case 6 (bold). Other measures could be applied (limiting the use of a protein 1 unique time). As you can see the results are pretty straightforward, very high id and coverage levels that correlate well with ANI. Overall this protein id value is higher than ANI, as expected, because nucleotide identities may not reflect amino acid changes in proteins.

These values now could be related to hits vs no hits (total aligned sequences) using the thresholds defined. If we use case 6 scenario, then 3260-2417= 843 unique/not matched = 25,85%. This value would be the equivalent to alignment errors in ANI? You would expect this value to increase in strains that are distantly related.

Maybe this isn’t AAI, but average protein identity?

The computational time to calculate this is very low, so this could be very simple.

On 1 May 2020, at 15:38, Leighton Pritchard notifications@github.com wrote:

I don't agree that the proposal is a good way to use BLAST. E-values are dependent not only on the alignment resulting from a match in the database, but also the size of the database (in terms of number of letters) used for the search. It is not appropriate to set a single E-value (or, equivalently, bit score) for all searches - proteomes vary in size between organisms. I wouldn't let that through as a method. A rigorous way to proceed would to be to identify putative orthologues (e.g. with RBBH) and go from there. If you want to remove "noise" in the form of, say, matches of domains rather than full proteins, then you'd need to do something else, like filter RBBHs with %identity and %coverage, rather than E-value. See, for instance, this blog post https://armchairbiology.blogspot.com/2012/07/on-reciprocal-best-blast-hits.html.

As you're probably aware, there are multiple approaches for calculating ANI - I wouldn't consider any of them to have a low threshold for calculating identity. Homologous regions are identified, and then the average sequence identity of the aligned homologous regions is calculated. The opportunity to parameterise is in the identification of homologous regions; the nature of DNA sequences is that only regions with >70-80% identity are likely to be identified as "homologous." - as you can quickly verify from querying dissimilar genomes against each other with BLAST (or other tool) and looking for the lowest %identity in the tabular output.

Cheers,

L.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/widdowquinn/pyani/issues/185#issuecomment-622414068, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7FLELPIABUMSUCRSOMDOTRPLNF3ANCNFSM4MXDMDVA.

widdowquinn commented 4 years ago

It's as complicated as it needs to be. The premise is simple:

  1. identify things that are equivalent
  2. calculate their sequence identity

But "equivalent" has multiple interpretations. You call them "equal proteins" - but what is your precise, reproducible, and - to be blunt - programmable definition of "equal"? This needs also to be biologically meaningful for the analysis to be worthwhile.

Apologies if this sounds patronising, but you might find this introduction to some of the complexity of identifying whether proteins are "equivalent" to be useful.

nascimentofx commented 4 years ago

Thanks for your help and response.

I admit that it is difficult and subjective to select a value for equal proteins. However, there are several ways of looking at it. You can think of it function wise (the dilema) or you can look at it sequence wise (forget function) and just use a threshold (it could even be arbitrary). Again ANI isn’t perfect either. Everybody knows that not all the genes and regions of a genome evolve in a similar manner, some are cryptic, others regions of 0 biological value, still, all the genome is broken in pieces and used in the analysis. Also thresholds (95% for same species) were created based on few analysis. If you apply 95% ANI in some genera (e.g. Arthrobacter and other Actinobacteria) you will find that in 200 genomes 2 or 3 present > 95% ANI. Actinobacteria are ancient and are evolving for an increased period of time when compared to some Proteobacteria (where most studies are performed) so their distances are increased. Based on the current threshold every Arthrobacter genome represents a novel species... So evolutionary times, mutation rates, overall genome evolution (GC content, codon frequency, HGT, plasmids, transposases, etc... ) are much ignored in ANI. AAI could help improve some of these questions.

Anyway, maybe there is a different way of looking at it. Instead of defining what is equal, maybe we could just define setpoints. Instead of a value of id, there could be a number of proteins matched. For example, a setpoint 95, 80, 70, 50. There would be Blastp analysis and then the use of these thresholds for looking at the number of proteins included.

95% id/cov 2000 of 3000= % hits 80% 2500 of 3000= % 70% 2700 of 3000= % 50% 2800 of 3000=%

The closer the bacteria the more hits above a setpoint there would be. Using this we would escape the protein function/equivalence dilema and we could really find the most significant threshold based on comparisons to ANI values.

Thanks for you patience! Francisco

On 1 May 2020, at 18:40, Leighton Pritchard notifications@github.com wrote:

It's as complicated as it needs to be. The premise is simple:

identify things that are equivalent calculate their sequence identity But "equivalent" has multiple interpretations. You call them "equal proteins" - but what is your precise, reproducible, and - to be blunt - programmable definition of "equal"? This needs also to be biologically meaningful for the analysis to be worthwhile.

Apologies if this sounds patronising, but you might find this introduction useful to some of the complexity of identifying whether proteins are "equivalent."

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2518264/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2518264/ — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/widdowquinn/pyani/issues/185#issuecomment-622486985, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7FLELSOLUY53QGXJ2EEUTRPMCQTANCNFSM4MXDMDVA.

widdowquinn commented 4 years ago

Hi Francisco,

It is true that some genera/species have surprisingly few members that share more than 95% genome identity. I would suggest that you may find value in considering the practical details of how taxonomy has been assigned, historically, when interpreting this.

It is not true that actinobacteria have been evolving for longer than proteobacteria. All bacteria have been evolving for the same period of time, since their most recent common ancestor.

You may enjoy reading about LINBase, and the correspondence of ANI to taxonomic classes in bacteria, especially in the context of discontinuities in similarity scores/identities.

L.

nascimentofx commented 4 years ago

I would suggest that you may find value in considering the practical details of how taxonomy has been assigned, historically, when interpreting this.

Of course, it needs to start somewhere. In this case we started from 16S, which is not that good of a marker… Now people are sticking to it and adapting stuff on it. For instance, I can tell you that in Arthrobacter in many other bacteria the 16S rRNA evolution is not related to overall genome evolution, at all. In Arthro, 16S has an overall GC% of 55% but then you look at genome GC and you get 69%! 14% difference… The you have strains sharing 99.9% 16S id, and ANI 82%… The genetic mechanisms controlling 16S rRNA evolution are obviously not the same than for other genes (including housekeeping). This is in one genus, and are thousands of them...So, those comparisons of 16S id vs ANI are of limited value. It is not true that actinobacteria have been evolving for longer than proteobacteria. All bacteria have been evolving for the same period of time, since their most recent common ancestor.

This is debatable... Also, time is not the sole factor. You have adaptation to niches, which in turn induce/reduce selections and "bend time”, so… For instance, P. aeruginosa has a Genome GC of 68-69%, and you have Pseudomonas fluorescens with 56%… and then you have P. putida with 58%… Obviously, analysis group Pf and Pp closer… Have they all evolved equally in the same time span?...

You may enjoy reading about LINBase https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa190/5813796, and the correspondence of ANI to taxonomic classes in bacteria https://www.nature.com/articles/s41467-018-07641-9, especially in the context of discontinuities in similarity scores/identities.

Thanks!

On 2 May 2020, at 12:21, Leighton Pritchard notifications@github.com wrote:

Hi Francisco,

It is true that some genera/species have surprisingly few members that share more than 95% genome identity. I would suggest that you may find value in considering the practical details of how taxonomy has been assigned, historically, when interpreting this.

It is not true that actinobacteria have been evolving for longer than proteobacteria. All bacteria have been evolving for the same period of time, since their most recent common ancestor.

You may enjoy reading about LINBase https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa190/5813796, and the correspondence of ANI to taxonomic classes in bacteria https://www.nature.com/articles/s41467-018-07641-9, especially in the context of discontinuities in similarity scores/identities.

L.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/widdowquinn/pyani/issues/185#issuecomment-622938093, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7FLELNDHLLQB3RQ7X4WUTRPP62DANCNFSM4MXDMDVA.