Add average amino acid identity (AAI)

ninjatacoshell commented 8 years ago

An alternative to ANI for more distantly related genomes is average amino acid identity (AAI; see Konstantinidis and Tiedje 2005 and Rodrigues and Konstantinidis 2014). Instead of DNA FASTA files the user would need to supply protein FASTA files.

This web tool only lets you calculate AAI for two genomes at a time.

This web tool lets you calculate AAI for multiple genomes, but only the ones that are stored in its database (i.e. no user-generated genomes). And the database doesn't appear to have been updated since around 2012.

This web tool lets you calculate AAI for up to 10 genomes at a time, but you have to run them through RAST first, which is inconvenient.

So being able to run AAI on your own machine, like pyani already does for ANI and tetranucleotide regression, would be very useful.

widdowquinn commented 8 years ago

I like the idea, but I'm inclined to leave this to a later version of pyani that integrates with pyrbbh.

For AAI we need to define equivalent proteins for comparison. That's something which can be done in several ways, and I'd like to hand that method choice off to the user's preference. I'm not sure that there's a data standard for specifying such equivalence for pairs of proteins, or for whole groups - I'll have to put some time into looking around for one (suggestions welcome) or devising one that works here.

ninjatacoshell commented 8 years ago

In the original paper, Konstantinidis and Tiedje (2005) calculated AAI by searching all protein-coding sequences from the query genome against the reference genome using TBLASTN, with cut-offs of at least 30% identity and at least 70% coverage. They called this one-way BLAST. They then took the top matching segment and performed the reverse search using BLASTX (presumably with the same cut-offs), which they called two-way BLAST. In their analysis the two-way BLAST was slightly more reliable.
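To make the cut-offs concrete, here is a minimal sketch of the one-way filtering step in Python. It is not taken from the paper or from pyani: the `Hit` record, the `best_hits` function, and the choice to measure coverage as the aligned fraction of the query and to rank hits by bit score are all assumptions made for illustration.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One alignment row (hypothetical record, e.g. parsed from tabular BLAST output)."""
    query: str       # query sequence ID
    subject: str     # subject (reference) sequence ID
    identity: float  # percent identity of the alignment
    aln_len: int     # alignment length, in residues
    bitscore: float  # used here to rank competing hits


def best_hits(hits: Iterable[Hit], query_lengths: Dict[str, int],
              min_identity: float = 30.0,
              min_coverage: float = 70.0) -> Dict[str, Hit]:
    """Return the top-scoring hit per query that passes both cut-offs."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        # Assumption: coverage = aligned fraction of the query, as a percentage
        coverage = 100.0 * hit.aln_len / query_lengths[hit.query]
        if hit.identity < min_identity or coverage < min_coverage:
            continue
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


# Toy data: only the first hit passes both cut-offs
hits = [
    Hit("q1", "s1", 85.0, 90, 180.0),
    Hit("q1", "s2", 95.0, 40, 90.0),   # fails coverage (40% of the query)
    Hit("q2", "s3", 25.0, 100, 50.0),  # fails identity (25%)
]
lengths = {"q1": 100, "q2": 100}
result = best_hits(hits, lengths)
```

Running the same function on the reverse search would give the second half of a two-way comparison.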

How would BBH compare to their two-way BLAST in terms of computation time? And would it be invulnerable to inconsistencies in the annotation between different genomes the way their two-way BLAST is?

widdowquinn commented 8 years ago

The method from Konstantinidis and Tiedje is one of several ways to define 'equivalent proteins/CDS'. It happens to be one that doesn't require a prior protein annotation on the 'reference', but it does require one on the query.

The two-way BLAST search is likely to be more reliable than the one-way analysis for the same reasons RBH/BBH matches are more reliable than one-way BLAST matches, in general (as described in, e.g. https://github.com/widdowquinn/Teaching-Dundee-BS32010/blob/master/workshop_2/06-RBBH.ipynb and https://github.com/widdowquinn/Teaching-Dundee-BS32010/blob/master/lecture/2016-03-21_BS32010_Pritchard.pdf).
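As a rough illustration of the RBH idea, the sketch below keeps only pairs where each sequence is the other's best hit, then averages their identities to give an AAI value. The function names and the `BestHits` layout are hypothetical, and averaging the two directions' identities for each pair is an assumption, not something prescribed by the K&T paper or by pyani.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Best hit per sequence: ID -> (best-matching ID in the other genome, percent identity)
BestHits = Dict[str, Tuple[str, float]]
Pair = Tuple[str, str, float]


def reciprocal_best_hits(a_vs_b: BestHits, b_vs_a: BestHits) -> List[Pair]:
    """Keep only pairs where A's best hit in B also has A as its best hit."""
    pairs: List[Pair] = []
    for a_id, (b_id, identity_ab) in a_vs_b.items():
        back = b_vs_a.get(b_id)
        if back is not None and back[0] == a_id:
            # Assumption: report the mean identity of the two directions
            pairs.append((a_id, b_id, (identity_ab + back[1]) / 2))
    return pairs


def aai(pairs: Iterable[Pair]) -> float:
    """Average identity over all reciprocal pairs (raises StatisticsError if none)."""
    return mean(identity for _, _, identity in pairs)


# Toy data: a3 also hits b1, but b1's best hit is a1, so a3 is dropped
a_vs_b = {"a1": ("b1", 80.0), "a2": ("b2", 60.0), "a3": ("b1", 50.0)}
b_vs_a = {"b1": ("a1", 82.0), "b2": ("a2", 64.0)}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(aai(pairs))  # 71.5
```

The one-way match from a3 to b1 is discarded because it is not confirmed in the reverse direction, which is the sense in which reciprocal matches are the more conservative choice.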

In terms of differences in computation time, I don't know off-hand how it would work out. I'd expect reciprocal BLASTP of a query protein complement against a protein database of the reference protein complement to be faster than BLASTX of the query against an untranslated genome, but I wouldn't be upset if that weren't true ;) As for inconsistencies in annotation: given that the K&T method already relies on one protein annotation, I wouldn't consider it invulnerable to "annotation inconsistency". You could try two-way TBLASTX if you want to ignore annotation altogether (but although you're then invulnerable to annotation inconsistency, you also do not gain any of its many advantages).
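For anyone who wants to time the reciprocal BLASTP route, here is a minimal sketch that only builds the BLAST+ command lines for both directions. It assumes `makeblastdb` and `blastp` from NCBI BLAST+ are on the PATH; the function name, file names, and output layout are hypothetical, and nothing here is part of pyani.

```python
from typing import List


def reciprocal_blastp_commands(query_faa: str, ref_faa: str,
                               outdir: str = "rbh") -> List[List[str]]:
    """Build (but do not run) the commands for a two-way BLASTP comparison."""
    cmds: List[List[str]] = []
    # One protein database per genome
    for name, fasta in (("query", query_faa), ("ref", ref_faa)):
        cmds.append(["makeblastdb", "-in", fasta, "-dbtype", "prot",
                     "-out", f"{outdir}/{name}_db"])
    # Forward and reverse searches, written as tabular output (-outfmt 6)
    cmds.append(["blastp", "-query", query_faa, "-db", f"{outdir}/ref_db",
                 "-outfmt", "6", "-out", f"{outdir}/query_vs_ref.tsv"])
    cmds.append(["blastp", "-query", ref_faa, "-db", f"{outdir}/query_db",
                 "-outfmt", "6", "-out", f"{outdir}/ref_vs_query.tsv"])
    return cmds


commands = reciprocal_blastp_commands("genomeA.faa", "genomeB.faa")
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists; wrapping those calls in a timer would give a direct comparison against the TBLASTN/BLASTX route.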

ninjatacoshell commented 8 years ago

I don't know if it will help, but they've put their script for calculating AAI (using Ruby) on GitHub: https://github.com/lmrodriguezr/enveomics/blob/master/Scripts/aai.rb. Perhaps it (or part of it) can be rewritten for Python?

sbridel commented 7 years ago

Suggestion: https://github.com/dparks1134/CompareM, which uses Diamond and Prodigal to find equivalent proteins.

The AAI feature would be very nice to have in pyani.

widdowquinn / pyani, issue #16