kiepczi opened 9 months ago
The additional columns in the `run_results` output table should now be available under the branch `issue_421`.

When running a pyANI analysis on 2 genomes, it will perform 2 comparisons (forward and reverse), resulting in the following changes to the `run_results` output table:

- the number of rows will double (one for the reverse comparison and one for the forward comparison);
- one column for `% query identity` and one for `% subject identity`;
- one column for `% query coverage` and one for `% subject coverage`;
- one column for `query alignment length` and one for `subject alignment length`.

Rerunning the analysis under the `issue_421` branch, the following output was generated:
```text
  Comparison ID  Query ID  Query description  Subject ID  Subject description  % query identity    % subject identity  % query coverage  % subject coverage  query alignment length  subject alignment length  similarity errors  program  version                                                       fragment size  maxmatch  Run ID
0 1              2         genome_B           1           genome_A             0.9795918367346939  0.98                1.0               1.0                 98                      100                       2                  nucmer   Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)                  False     1
1 2              1         genome_A           2           genome_B             0.98                0.9795918367346939  1.0               1.0                 100                     98                        2                  nucmer   Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer)                  False     1
```
The code, input, and output data are provided in aln_length_issue.zip.
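As a quick sanity check on the table above, the per-direction identities in the two rows follow directly from identity = 1 - (similarity errors / alignment length). A minimal sketch (the function name is my own; the numbers come from the example output above):

```python
def directional_identity(aln_length, sim_errs):
    """Identity (as a fraction) for one direction of a comparison,
    given that direction's alignment length and similarity errors."""
    return 1 - sim_errs / aln_length

# genome_B vs genome_A: 98 aligned query bases, 100 aligned subject bases, 2 errors
query_identity = directional_identity(98, 2)     # genome_B as query: 96/98
subject_identity = directional_identity(100, 2)  # genome_A as subject: 98/100
print(query_identity, subject_identity)
```

These reproduce the `0.9795918367346939` and `0.98` values in the table.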
As we are now running 2 comparisons instead of just one (both forward and reverse), we also had to update the matrices. However, since this was initially explained in more detail in issue #151, I have discussed it further there.
Looking through the changes, I see that the calculation looks sensible, and the changes to reporting etc. seem OK - good job! There are a few snags, some of which are picked up when running `pytest` locally. The main points are, I think…
- We're now handling database entries for a comparison differently. In the previous code we stored `query_identity` and `subject_identity`:

  ```python
  run.comparisons.append(
      Comparison(
          query=job.query,
          subject=job.subject,
          query_aln_length=qaln_length,
          subject_aln_length=saln_length,
          sim_errs=sim_errs,
          query_identity=query_pid,
          subject_identity=subject_pid,
          cov_query=qcov,
          cov_subject=scov,
          program="nucmer",
  ```

  In the new code we're storing a single value, `perc_id`:
  ```python
  run.comparisons.append(
      Comparison(
          query=job.query,
          subject=job.subject,
          query_aln_length=qaln_length,
          subject_aln_length=saln_length,
          sim_errs=sim_errs,
          perc_id=pc_id,
          cov_query=qcov,
          cov_subject=scov,
          program="nucmer",
  ```

  but the database schema and `pyani_orm.py` code have not been updated. What are the plans for this?
- We need more tests in `tests/` to be sure that the new percentage identity code works as expected/intended. There are other side-effects (e.g. presence/absence of the `sym` argument, other keyword incompatibilities) flagged by `pytest` that need attention. I'll try to address a couple of these just now, and we'll talk more at the Monday meeting.
We're getting quite a bit of discrepancy in the concordance test for ANIm:
```text
>       assert result_pid - tgt_pid == pytest.approx(0, abs=tolerance_anim)
E       assert array([[0.0, ... dtype=object) == 0 ± 1.0e-01
E         comparison failed
E         Obtained: [[0.0 -0.27337324666439144 0.05531150545986918]
E          [-0.29824688924479403 0.0 -1.0995470765852389]
E          [0.05778732127045316 -1.073189551092085 0.0]]
E         Expected: 0 ± 1.0e-01

tests/test_concordance.py:203: AssertionError
```
We'll need to look into this to see what's going on.
I've modified the expected output in `test_anim_delta()` so that tests pass, assuming that the calculations are correct. We should go through the calculation for that test file manually and confirm the result, logging here when we've done so.
UPDATE: this induces a new error in `test_deltadir_parsing()` - I'm looking into it. The expected/returned matrices are quite different. We'll need to confirm which is correct.
```text
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696  1.0        0.760899   0.999974   0.693209
NC_010338  1.0        1.000000   0.760913   0.700076
NC_011916  1.0        1.000000   1.000000   0.693673
NC_014100  1.0        1.000000   1.000000   1.000000
```

```text
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696  1.000000   0.852072   0.999974   0.868745
NC_010338  0.852072   1.000000   0.852027   0.853775
NC_011916  0.999974   0.852027   1.000000   0.868795
NC_014100  0.868745   0.853775   0.868795   1.000000
```
The test results matrix is on top (expected values are underneath). The lower triangle there is all `1.0` as that's how the dataframe is initialised and - under the assumption of a symmetrical comparison - we didn't populate that half of the matrix. Correcting that should be straightforward.

The other values are more worrying. They agree in only one case, and we'll need to investigate why that is.
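For illustration, populating both triangles of the matrix explicitly (rather than relying on the initialisation value) might look like this minimal sketch - the genome labels and identity values here are invented, not taken from the run:

```python
import numpy as np
import pandas as pd

labels = ["NC_A", "NC_B", "NC_C"]  # hypothetical genomes
# one entry per directed comparison: (query, subject) -> identity
results = {
    ("NC_A", "NC_B"): 0.97, ("NC_B", "NC_A"): 0.96,
    ("NC_A", "NC_C"): 0.90, ("NC_C", "NC_A"): 0.91,
    ("NC_B", "NC_C"): 0.85, ("NC_C", "NC_B"): 0.86,
}

# initialise with 1.0 only on the diagonal, then fill every directed cell
matrix = pd.DataFrame(np.eye(len(labels)), index=labels, columns=labels)
for (query, subject), identity in results.items():
    matrix.loc[query, subject] = identity  # no symmetry assumed
```

With both A-vs-B and B-vs-A comparisons stored, neither triangle is left at its initialisation value.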
> We're now handling database entries for a comparison differently. In the previous code we stored `query_identity` and `subject_identity`: […] In the new code we're storing a single value, `perc_id`: […]
These changes were made due to our conversation last week about how we want to calculate the percentage identity. Our previous assumption was that the percentage identity should be calculated separately for the query and subject, as follows:

```text
Percentage identity = 1 - (similarity errors / alignment length)
```
However, we later agreed that the best option would be to calculate the average across all alignments. This would be calculated as follows (assuming that there are two alignments between the genomes we are comparing):
```text
Alignment 1 identity weighted = (query length alignment 1 + subject length alignment 1) - (2 × similarity errors)
Alignment 2 identity weighted = (query length alignment 2 + subject length alignment 2) - (2 × similarity errors)
Percentage identity = (alignment 1 identity weighted + alignment 2 identity weighted) /
                      (query length alignment 1 + subject length alignment 1 + query length alignment 2 + subject length alignment 2)
```
With this, I assumed that there would only be one calculation for percentage identity, hence just one column, rather than two.
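A minimal sketch of the agreed calculation (the function name and tuple layout are my own for illustration, not pyani's API):

```python
def weighted_percentage_identity(alignments):
    """Average %ID across alignments, weighting each by its combined length.

    `alignments` is an iterable of (query_len, subject_len, sim_errs) tuples,
    following the scheme agreed above.
    """
    weighted = sum((qlen + slen) - 2 * errs for qlen, slen, errs in alignments)
    total = sum(qlen + slen for qlen, slen, _ in alignments)
    return weighted / total * 100

# two hypothetical alignments between a pair of genomes
print(weighted_percentage_identity([(1000, 990, 10), (500, 505, 5)]))
```

For a single alignment this reduces to the earlier per-direction formula applied to the combined lengths.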
> However, we later agreed that the best option would be to calculate the average across all alignments. This would be calculated as follows (assuming that there are two alignments between the genomes we are comparing): […] With this, I assumed that there would only be one calculation for percentage identity, hence just one column, rather than two.
Yes - the point here is rather now that the call to create a `Comparison()` object does not match what is defined in the `pyani_orm.py` file, so we need to consider the way these changes to the code affect (i) how the program behaves and (ii) how we should - or should not - modify the database schema to accommodate that.
> […]
>
> The test results matrix is on top (expected values are underneath). The lower triangle there is all `1.0` as that's how the dataframe is initialised and - under the assumption of a symmetrical comparison - we didn't populate that half of the matrix. Correcting that should be straightforward.
I'm looking into this just now, and running this on my local machine I get the following `matrix_identity`:

```text
             NC_002696:1         NC_010338:2   NC_011916:3         NC_014100:4
NC_002696:1  1.0                 0.7608992767  0.9999742381000001  0.6932093551
NC_010338:2  0.760899887         1.0           0.7609126078        0.7000758335
NC_011916:3  0.9999742381000001  0.7609120054  1.0                 0.6936734489
NC_014100:4  0.6934934418        0.6997481289  0.6939592582        1.0
```
Am I right to assume that you have run these comparisons under branch `issue_421`?
I've run `dnadiff` locally on the NC_011916.fna vs NC_014100.fna comparison `.filter` file. This returns:
```text
                  [REF]            [QRY]
[Sequences]
TotalSeqs             1                1
AlignedSeqs  1(100.00%)       1(100.00%)
UnalignedSeqs  0(0.00%)         0(0.00%)

[Bases]
TotalBases      4042929          4655622
AlignedBases    2259662(55.89%)  2261394(48.57%)
UnalignedBases  1783267(44.11%)  2394228(51.43%)

[Alignments]
1-to-1             1189             1189
TotalLength     2263684          2264006
AvgLength       1903.86          1904.13
AvgIdentity       86.88            86.88

M-to-M             1189             1189
TotalLength     2263684          2264006
AvgLength       1903.86          1904.13
AvgIdentity       86.88            86.88
```
> Am I right to assume that you have run these comparisons under branch `issue_421`?
Yes - which branch should I be looking at?
> Yes - which branch should I be looking at?
Branch `issue_421`.
So we're running under the same branch but getting different results?
In both cases the calculated identities do not correspond to the expected identity, or to applying `dnadiff` to the same `.delta` format files. This needs to be addressed.
Considering our discussion yesterday @kiepczi - I've updated the reference matrix for ANIm identity consistency tests in 97e5570. This now reflects (i) the new method for calculating identity, and (ii) running A vs B and B vs A (i.e. nonsymmetry).
A couple of things worth noting:

- The values still do not exactly match `dnadiff` from MUMmer, even though we have adopted their approach almost exactly. We might like to assume that the remaining difference arises because `dnadiff` sums precalculated percentages for each alignment fragment in an `.mcoords` file, rounded to two decimal places (and so is subject to accumulation of rounding errors), whereas we maintain a single `float` while processing the `.delta` file directly (a more exact calculation). Even so, confirming that by running a similar calculation where we do the rounding, and checking against the `dnadiff` output, might be a good idea.

```text
(pyani_py311) lpritc@Rodan pyani % pytest -v tests/test_anim.py::test_deltadir_parsing                    [8:47:35]
=============================================================== test session starts ===============================================================
platform darwin -- Python 3.11.7, pytest-7.4.3, pluggy-1.3.0 -- /Users/lpritc/opt/anaconda3/envs/pyani_py311/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/lpritc/Documents/Development/GitHub/pyani
configfile: pytest.ini
plugins: cov-4.1.0, anyio-4.2.0
collected 1 item

tests/test_anim.py::test_deltadir_parsing PASSED                                                                                            [100%]

================================================================ 1 passed in 1.30s ================================================================
```
I have updated more tests and fixtures in `test_anim.py` to reflect the changes made in the codebase (see da8852c). This included:

i) Adding more commands to the `mummer_cmds_four` fixture. We now run pairwise comparisons in both directions (forward and reverse). This adjustment allowed us to pass the `test_mummer_multiple` and `test_mummer_job_generation` tests successfully.
```text
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_mummer_multiple
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item

tests/test_anim.py::test_mummer_multiple PASSED                                                                                                     [100%]

==================================================================== 1 passed in 0.12s ====================================================================
```

```text
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_mummer_job_generation
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item

tests/test_anim.py::test_mummer_job_generation PASSED                                                                                               [100%]

==================================================================== 1 passed in 0.13s ====================================================================
```
ii) Modification of the `deltafile_parsed` fixture. We now expect different values to be returned by the `parse_delta` function. This change is due to adjustments made in the codebase: we no longer double-count overlapping regions, we calculate a weighted ANIm %ID, and we return additional items such as query alignment length, subject alignment length, weighted %ID, and similarity errors. This allowed the `test_deltafile_parsing` test to pass.
```text
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_deltafile_parsing
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item

tests/test_anim.py::test_deltafile_parsing PASSED                                                                                                   [100%]

==================================================================== 1 passed in 0.13s ====================================================================
```
Although the majority of tests pass, three are currently failing:
```text
================================================================= short test summary info =================================================================
FAILED tests/test_anim.py::test_genome_sorting - AssertionError: assert ('nucmer --mu...first.filter') == ('nucmer --mu...econd.filter')
FAILED tests/test_parsing.py::test_anim_delta - AssertionError: assert (4016947, 401...7031752, 2191) == (4016947, 401...4447228, 2191)
FAILED tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani - TypeError: 'aln_length' is an invalid keyword argument for Comparison
======================================= 3 failed, 115 passed, 1 xfailed, 1 xpassed, 3 warnings in 389.17s (0:06:29) =======================================
```
Here are my current thoughts on them:
1. The `tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani` test fails because there is no longer an `aln_length` keyword. This was expected, as we modified the codebase to run pairwise comparisons in both directions (forward and reverse). Because of this, we decided to change how the alignment lengths are calculated and reported: the `aln_length` keyword in the `Comparison` class was replaced with `query_aln_length` and `subject_aln_length`. I would need to investigate the `fastani` approach a little more to come up with the best solution for this.
2. The `tests/test_anim.py::test_genome_sorting` test is now failing because, in modifying the codebase to run both comparisons (forward and reverse), I did not take sorting into consideration. I think restoring sorting would be beneficial, as it would help us obtain consistent and reproducible results for future testing and easier debugging.
3. The `tests/test_parsing.py::test_anim_delta` test fails because of the rounding error which, as @widdowquinn suggested, should be investigated a little more.
@widdowquinn, if you have any ideas or comments regarding this, let me know.
> 2. The `tests/test_anim.py::test_genome_sorting` test is now failing because, in modifying the codebase to run both comparisons (forward and reverse), I did not take sorting into consideration. I think restoring sorting would be beneficial, as it would help us obtain consistent and reproducible results for future testing and easier debugging.
It would be worth looking back through the commit logs to be sure, but my memory is that sorting was introduced because…

- we split `nucmer_output` into subdirectories, for easier file management (a good thing)
- each `anim` comparison carried out one pairwise comparison, which may have gone into `A/A_vs_B` or `B/B_vs_A` - somewhat unpredictably.

Now, with `anim` doing both A vs B and B vs A, I don't think we need to sort any more. The last commit I made above skips a sorting test in `test_anim.py` - you might like to use that as a template for how to handle the `test_genome_sorting()` test, here.
> 3. The `tests/test_parsing.py::test_anim_delta` test fails because of the rounding error which, as @widdowquinn suggested, should be investigated a little more.
Thing 1 is to satisfy ourselves that that's the origin of the discrepancy ;) If it is, then the job of implementing that part of `pyani dnadiff` is slightly easier than we at first thought…
> 1. The `tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani` test fails because there is no longer an `aln_length` keyword. This was expected, as we modified the codebase to run pairwise comparisons in both directions (forward and reverse). Because of this, we decided to change how the alignment lengths are calculated and reported: the `aln_length` keyword in the `Comparison` class was replaced with `query_aln_length` and `subject_aln_length`. I would need to investigate the `fastani` approach a little more to come up with the best solution for this.
One solution here is - essentially - polymorphism. We would declare new classes that describe output from distinct tools (e.g. `DnadiffResult`, `NucmerResult`, `FastANIResult`), and have the ORM function process them differently. That's probably the most elegant and explicit way to do it, rather than trying to overextend the function that's being called. We should talk about that in person and get something down in a document.
> 3. The `tests/test_parsing.py::test_anim_delta` test fails because of the rounding error which, as @widdowquinn suggested, should be investigated a little more.

> Thing 1 is to satisfy ourselves that that's the origin of the discrepancy ;) If it is, then the job of implementing that part of `pyani dnadiff` is slightly easier than we at first thought…
I looked into how `dnadiff.pl` calculates the Average %ID for 1-to-1 alignments.

To begin, it initialises the combined alignment length sum (`$rqSumLen1`) and the weighted alignment identity sum (`$rqSumIdy1`) to 0:

```perl
my ($rqSumLen1, $rqSumLenM) = (0,0);  # combined alignment length sum
my ($rqSumIdy1, $rqSumIdyM) = (0,0);  # weighted alignment identity sum
```
The weighted alignment identities are then accumulated by dividing the average %ID of each alignment by 100 and multiplying it by the sum of the lengths of both the reference and query sequences (note that these values are extracted from the `1coords` file):

```perl
$rqSumIdy1 += ($A[6] / 100.0) * ($A[4] + $A[5]);
```
The alignment length is accumulated as the sum of the lengths of both the reference and query sequences:

```perl
$rqSumLen1 += ($A[4] + $A[5]);
```
The Average Identity is then calculated by dividing the sum of all weighted alignment identities by the sum of all combined alignment lengths, and multiplying that value by 100:

```perl
($rqSumLen1 ? $rqSumIdy1 / $rqSumLen1 * 100.0 : 0);
```
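The Perl logic above can be mirrored in a few lines of Python - a sketch (names are my own), assuming rows of (ref_aln_len, qry_aln_len, avg_identity) as in columns 5-7 of the `1coords` file:

```python
def dnadiff_avg_identity(rows):
    """Replicate dnadiff.pl's 1-to-1 Average Identity from 1coords-style rows.

    Each row is (ref_aln_len, qry_aln_len, avg_pct_identity).
    """
    sum_len = sum(rlen + qlen for rlen, qlen, _ in rows)
    sum_idy = sum((idy / 100.0) * (rlen + qlen) for rlen, qlen, idy in rows)
    return sum_idy / sum_len * 100.0 if sum_len else 0.0
```

Because the per-row identities are already rounded to two decimal places in the `1coords` file, this reproduces `dnadiff`'s rounding behaviour.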
I manually calculated this for a small comparison of two viral genomes, where according to `dnadiff` (1-to-1 alignments), the average %ID is 99.63:

```text
[Alignments]
1-to-1             2         2
TotalLength    59174     59187
AvgLength   29587.00  29593.50
AvgIdentity    99.63     99.63
```
The `1coords` file:

```text
85 37713 1 37636 37629 37636 99.43 39253 39594 95.86 95.05 MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457
17709 39253 17626 39176 21545 21551 99.97 39253 39594 54.89 54.43 MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457
```
Then, the calculations I get are:

```text
alignment 1 weighted = (99.43 / 100) × (37629 + 37636) = 74835.9895
alignment 2 weighted = (99.97 / 100) × (21545 + 21551) = 43083.0712
Average %ID = (74835.9895 + 43083.0712) / (37629 + 37636 + 21545 + 21551) × 100 = 99.6266174669021 ≈ 99.63
```
I attempted to replicate this value using the `.delta` files instead but, as we know, these were slightly different, usually at the second decimal place. We discussed earlier that this could be due to rounding error coming from the %ID of each individual alignment given in the `1coords` file.
So, let's calculate these values without rounding the individual alignment %IDs, using the `.delta` file directly. The file is as follows:
```text
/Users/angelikakiepas/Desktop/pyani/issue_421/rounding_error/scripts/../data/donovan_test/input/MGV-GENOME-0264574.fna /Users/angelikakiepas/Desktop/pyani/issue_421/rounding_error/scripts/../data/donovan_test/input/MGV-GENOME-0266457.fna
NUCMER
>MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457 39253 39594
85 37713 1 37636 215 215 0
-3013
-24624
-1
-1
-1
-1
-1
0
17709 39253 17626 39176 7 7 0
-9994
-1
-1
-1
-1
-1
0
```
For the first alignment:

```text
aligned reference bases = 37713 - 85 + 1 = 37629
aligned query bases = 37636 - 1 + 1 = 37636
similarity errors = 215
percentage id = ((37629 + 37636) - (2 × 215)) / (37629 + 37636) × 100 = 99.42868531189795
alignment 1 weighted = (99.42868531189795 / 100) × (37629 + 37636) = 74834.99999999999
```

For the second alignment:

```text
aligned reference bases = 39253 - 17709 + 1 = 21545
aligned query bases = 39176 - 17626 + 1 = 21551
similarity errors = 7
percentage id = ((21545 + 21551) - (2 × 7)) / (21545 + 21551) × 100 = 99.96751438648599
alignment 2 weighted = (99.96751438648599 / 100) × (21545 + 21551) = 43082.0
```

Overall:

```text
Average %ID = (74834.99999999999 + 43082.0) / (37629 + 37636 + 21545 + 21551) × 100 = 99.62487643733999
```
@widdowquinn suggested that there could be a quicker way to calculate this, skipping the intermediate calculation of identity, as follows:

```text
alignment 1 identity weighted = (37629 + 37636) - (2 × 215) = 74835
alignment 2 identity weighted = (21545 + 21551) - (2 × 7) = 43082
Average %ID = (74835 + 43082) / (37629 + 37636 + 21545 + 21551) × 100 = 99.62487643734
```
This example demonstrates that the differences in the reported numbers could arise from a rounding error, and it appears that our calculations would be more accurate.
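To make the rounding effect concrete, here is a sketch computing both versions for the two alignments above, from (ref_len, qry_len, sim_errs) tuples taken from the `.delta` file:

```python
alns = [(37629, 37636, 215), (21545, 21551, 7)]
total_len = sum(r + q for r, q, _ in alns)

# dnadiff-style: round each per-alignment %ID to 2 dp before weighting
rounded = sum(
    round(((r + q) - 2 * e) / (r + q) * 100, 2) / 100 * (r + q) for r, q, e in alns
) / total_len * 100

# pyani-style: accumulate exact weighted identities, no intermediate rounding
exact = sum((r + q) - 2 * e for r, q, e in alns) / total_len * 100

print(f"rounded: {rounded:.4f}, exact: {exact:.4f}")  # rounded: 99.6266, exact: 99.6249
```

The two results differ at the second decimal place, exactly the discrepancy seen between `dnadiff` and the `.delta`-based calculation.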
I have written a Python script that calculates these values using these approaches (available here). I ran it on a few examples and found that in some cases, even when we calculate the Average %ID with the individual alignment %IDs rounded, the values differ from the ones reported by `dnadiff`.
When the values are calculated in Python, I obtain the following results:

- […] when replicating `dnadiff` (percentage identities for each found alignment rounded to the 2nd decimal place);
- […] when replicating `dnadiff`, but with no per-alignment %ID rounded to the 2nd decimal place.

The average %ID reported by `dnadiff` is 85.35%, so the difference is quite significant:

```text
[Alignments]
1-to-1          1455     1455
TotalLength  1679781  1679569
AvgLength    1154.49  1154.34
AvgIdentity    85.35    85.35
```
I checked whether the sequence percentage identities calculated for each alignment match the ones reported in the `1coords` file. To my surprise, I found that the majority of these do not match.
One example where the percentage identity for the alignment does not match is:

```text
1486 3045 124165 125722 1560 1558 88.49 16110 7787608 9.68 0.02 NZ_NOWW01000027.1 NZ_CP026730.1
```
Using the above information, I found the alignment in the `.delta` file:

```text
>NZ_NOWW01000027.1 NZ_CP026730.1 16110 7787608
1486 3045 124165 125722 180 180 0
8
-53
493
-4
743
-18
3
123
-6
33
0
```
When calculating the percentage identity manually, I found that these do not match. In this example, the difference is almost 0.035%. As this is happening in most of the alignments in this comparison, the differences add up to the point that the overall Average %ID differs significantly.
```text
reference alignment length = 3045 - 1486 + 1 = 1560
query alignment length = 125722 - 124165 + 1 = 1558
percentage ID = ((1560 + 1558) - (2 × 180)) / (1560 + 1558) × 100 = 88.45413726747915
```
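The manual calculation above can be expressed as a small helper - a sketch (not pyani's actual parser); the field order follows the `.delta` alignment header (ref start, ref end, query start, query end, sim errors):

```python
def delta_header_identity(rstart, rend, qstart, qend, sim_errs):
    """Naive %ID for one alignment, read straight from its .delta header."""
    rlen = rend - rstart + 1
    qlen = qend - qstart + 1
    return ((rlen + qlen) - 2 * sim_errs) / (rlen + qlen) * 100

print(delta_header_identity(1486, 3045, 124165, 125722, 180))
# cf. the 88.49 reported by show-coords for this alignment
```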
I should probably mention that all code and data used for the investigation can be found here.
After discussion between @kiepczi and me, and inspecting the code of `show-coords`, it appears that - in some cases at least - the graph model internal to `show-coords` leads to a different measure of sequence identity than would be obtained from our (naive) reading of the `.delta` files.

We cannot then guarantee that our `pyani`-calculated percentage identities for any particular region exactly match those from `dnadiff`, or that they will match within simple rounding error (although in many cases they actually do).
At this point, my view is that the `pyani` approach of considering all alignments in the `.delta` file, and the `dnadiff` approach of (apparently) constructing a graph model from those alignments, are not straightforward to reconcile.

The least time-consuming, and probably most useful, way forward is, I think, this:

- adopt the `pyani` calculation method as the standard method for `pyani anim`
- implement a `pyani dnadiff` method that generates the `show-coords` and `show-diff` outputs using `MUMmer`'s tools, allowing us to recreate the key parts of the `dnadiff` output
- use `pyani compare` to make a thorough comparison between the operation of both methods for the resulting publication

Adding the `pyani dnadiff` command should be reserved for `pyani plus`, as per #424.
> One solution here is - essentially - polymorphism. We would declare new classes that describe output from distinct tools (e.g. `DnadiffResult`, `NucmerResult`, `FastANIResult`), and have the ORM function process them differently. That's probably the most elegant and explicit way to do it, rather than try to overextend the function that's being called. We should talk about that in person and get something down in a document.
To expand on this… the error below

```text
FAILED tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani - TypeError: 'aln_length' is an invalid keyword argument for Comparison
```
arises because the keywords provided from a `fastANI` comparison are no longer the same as for an `anim` comparison. Commit 693d0c2 changed the requirement from `aln_length` to `query_aln_length` and `subject_aln_length`, though it is not clear that the underlying database schema changed (we need to check this and update to accommodate it if needed!).

If we require only one of those two lengths for the ANI result, then we should prefer one over the other, and we ought to be able to provide it to the `Comparison` object through a consistent interface. For example…
```python
class ResultfastANI:
    def __init__(self, aln_length):
        # total alignment length, as only one length makes sense for fastANI
        self.aln_length = aln_length

    @property
    def mylen(self):
        return self.aln_length


class ResultNucmer:
    def __init__(self, query_aln_length, ref_aln_length):
        self.query_aln_length = query_aln_length
        self.ref_aln_length = ref_aln_length

    @property
    def mylen(self):
        # because we are prioritising the _query_ alignment length by choice
        return self.query_aln_length
```
But, if we require two lengths, then we will have to change the schema/reporting/ORM to accommodate this. Which do we think we want?

If we only require one alignment length per comparison for our reporting purposes, then the first solution is sufficient. Having both results from a tool like `nucmer`, which reports query and reference alignment lengths (perhaps differently for A vs B and B vs A comparisons), might be interesting to some, but changing the internals is a large task, and would be better postponed to `pyani-plus`, where it can be accommodated in a schema redesign.
Summary:
When calling the report for specific runs with the `--run_results` argument, a single value is given in the alignment length column. However, the lengths of the alignments for the subject and query might be different.
Description:

When two genomes are aligned with `nucmer`, the length of the query alignment and subject alignment might differ. For example, we have two genomes: genome A with 100 bases and genome B with 98 bases. We compare them with pyANI anim and, when we inspect the `.filter` files, we see that there was a deletion of the 11th and 16th bases in genome B.

However, when calling `pyani report` with the `--run_results` argument, the `.tab` file reported by pyANI contains a single column for alignment length, which in this case seems to report the length of the subject alignment.
To fix this issue, I propose that the above `.tab` output is modified to provide two columns for alignment lengths, e.g. `subject alignment length` and `query alignment length`.

NOTE: We are also aware that nucmer is not symmetrical (see #151). Therefore, we have decided to run two comparisons (reverse and forward), which we are currently working on. Since these issues are related, the output for matrices and reports will need to change to reflect this. Some work has already been done on this under the branch `issue_421`, and I will continue working on this issue there as well.
Reproducible Steps:
All inputs, outputs, and code are provided here: aln_length_issue.zip.
Current Output:
Expected Output:
pyani Version:
pyani v0.3