widdowquinn / pyani

Application and Python module for average nucleotide identity analyses of microbes.
http://widdowquinn.github.io/pyani/
MIT License

Add separate columns for subject and query alignment lengths in `--run_results` #422

Open kiepczi opened 9 months ago

kiepczi commented 9 months ago

Summary:

When calling the report for specific runs with the --run_results argument, a single value is given in the alignment length column. However, the alignment lengths for the subject and the query may differ.

Description:

When two genomes are aligned with nucmer, the length of the query alignment and subject alignment might differ. For example, we have two genomes, genome A with 100 bases and genome B with 98 bases.

genome A = ATACCGGAGGAGGAGCTACATTGGAGAGGCATGTTGTGAGTTTGACACCAACCTTGGGTCGGGACAGTATTGATGGGTAAGCCGAACTCGGCATCGATGA
genome B = ATACCGGAGGGGAGTACATTGGAGAGGCATGTTGTGAGTTTGACACCAACCTTGGGTCGGGACAGTATTGATGGGTAAGCCGAACTCGGCATCGATGA

We compare them with pyANI anim, and when we inspect the .filter files, we see that there was a deletion of the 11th and 16th bases in genome B.

/Users/angelikakiepas/Desktop/issue_X/input/genome_A.fasta /Users/angelikakiepas/Desktop/issue_X/input/genome_B.fasta
NUCMER
>genome_A genome_B 100 98
1 100 1 98 2 2 0
11
5
0
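For illustration, a minimal sketch (not pyani's own parser) of how the single alignment header line above already yields two different alignment lengths; genome_A is the nucmer reference and genome_B the query in this .delta/.filter file:

```python
# Minimal sketch: unpack the alignment header "1 100 1 98 2 2 0" from the .filter
# file above (genome_A is the nucmer reference, genome_B the query).
ref_start, ref_end, qry_start, qry_end, errors, sim_errors, stops = 1, 100, 1, 98, 2, 2, 0

ref_aln_length = ref_end - ref_start + 1  # 100 aligned bases in genome_A
qry_aln_length = qry_end - qry_start + 1  # 98 aligned bases in genome_B

print(ref_aln_length, qry_aln_length)  # 100 98
```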

However, when calling pyani report with the --run_results argument, the .tab file reported by pyANI contains a single column for alignment length, which in this case seems to report the length of the subject alignment.

|   | Comparison ID | Query ID | Query description | Subject ID | Subject description | % identity | % query coverage | % subject coverage | alignment length | similarity errors | program | version | fragment size | maxmatch | Run ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | genome_A | 2 | genome_B | 0.98 | 1.0 | 1.0 | 100 | 2 | nucmer | Darwin_3.1 (/opt/anaconda3/envs/pyani_v3/bin/nucmer) |  | False | 1 |

To fix this issue, I propose that the above .tab output is modified to provide two columns for alignment lengths, e.g. subject alignment length and query alignment length.

NOTE: We are also aware that nucmer is not symmetrical (see #151). Therefore, we have decided to run two comparisons (reverse and forward), which we are currently working on. Since these issues are related, the output for matrices and reports will need to change to reflect this. Some work has already been done on this under the branch issue_421, and I will continue working on this issue there as well.

Reproducible Steps:

All inputs, outputs, and code are provided here: aln_length_issue.zip.

Current Output:

|   | Comparison ID | Query ID | Query description | Subject ID | Subject description | % identity | % query coverage | % subject coverage | alignment length | similarity errors | program | version | fragment size | maxmatch | Run ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | genome_A | 2 | genome_B | 0.98 | 1.0 | 1.0 | 100 | 2 | nucmer | Darwin_3.1 (/opt/anaconda3/envs/pyani_v3/bin/nucmer) |  | False | 1 |

Expected Output:

|   | Comparison ID | Query ID | Query description | Subject ID | Subject description | % identity | % query coverage | % subject coverage | subject alignment length | query alignment length | similarity errors | program | version | fragment size | maxmatch | Run ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | genome_A | 2 | genome_B | 0.98 | 1.0 | 1.0 | 100 | 98 | 2 | nucmer | Darwin_3.1 (/opt/anaconda3/envs/pyani_v3/bin/nucmer) |  | False | 1 |

pyani Version:

pyani v0.3

kiepczi commented 9 months ago

The additional columns in the run_results output table should now be available under the branch issue_421.

When running the pyANI analysis on two genomes, pyANI now performs two comparisons (forward and reverse), resulting in the following changes:

Rerunning the analysis under the issue_421 branch, the following output was generated:

|   | Comparison ID | Query ID | Query description | Subject ID | Subject description | % query identity | % subject identity | % query coverage | % subject coverage | query alignment length | subject alignment length | similarity errors | program | version | fragment size | maxmatch | Run ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | genome_B | 1 | genome_A | 0.9795918367346939 | 0.98 | 1.0 | 1.0 | 98 | 100 | 2 | nucmer | Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer) |  | False | 1 |
| 1 | 2 | 1 | genome_A | 2 | genome_B | 0.98 | 0.9795918367346939 | 1.0 | 1.0 | 100 | 98 | 2 | nucmer | Darwin_3.1 (/opt/anaconda3/envs/pyani_issue_421/bin/nucmer) |  | False | 1 |

The code, input, and output data are provided in aln_length_issue.zip.

As we are now running two comparisons instead of one (both forward and reverse), we also had to update the matrices. However, since this was initially explained in more detail in issue #151, I have discussed it further there.

widdowquinn commented 8 months ago

Looking through the changes, I see that the calculation looks sensible, and the changes to reporting etc. seem OK - good job! There are a few snags, some of which are being picked up when running pytest locally. The main points are, I think…

  1. We're now handling database entries for a comparison differently. In the previous code we stored query_identity and subject_identity:
        run.comparisons.append(
            Comparison(
                query=job.query,
                subject=job.subject,
                query_aln_length=qaln_length,
                subject_aln_length=saln_length,
                sim_errs=sim_errs,
                query_identity=query_pid,
                subject_identity=subject_pid,
                cov_query=qcov,
                cov_subject=scov,
                program="nucmer",

In the new code we're storing a single value, perc_id:

        run.comparisons.append(
            Comparison(
                query=job.query,
                subject=job.subject,
                query_aln_length=qaln_length,
                subject_aln_length=saln_length,
                sim_errs=sim_errs,
                perc_id=pc_id,
                cov_query=qcov,
                cov_subject=scov,
                program="nucmer",

but the database schema and pyani_orm.py code have not been updated. What are the plans for this?

  2. We should implement more tests under tests/ to be sure that the new percentage identity code works as expected/intended.

There are other side-effects (e.g. presence/absence of the sym argument, other keyword incompatibilities) flagged by pytest that need attention. I'll try to address a couple of these just now, and we'll talk more at the Monday meeting.

widdowquinn commented 8 months ago

We're getting quite a bit of discrepancy in the concordance test for ANIm:

>       assert result_pid - tgt_pid == pytest.approx(0, abs=tolerance_anim)
E       assert array([[0.0, ... dtype=object) == 0 ± 1.0e-01
E         comparison failed
E         Obtained: [[0.0 -0.27337324666439144 0.05531150545986918]\n [-0.29824688924479403 0.0 -1.0995470765852389]\n [0.05778732127045316 -1.073189551092085 0.0]]
E         Expected: 0 ± 1.0e-01

tests/test_concordance.py:203: AssertionError

We'll need to look into this to see what's going on.

widdowquinn commented 8 months ago

I've modified the expected output in test_anim_delta() so that the tests pass, assuming that the calculations are correct. We should go through the calculation for that test file manually, confirm the result, and log here that we've done so.

UPDATE: this induces a new error in test_deltadir_parsing() - I'm looking into it. The expected/returned matrices are quite different. We'll need to confirm what's correct.

           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696        1.0   0.760899   0.999974   0.693209
NC_010338        1.0   1.000000   0.760913   0.700076
NC_011916        1.0   1.000000   1.000000   0.693673
NC_014100        1.0   1.000000   1.000000   1.000000
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696   1.000000   0.852072   0.999974   0.868745
NC_010338   0.852072   1.000000   0.852027   0.853775
NC_011916   0.999974   0.852027   1.000000   0.868795
NC_014100   0.868745   0.853775   0.868795   1.000000

The test results matrix is on top (expected values are underneath). The lower triangle there is all 1.0 as that's how the dataframe is initialised and - under the assumption of a symmetrical comparison - we didn't populate that half of the matrix. Correcting that should be straightforward.

The other values are more worrying. They agree in only one case, and we'll need to investigate what that is.

kiepczi commented 8 months ago
  1. We're now handling database entries for a comparison differently. In the previous code we stored query_identity and subject_identity:
        run.comparisons.append(
            Comparison(
                query=job.query,
                subject=job.subject,
                query_aln_length=qaln_length,
                subject_aln_length=saln_length,
                sim_errs=sim_errs,
                query_identity=query_pid,
                subject_identity=subject_pid,
                cov_query=qcov,
                cov_subject=scov,
                program="nucmer",

In the new code we're storing a single value, perc_id:

        run.comparisons.append(
            Comparison(
                query=job.query,
                subject=job.subject,
                query_aln_length=qaln_length,
                subject_aln_length=saln_length,
                sim_errs=sim_errs,
                perc_id=pc_id,
                cov_query=qcov,
                cov_subject=scov,
                program="nucmer",

These changes were made following our conversation last week about how we want to calculate the percentage identity. Our previous assumption was that the percentage identity should be calculated separately for the query and the subject, as follows:

percentage identity = 1 - (similarity errors / alignment length)

However, we later agreed that the best option would be to calculate the average across all alignments. This would be calculated as follows (assuming that there are two alignments between the genomes we are comparing):

alignment 1 identity weighted = (query length alignment 1 + subject length alignment 1) - (2 × similarity errors)
alignment 2 identity weighted = (query length alignment 2 + subject length alignment 2) - (2 × similarity errors)
percentage identity = (alignment 1 identity weighted + alignment 2 identity weighted) / (query length alignment 1 + subject length alignment 1 + query length alignment 2 + subject length alignment 2)

With this, I assumed that there would only be one calculation for percentage identity, hence just one column, rather than two.
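As a rough sketch of that calculation (a hypothetical helper, not the code on the branch):

```python
# Hypothetical helper illustrating the weighted percentage identity described above;
# each alignment contributes (query length + subject length) - 2 * similarity errors.
def weighted_identity(alignments):
    """alignments: list of (query_aln_length, subject_aln_length, sim_errors) tuples."""
    weighted = sum((qlen + slen) - 2 * errs for qlen, slen, errs in alignments)
    total = sum(qlen + slen for qlen, slen, _ in alignments)
    return weighted / total

# For the single-alignment genome_A vs genome_B example from this issue:
print(weighted_identity([(98, 100, 2)]))  # (98 + 100 - 4) / (98 + 100) ≈ 0.9798
```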

widdowquinn commented 8 months ago

However, we later agreed that the best option would be to calculate the average across all alignments. This would be calculated as follows (assuming that there are two alignments between the genomes we are comparing): […] With this, I assumed that there would only be one calculation for percentage identity, hence just one column, rather than two.

Yes - the point here is rather now that the call to create a Comparison() object does not match what is defined in the pyani_orm.py file, so we need to consider the way these changes to the code affect (i) how the program behaves and (ii) how we should - or should not - modify the database schema to accommodate that.
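For discussion, a hedged sketch (not the actual pyani_orm.py definitions; the table and column names are illustrative assumptions) of what the Comparison model might look like if the schema were extended to match the new keywords:

```python
# Illustrative SQLAlchemy sketch only - not the actual pyani_orm.py definitions.
# Table and column names are assumptions for discussion; the point is the pair of
# alignment-length columns plus a single weighted identity column.
from sqlalchemy import Column, Float, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Comparison(Base):
    __tablename__ = "comparisons"

    comparison_id = Column(Integer, primary_key=True)
    query_id = Column(Integer)            # FK to the genomes table in the real schema
    subject_id = Column(Integer)          # FK to the genomes table in the real schema
    query_aln_length = Column(Integer)    # replaces the single aln_length column
    subject_aln_length = Column(Integer)  # replaces the single aln_length column
    identity = Column(Float)              # single weighted %ID, per the discussion above
    sim_errs = Column(Integer)
    cov_query = Column(Float)
    cov_subject = Column(Float)
    program = Column(String)
```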

kiepczi commented 8 months ago
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696        1.0   0.760899   0.999974   0.693209
NC_010338        1.0   1.000000   0.760913   0.700076
NC_011916        1.0   1.000000   1.000000   0.693673
NC_014100        1.0   1.000000   1.000000   1.000000
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696   1.000000   0.852072   0.999974   0.868745
NC_010338   0.852072   1.000000   0.852027   0.853775
NC_011916   0.999974   0.852027   1.000000   0.868795
NC_014100   0.868745   0.853775   0.868795   1.000000

The test results matrix is on top (expected values are underneath). The lower triangle there is all 1.0 as that's how the dataframe is initialised and - under the assumption of a symmetrical comparison - we didn't populate that half of the matrix. Correcting that should be straightforward.

I'm looking into this just now, and running this on my local machine I get the following matrix_identity:

             NC_002696:1         NC_010338:2   NC_011916:3         NC_014100:4
NC_002696:1  1.0                 0.7608992767  0.9999742381000001  0.6932093551
NC_010338:2  0.760899887         1.0           0.7609126078        0.7000758335
NC_011916:3  0.9999742381000001  0.7609120054  1.0                 0.6936734489
NC_014100:4  0.6934934418        0.6997481289  0.6939592582        1.0

Am I right to assume that you have run these comparisons under branch issue_421?

widdowquinn commented 8 months ago
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696        1.0   0.760899   0.999974   0.693209
NC_010338        1.0   1.000000   0.760913   0.700076
NC_011916        1.0   1.000000   1.000000   0.693673
NC_014100        1.0   1.000000   1.000000   1.000000
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696   1.000000   0.852072   0.999974   0.868745
NC_010338   0.852072   1.000000   0.852027   0.853775
NC_011916   0.999974   0.852027   1.000000   0.868795
NC_014100   0.868745   0.853775   0.868795   1.000000

I've run dnadiff locally on the NC_011916.fna vs NC_014100.fna comparison .filter file. This returns:

                               [REF]                [QRY]
[Sequences]
TotalSeqs                          1                    1
AlignedSeqs               1(100.00%)           1(100.00%)
UnalignedSeqs               0(0.00%)             0(0.00%)

[Bases]
TotalBases                   4042929              4655622
AlignedBases         2259662(55.89%)      2261394(48.57%)
UnalignedBases       1783267(44.11%)      2394228(51.43%)

[Alignments]
1-to-1                          1189                 1189
TotalLength                  2263684              2264006
AvgLength                    1903.86              1904.13
AvgIdentity                    86.88                86.88

M-to-M                          1189                 1189
TotalLength                  2263684              2264006
AvgLength                    1903.86              1904.13
AvgIdentity                    86.88                86.88
widdowquinn commented 8 months ago
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696        1.0   0.760899   0.999974   0.693209
NC_010338        1.0   1.000000   0.760913   0.700076
NC_011916        1.0   1.000000   1.000000   0.693673
NC_014100        1.0   1.000000   1.000000   1.000000
           NC_002696  NC_010338  NC_011916  NC_014100
NC_002696   1.000000   0.852072   0.999974   0.868745
NC_010338   0.852072   1.000000   0.852027   0.853775
NC_011916   0.999974   0.852027   1.000000   0.868795
NC_014100   0.868745   0.853775   0.868795   1.000000

The test results matrix is on top (expected values are underneath). The lower triangle there is all 1.0 as that's how the dataframe is initialised and - under the assumption of a symmetrical comparison - we didn't populate that half of the matrix. Correcting that should be straightforward.

I'm looking into this just now, and running this on my local machine I get the following matrix_identity:

             NC_002696:1         NC_010338:2   NC_011916:3         NC_014100:4
NC_002696:1  1.0                 0.7608992767  0.9999742381000001  0.6932093551
NC_010338:2  0.760899887         1.0           0.7609126078        0.7000758335
NC_011916:3  0.9999742381000001  0.7609120054  1.0                 0.6936734489
NC_014100:4  0.6934934418        0.6997481289  0.6939592582        1.0

Am I right to assume that you have run these comparisons under branch issue_421?

Yes - which branch should I be looking at?

kiepczi commented 8 months ago

Yes - which branch should I be looking at?

Branch Issue_421

widdowquinn commented 8 months ago

So we're running under the same branch but getting different results?

In both cases the calculated identities do not correspond to the expected identity, or to applying dnadiff to the same .delta format files. This needs to be addressed.

widdowquinn commented 8 months ago

Considering our discussion yesterday @kiepczi - I've updated the reference matrix for ANIm identity consistency tests in 97e5570. This now reflects (i) the new method for calculating identity, and (ii) running A vs B and B vs A (i.e. nonsymmetry).

A couple of things worth noting:

  1. The difference between the new and old methods was typically at the fourth decimal place, so we shouldn't expect much to change in terms of classifications.
  2. The results don't agree exactly with dnadiff from MUMmer, even though we have adopted their approach almost exactly. We suspect the remaining difference arises because dnadiff sums precalculated percentages for each alignment fragment in an .mcoords file, rounded to two decimal places (and so is subject to an accumulation of rounding error), whereas we maintain a single float while processing the .delta file directly (a more exact calculation). Even so, it would be a good idea to confirm that by running a similar calculation in which we do the rounding, and checking it against the dnadiff output (a small sketch of such a check follows below).
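A hedged sketch of that check (a hypothetical helper, not pyani code), using the two alignments from the viral-genome example worked through later in this thread:

```python
# Hypothetical check (not pyani code): compute the weighted average identity twice,
# once keeping exact floats and once rounding each alignment's %ID to two decimal
# places first, as dnadiff effectively does when reading pre-rounded .1coords values.
def average_identity(alignments, round_per_alignment=False):
    """alignments: iterable of (ref_aln_length, qry_aln_length, sim_errors) tuples."""
    weighted_sum = 0.0
    total_length = 0
    for rlen, qlen, sim_errs in alignments:
        pct_id = ((rlen + qlen) - 2 * sim_errs) / (rlen + qlen) * 100
        if round_per_alignment:
            pct_id = round(pct_id, 2)
        weighted_sum += (pct_id / 100) * (rlen + qlen)
        total_length += rlen + qlen
    return weighted_sum / total_length * 100

# The two alignments from the viral-genome example discussed later in this thread:
alignments = [(37629, 37636, 215), (21545, 21551, 7)]
print(average_identity(alignments))                            # ~99.6249 (exact)
print(average_identity(alignments, round_per_alignment=True))  # ~99.6266 (dnadiff-style rounding)
```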
(pyani_py311) lpritc@Rodan pyani % pytest -v tests/test_anim.py::test_deltadir_parsing                                                   [8:47:35]
=============================================================== test session starts ===============================================================
platform darwin -- Python 3.11.7, pytest-7.4.3, pluggy-1.3.0 -- /Users/lpritc/opt/anaconda3/envs/pyani_py311/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/lpritc/Documents/Development/GitHub/pyani
configfile: pytest.ini
plugins: cov-4.1.0, anyio-4.2.0
collected 1 item                                                                                                                                  

tests/test_anim.py::test_deltadir_parsing PASSED                                                                                            [100%]

================================================================ 1 passed in 1.30s ================================================================
kiepczi commented 8 months ago

I have updated more tests and fixtures in test_anim.py to reflect the changes made in the codebase (see da8852c). This included:

i) Adding more commands to the mummer_cmds_four fixture. We now run pairwise comparisons in both directions (forward and reverse). This adjustment allowed us to pass the test_mummer_multiple and test_mummer_job_generation tests successfully.

(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_mummer_multiple
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item                                                                                                                                          

tests/test_anim.py::test_mummer_multiple PASSED                                                                                                     [100%]

==================================================================== 1 passed in 0.12s ====================================================================
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % 
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_mummer_job_generation
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item                                                                                                                                          

tests/test_anim.py::test_mummer_job_generation PASSED                                                                                               [100%]

==================================================================== 1 passed in 0.13s ====================================================================
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % 

ii) Modification of the deltafile_parsed fixture. We now expect different values to be returned by the parse_delta function. This change reflects adjustments in the codebase: we no longer double-count overlapping regions, we calculate a weighted ANIm %ID, and we return additional items such as the query alignment length, subject alignment length, weighted %ID, and similarity errors. This allowed the test_deltafile_parsing test to pass.

(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % pytest -v tests/test_anim.py::test_deltafile_parsing    
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.8.18, pytest-8.1.1, pluggy-1.4.0 -- /opt/anaconda3/envs/pyani_issue_421/bin/python3.8
cachedir: .pytest_cache
rootdir: /Users/angelikakiepas/Desktop/pyani
configfile: pytest.ini
plugins: anyio-4.3.0, cov-4.1.0, ordering-0.6
collected 1 item                                                                                                                                          

tests/test_anim.py::test_deltafile_parsing PASSED                                                                                                   [100%]

==================================================================== 1 passed in 0.13s ====================================================================
(pyani_issue_421) angelikakiepas@Angelikas-MacBook-Pro pyani % 
kiepczi commented 8 months ago

Although the majority of tests pass, three are currently failing:

================================================================= short test summary info =================================================================
FAILED tests/test_anim.py::test_genome_sorting - AssertionError: assert ('nucmer --mu...first.filter') == ('nucmer --mu...econd.filter')
FAILED tests/test_parsing.py::test_anim_delta - AssertionError: assert (4016947, 401...7031752, 2191) == (4016947, 401...4447228, 2191)
FAILED tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani - TypeError: 'aln_length' is an invalid keyword argument for Comparison
======================================= 3 failed, 115 passed, 1 xfailed, 1 xpassed, 3 warnings in 389.17s (0:06:29) =======================================

Here are my current thoughts on them:

  1. The tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani test fails because there is no longer an aln_length keyword. This was expected, as we modified the codebase to run pairwise comparisons in both directions (forward and reverse). Because of this, we decided to change how the alignment lengths are calculated and reported: the aln_length keyword in the Comparison class was replaced with query_aln_length and subject_aln_length. I would need to investigate the fastANI approach a little more to come up with the best solution for this.

  2. The tests/test_anim.py::test_genome_sorting test is now failing because I did not take sorting into consideration when modifying the codebase to run both comparisons (forward and reverse). I think restoring sorting would be beneficial, as it would help us obtain consistent and reproducible results for future testing and easier debugging.

  3. The tests/test_parsing.py::test_anim_delta fails because of the rounding error, which, as @widdowquinn suggested, should be investigated a little bit more.

@widdowquinn, if you have any ideas or comments regarding this, let me know.

widdowquinn commented 8 months ago

2. The tests/test_anim.py::test_genome_sorting test is now failing because I did not take sorting into consideration when modifying the codebase to run both comparisons (forward and reverse). I think restoring sorting would be beneficial, as it would help us obtain consistent and reproducible results for future testing and easier debugging.

It would be worth looking back through the commit logs to be sure, but my memory is that sorting was introduced because…

  1. @peterjc brought in a separation of nucmer_output into subdirectories, for easier file management (a good thing)
  2. The original, unsorted, one-way anim comparison carried out one pairwise comparison which may have gone into A/A_vs_B or B/B_vs_A - somewhat unpredictably.
  3. We wanted a deterministic, predictable placing of those files, and sorting inputs was the way we decided to do it.

Now, with anim doing both A vs B and B vs A, I don't think we need to sort any more. The last commit I made above skips a sorting test in test_anim.py - you might like to use that as a template for how to handle the test_genome_sorting() test, here.
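For reference, a minimal sketch (hypothetical, simply following that template) of marking the test as skipped with pytest:

```python
import pytest

# Hypothetical sketch following the template mentioned above: mark the sorting test
# as skipped now that anim runs both A vs B and B vs A and no longer sorts its inputs.
@pytest.mark.skip(reason="anim now runs A vs B and B vs A; input sorting is no longer required")
def test_genome_sorting():
    ...
```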

widdowquinn commented 8 months ago

3. The tests/test_parsing.py::test_anim_delta fails because of the rounding error, which, as @widdowquinn suggested, should be investigated a little bit more.

Thing 1 is to satisfy ourselves that that's the origin of the discrepancy ;) If it is, then the job of implementing that part of pyani dnadiff is slightly easier than we at first thought…

widdowquinn commented 8 months ago
  1. The tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani test fails because there is no longer an aln_length keyword. This was expected, as we modified the codebase to run pairwise comparisons in both directions (forward and reverse). Because of this, we decided to change how the alignment lengths are calculated and reported: the aln_length keyword in the Comparison class was replaced with query_aln_length and subject_aln_length. I would need to investigate the fastANI approach a little more to come up with the best solution for this.

One solution here is - essentially - polymorphism. We would declare new classes that describe output from distinct tools (e.g. DnadiffResult, NucmerResult, FastANIResult), and have the ORM function process them differently. That's probably the most elegant and explicit way to do it, rather than try to overextend the function that's being called. We should talk about that in person and get something down in a document.

kiepczi commented 8 months ago
  3. The tests/test_parsing.py::test_anim_delta fails because of the rounding error, which, as @widdowquinn suggested, should be investigated a little bit more.

Thing 1 is to satisfy ourselves that that's the origin of the discrepancy ;) If it is, then the job of implementing that part of pyani dnadiff is slightly easier than we at first thought…

I looked into how dnadiff.pl calculates the Average %ID for 1-to-1 alignments.

To begin, it initialises the combined alignment length sum (rqSumLen1) and the weighted alignment identity sum (rqSumIdy1) to 0.

    my ($rqSumLen1, $rqSumLenM) = (0,0);    # combined alignment length sum
    my ($rqSumIdy1, $rqSumIdyM) = (0,0);    # weighted alignment identity sum

The weighted alignment identities are then accumulated by dividing each alignment's %ID by 100 and multiplying it by the sum of the reference and query alignment lengths (note that these values are taken from the .1coords file).

    $rqSumIdy1 += ($A[6] / 100.0) * ($A[4] + $A[5]);

The combined alignment length is accumulated as the sum of the reference and query alignment lengths:

$rqSumLen1 += ($A[4] + $A[5]);

The Average Identity is then calculated by dividing the weighted alignment identity sum by the combined alignment length sum and multiplying by 100.

($rqSumLen1 ? $rqSumIdy1 / $rqSumLen1 * 100.0 : 0);
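For comparison, a hedged Python transcription of that dnadiff.pl logic (not dnadiff itself; the file path in the usage note is illustrative), reading the same columns of a .1coords file:

```python
# Hedged transcription of the dnadiff.pl 1-to-1 Average %ID calculation described above.
# Columns 5 and 6 of a .1coords line are the reference and query alignment lengths
# ($A[4], $A[5]); column 7 is the per-alignment %ID ($A[6]), already rounded to 2 dp.
def dnadiff_avg_identity(onecoords_path):
    sum_len = 0.0  # combined alignment length sum   ($rqSumLen1)
    sum_idy = 0.0  # weighted alignment identity sum ($rqSumIdy1)
    with open(onecoords_path) as handle:
        for line in handle:
            fields = line.split()
            ref_len, qry_len, pct_id = int(fields[4]), int(fields[5]), float(fields[6])
            sum_idy += (pct_id / 100.0) * (ref_len + qry_len)
            sum_len += ref_len + qry_len
    return sum_idy / sum_len * 100.0 if sum_len else 0.0

# e.g. dnadiff_avg_identity("out.1coords") - for the viral example below this gives ~99.63
```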

I manually calculated this for a small comparison of two viral genomes, where according to dnadiff (1-to-1 alignments), the average %ID is 99.63:

[Alignments]
1-to-1                             2                    2
TotalLength                    59174                59187
AvgLength                   29587.00             29593.50
AvgIdentity                    99.63                99.63

The 1coords file:

85  37713   1   37636   37629   37636   99.43   39253   39594   95.86   95.05   MGV_MGV-GENOME-0264574  MGV_MGV-GENOME-0266457
17709   39253   17626   39176   21545   21551   99.97   39253   39594   54.89   54.43   MGV_MGV-GENOME-0264574  MGV_MGV-GENOME-0266457

Then, the calculations I get are:

alignment weighted 1 = (99.43 / 100) × (37629 + 37636) = 74835.9895
alignment weighted 2 = (99.97 / 100) × (21545 + 21551) = 43083.0712
Average %ID = (74835.9895 + 43083.0712) / (37629 + 37636 + 21545 + 21551) × 100 = 99.6266174669021 ≈ 99.63

I attempted to replicate this value using the delta files instead but, as we know, the values were slightly different, usually at the second decimal place. We discussed earlier that this could be due to rounding error in the %ID of each individual alignment given in the .1coords file.

So, let's calculate these values without rounding the individual sequence %IDs from the delta file. The file is as follows:

/Users/angelikakiepas/Desktop/pyani/issue_421/rounding_error/scripts/../data/donovan_test/input/MGV-GENOME-0264574.fna /Users/angelikakiepas/Desktop/pyani/issue_421/rounding_error/scripts/../data/donovan_test/input/MGV-GENOME-0266457.fna
NUCMER
>MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457 39253 39594
85 37713 1 37636 215 215 0
-3013
-24624
-1
-1
-1
-1
-1
0
17709 39253 17626 39176 7 7 0
-9994
-1
-1
-1
-1
-1
0

For the first alignment:

aligned reference bases = 37713 - 85 + 1 = 37629
aligned query bases = 37636 - 1 + 1 = 37636
similarity errors = 215
percentage id = ((37629 + 37636) - (2 × 215)) / (37629 + 37636) × 100 = 99.42868531189795
alignment weighted 1 = (99.42868531189795 / 100) × (37629 + 37636) = 74834.99999999999

For the second alignment:

aligned reference bases = 39253 - 17709 + 1 = 21545
aligned query bases = 39176 - 17626 + 1 = 21551
similarity errors = 7
percentage id = ((21545 + 21551) - (2 × 7)) / (21545 + 21551) × 100 = 99.96751438648599
alignment weighted 2 = (99.96751438648599 / 100) × (21545 + 21551) = 43082.0

Average %ID = (74834.99999999999 + 43082.0) / (37629 + 37636 + 21545 + 21551) * 100 = 99.62487643733999

@widdowquinn suggested that there could be a quicker way to calculate this by skipping the intermediate calculation of identity as follows:

alignment 1 identity weighted = (37629 + 37636) - (2 × 215) = 74835
alignment 2 identity weighted = (21545 + 21551) - (2 × 7) = 43082
Average %ID = (74835 + 43082) / (37629 + 37636 + 21545 + 21551) × 100 = 99.62487643734
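A minimal sketch (not pyani's parse_delta) applying that shortcut to the two alignment header lines of the .delta file shown above:

```python
# The two alignment header lines from the .delta file above:
#   "85 37713 1 37636 215 215 0" and "17709 39253 17626 39176 7 7 0"
# Fields: ref start, ref end, query start, query end, errors, similarity errors, stops.
headers = [(85, 37713, 1, 37636, 215), (17709, 39253, 17626, 39176, 7)]

weighted_sum = 0
total_length = 0
for ref_start, ref_end, qry_start, qry_end, sim_errs in headers:
    ref_len = ref_end - ref_start + 1  # aligned reference bases
    qry_len = qry_end - qry_start + 1  # aligned query bases
    weighted_sum += (ref_len + qry_len) - 2 * sim_errs
    total_length += ref_len + qry_len

print(weighted_sum / total_length * 100)  # 99.62487643734 (dnadiff reports the rounded 99.63)
```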

This example demonstrates that the differences in the reported numbers could arise from a rounding error, and it appears that our calculations would be more accurate.

I have written a Python script that calculates these values using both approaches (available here). I ran this on a few examples and found that in some cases, even when we calculate the Average %ID with the individual alignment %IDs rounded, the values differ from those reported by dnadiff.

When the values are calculated in Python, I obtain the following results:

The average %ID reported by dnadiff is 85.35%, so the difference is quite significant:

[Alignments]
1-to-1                          1455                 1455
TotalLength                  1679781              1679569
AvgLength                    1154.49              1154.34
AvgIdentity                    85.35                85.35

I checked whether the percentage identities calculated for each alignment match those reported in the .1coords file. To my surprise, I found that the majority do not.

One example where the percentage identity for the alignment does not match is:

1486    3045    124165  125722  1560    1558    88.49   16110   7787608 9.68    0.02    NZ_NOWW01000027.1   NZ_CP026730.1

Using the above information, I found the alignment in the delta file:

>NZ_NOWW01000027.1 NZ_CP026730.1 16110 7787608
1486 3045 124165 125722 180 180 0
8
-53
493
-4
743
-18
3
123
-6
33
0

When calculating the percentage identity manually, I found that these do not match. In this example, the difference is almost 0.035%. As this is happening in most of the alignments in this comparison, the differences add up to the point that the overall Average %ID differs significantly.

Reference alignment length = 3045 - 1486 + 1 = 1560
Query alignment length = 125722 - 124165 + 1 = 1558
Percentage ID = ((1560 + 1558) - (2 × 180)) / (1560 + 1558) × 100 = 88.45413726747915

kiepczi commented 8 months ago

I should probably mention that all code and data used for the investigation can be found here.

widdowquinn commented 8 months ago

After discussion between @kiepczi and me, and inspecting the code of show-coords, it appears that - in some cases at least - the graph model internal to show-coords is leading to a different measure of sequence identity than would be obtained from our (naive) reading of the .delta files.

We cannot then guarantee that our pyani-calculated percentage identities for any particular region exactly match those from dnadiff, or that they will match within simple rounding error (although in many cases they do).

At this point, my view is that the pyani approach of considering all alignments in the .delta file, and the dnadiff approach of (apparently) constructing a graph model from those alignments are not straightforward to reconcile.

The least time-consuming, and probably most useful, way forward is, I think, this:

  1. Provide the (corrected) pyani calculation method as the standard method for pyani anim
  2. Provide a second pyani dnadiff method that generates the show-coords and show-diff outputs using MUMmer's tools, allowing us to recreate the key parts of the dnadiff output.
  3. Using pyani compare, make a thorough comparison between the operation of both methods for the resulting publication.
widdowquinn commented 8 months ago

Adding the pyani dnadiff command should be reserved for pyani plus, as per #424.

widdowquinn commented 8 months ago

One solution here is - essentially - polymorphism. We would declare new classes that describe output from distinct tools (e.g. DnadiffResult, NucmerResult, FastANIResult), and have the ORM function process them differently. That's probably the most elegant and explicit way to do it, rather than try to overextend the function that's being called. We should talk about that in person and get something down in a document.

To expand on this… the error below

FAILED tests/test_subcmd_09_fastani.py::TestfastANISubcommand::test_fastani - TypeError: 'aln_length' is an invalid keyword argument for Comparison

arises because the keywords provided from a fastANI comparison are no longer the same as for an anim comparison. Commit 693d0c2 changed the requirement from aln_length to query_aln_length and subject_aln_length, though it is not clear that the underlying database schema changed (we need to check this and update to accommodate if needed!).

If we require only one of those two lengths for the ANI result, then we should prefer one over the other, and we ought to be able to provide it to the Comparison object through a consistent interface. For example…

class ResultfastANI:
    # [...]
    def __init__(self, aln_length):
        self.aln_length = aln_length  # total alignment length, as only one length makes sense

    @property
    def mylen(self):
        return self.aln_length


class ResultNucmer:
    # [...]
    def __init__(self, query_aln_length, ref_aln_length):
        self.query_aln_length = query_aln_length
        self.ref_aln_length = ref_aln_length

    @property
    def mylen(self):
        return self.query_aln_length  # because we are prioritising the _query_ alignment length by choice
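To make the intent concrete, a hedged sketch (the function name and the abbreviated Comparison kwargs here are illustrative, not the exact pyani_orm.py signatures) of how the ORM-facing code could then stay tool-agnostic:

```python
# Illustrative only: every Result* class exposes the same mylen property, so a single
# call site can populate the existing single-length column regardless of the tool.
def add_comparison_to_run(run, job, result):
    run.comparisons.append(
        Comparison(
            query=job.query,
            subject=job.subject,
            aln_length=result.mylen,  # whichever length the Result class prioritises
        )
    )
```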

But, if we require two lengths, then we will have to change the schema/reporting/ORM to accommodate this.

Which do we think we want?

widdowquinn commented 8 months ago

If we only require one alignment length per comparison for our reporting purposes, then the first solution is sufficient.

Having both results from a tool like nucmer, which reports query and reference alignment lengths (perhaps differently for A vs B and B vs A comparisons), might be interesting to some, but changing the internals is a large task, and would be better postponed to pyani-plus where it can be accommodated in a schema redesign.